MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

SESSION: Full Papers

TrackNetV3: Enhancing ShuttleCock Tracking with Augmentations and Trajectory Rectification

  • Yu-Jou Chen
  • Yu-Shuen Wang

We present TrackNetV3, a sophisticated model designed to enhance the precision of shuttlecock localization in broadcast badminton videos. TrackNetV3 is composed of two core modules: trajectory prediction and rectification. The trajectory prediction module leverages an estimated background as auxiliary data to locate the shuttlecock in spite of the fluctuating visual interferences. This module also incorporates mixup data augmentation to formulate complex scenarios to strengthen the network’s robustness. Given that a shuttlecock can occasionally be obstructed, we create repair masks by analyzing the predicted trajectory, subsequently rectifying the path via inpainting. This process significantly enhances the accuracy of tracking and the completeness of the trajectory. Our experimental results illustrate a substantial enhancement over previous standard methods, increasing the accuracy from 87.72% to 97.51%. These results validate the effectiveness of TrackNetV3 in progressing shuttlecock tracking within the context of badminton matches. We release the source code at
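
The mixup augmentation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard mixup formulation (a convex combination of two samples with a Beta-distributed weight) applied to both the input frames and their target heatmaps; the function name `mixup`, the `alpha` value, and the list-of-lists frame format are hypothetical and are not taken from the TrackNetV3 code.

```python
import random


def mixup(frame_a, heat_a, frame_b, heat_b, alpha=0.5, rng=random):
    """Blend two (frame, heatmap) pairs with one Beta(alpha, alpha) weight.

    Frames and heatmaps are 2D lists of floats with matching shapes.
    The alpha value is an illustrative assumption, not a setting from the paper.
    """
    lam = rng.betavariate(alpha, alpha)

    def blend(u, v):
        # Element-wise convex combination lam * u + (1 - lam) * v.
        return [[lam * p + (1.0 - lam) * q for p, q in zip(row_u, row_v)]
                for row_u, row_v in zip(u, v)]

    return blend(frame_a, frame_b), blend(heat_a, heat_b), lam


if __name__ == "__main__":
    frame, heat, lam = mixup([[0.0, 0.0]], [[0.0, 1.0]],
                             [[10.0, 20.0]], [[1.0, 0.0]])
    print(lam, frame, heat)
```

Using the same weight for the inputs and the targets keeps each blended heatmap consistent with its blended frame, which is what makes the mixed sample a harder but still valid training example.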

Personalized Federated Learning via Backbone Self-Distillation

  • Pengju Wang
  • Bochao Liu
  • Dan Zeng
  • Chenggang Yan
  • Shiming Ge

In practical scenarios, federated learning frequently necessitates training personalized models for each client using heterogeneous data. This paper proposes a backbone self-distillation approach to facilitate personalized federated learning. In this approach, each client trains its local model and only sends the backbone weights to the server. These weights are then aggregated to create a global backbone, which is returned to each client for updating. However, the client’s local backbone lacks personalization because of the common representation. To solve this problem, each client further performs backbone self-distillation by using the global backbone as a teacher and transferring knowledge to update the local backbone. This process involves learning two components: the shared backbone for common representation and the private head for local personalization, which enables effective global knowledge transfer. Extensive experiments and comparisons with 12 state-of-the-art approaches demonstrate the effectiveness of our approach.
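
The communication pattern described above (clients send only backbone weights, the server averages them, and each client keeps its private head) can be sketched as follows. This is a simplified illustration under stated assumptions: parameters are plain lists of floats, backbone parameters are identified by a hypothetical `backbone.` name prefix, and aggregation is an unweighted mean. The self-distillation step itself is not shown.

```python
def aggregate_backbones(client_models):
    """Average only the backbone parameters across clients (unweighted mean).

    Each model is a dict mapping parameter names to lists of floats.
    The 'backbone.' prefix convention is an assumption made for this sketch.
    """
    keys = [k for k in client_models[0] if k.startswith("backbone.")]
    n = len(client_models)
    return {
        k: [sum(m[k][i] for m in client_models) / n
            for i in range(len(client_models[0][k]))]
        for k in keys
    }


def receive_global_backbone(local_model, global_backbone):
    """Return a copy of the local model with the backbone replaced.

    Head parameters are left untouched, so personalization stays on the client.
    """
    updated = dict(local_model)
    updated.update(global_backbone)
    return updated


if __name__ == "__main__":
    clients = [
        {"backbone.w": [1.0, 3.0], "head.w": [0.1]},
        {"backbone.w": [3.0, 5.0], "head.w": [0.9]},
    ]
    shared = aggregate_backbones(clients)
    print(receive_global_backbone(clients[0], shared))
```

In the approach the abstract describes, the global backbone would then act as a teacher for updating the local backbone rather than simply overwriting it; the overwrite here only marks where that step would plug in.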

Lambda-Domain Rate Control for Neural Image Compression

  • Naifu Xue
  • Yuan Zhang

Rate control based on rate-distortion modeling is a classic problem in lossy image compression. Despite extensive research in neural image compression, its rate control remains understudied. In this paper, we introduce a variable rate neural image compression scheme that supports precise rate control with one-pass encoding. Our approach utilizes the Lagrangian multiplier method to transform rate control into an unconstrained optimization problem, mapping the target bitrate to λ for rate-distortion trade-off adjustment. We propose an improved exponential R-λ model and estimate the bitrates with a hybrid convolution-transformer network for model fitting. The encoder is controlled by λ, and a multi-layer modulation mechanism ensures variable rate ability. In our experiments, the proposed method outperforms the intra-frame coding of Versatile Video Coding (VVC). Meanwhile, the average rate control error is less than 5.1%, while maintaining almost identical rate-distortion performance and acceptable complexity.

History-Detr: Optimize Query Initialization Strategy by Using Historical Information and Kinematics

  • Weijie Luo
  • Zihao Liu
  • Guohao Dai
  • Ningyi Xu

Recent 3D object detectors leverage multi-frame data, including past and future data, to enhance performance. However, the temporal data fusion methods they employ do not fully exploit this potential. Existing works that use multi-frame data fuse only specific features according to ego-motion and cannot be directly applied to long sequences due to the huge computation and memory cost. We find that present methods do not efficiently exploit history information, including history predictions and object motion. Building on our investigations, we present a novel hybrid query formulation composed of history queries and original queries. The history queries consist of inferred position and content queries obtained from the historical predictions and features, which take into account the motion of all objects in the current scene. Moreover, our method can be easily applied to other DETR-like models to boost performance without introducing huge computation and memory cost. As a result, History-DETR achieves a remarkable improvement (+1.1% NDS) with a negligible increase in inference time.

A Multi-scale and Dense Object Detector for Tibetan Thangka Images

  • Gaohuan Dong
  • Qing Xie
  • Jiachen Li
  • Yanchun Ma
  • Yuhan Liu
  • Yongjian Liu

Thangka cultural elements detection aims to locate and identify instances in Thangka. However, as a unique form of pictorial art, Thangka exhibits distinct spatial structures that deviate significantly from general images in scale and density. Therefore, it is challenging for most state-of-the-art detectors designed for natural scenes to handle Thangka cultural elements detection effectively. To overcome this issue, we propose a multi-scale and dense object detector referred to as MDDet. It embeds a multi-scale receptive field fusion module (MRF) that enlarges the receptive field while capturing the spatial and channel relationships at different scales, which significantly enriches the multi-scale features extracted from the backbone. In addition, we introduce a threshold-slicing aided hyper inference (T-SAHI) scheme, which adaptively slices images in dense scenarios to aid dense object detection at test time. We thoroughly evaluate our method, and MDDet outperforms the prior art by a clear margin on the Thangka dataset, achieving an absolute improvement of 1.9% in average precision (AP). For the challenging medium and small objects in Thangka, MDDet obtains wide margins of 12% and 3.7% in accuracy improvement, respectively. It also shows strong generalization ability when evaluated on general scenarios, e.g., Pascal VOC 2007 and MS COCO, validating the role of MDDet in object detection.

A Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection

  • Yidan Fan
  • Yongxin Yu
  • Wenhuan Lu
  • Yahong Han

Multimodal learning using audio and visual information has improved violence detection. However, previous studies overlook the gap between pre-trained networks and the final violence detection task, as well as the semantic inconsistency between audio and visual features. We consider task-irrelevant information caused by the former situation and semantic noise due to the latter as redundancy, negatively affecting overall detection performance. Besides, the prevailing visual modality-centric approach with audio features as guidance may be biased. We contend that both modalities are crucial in violence detection. To address these issues, we propose a Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection. Our framework integrates a relation-aware module with a bi-directional cross-modal attention mechanism to explore interactions between modalities. Then, we introduce a feature filter gate to reduce redundancy. Finally, a multi-branch classification module is proposed for better utilization of both modalities. Extensive experiments demonstrate the effectiveness of our approach, surpassing previous methods with state-of-the-art performance in violence detection.

From Pixels to Explanations: Uncovering the Reasoning Process in Visual Question Answering

  • Siqi Zhang
  • Jing Liu
  • Zhihua Wei

Visual reasoning requires models to construct a reasoning process towards the final decision. Previous studies have used attention maps or textual explanations to illustrate the reasoning process, but both have their limitations. Attention maps can be difficult to read, while textual explanations cannot fully describe the process of reasoning, and both are hard to evaluate quantitatively. This paper proposes a novel pixel-to-explanation reasoning model that employs a user-friendly multimodal rationale to depict the reasoning process. The model dissects the question into subquestions, and constructs reasoning cells to retrieve knowledge from the image and question based on these subquestions. The intermediate outcomes from the reasoning cells are translated into object bounding boxes and classes, with the final output being classified as a standard VQA answer and translated into a complete answer to summarize the entire reasoning process. All the generated results can be combined to produce a human-readable and informative explanation that can be evaluated quantitatively. Besides the interpretability, we achieved a 4.4% improvement over our baseline model on the GQA dataset and attained new state-of-the-art results on the challenging GQA-OOD dataset.

Global-Local GraphFormer: Towards Better Understanding of User Intentions in Sequential Recommendation

  • Hong Chen
  • Bin Huang
  • Xin Wang
  • Yuwei Zhou
  • Wenwu Zhu

Transformer-based models have achieved great success in the multimedia sequential recommendation task due to their strong ability to handle sequential data. However, existing Transformer-based models regard the items in the sequential data as a user-specific fully-connected graph (local graph) and only explicitly consider the temporal information in the local graph to capture the users’ intentions, ignoring the fact that the user-item bipartite graph (global graph) may carry important relation patterns for the sequential items. Additionally, according to the current literature, it is still unclear whether (and how) the information hidden in the global graph can help Transformer-based models better understand the users’ sequential behavior. To investigate this important problem, we propose to utilize the global graph information to help Transformer-based sequential recommendation, where the information from different modalities, i.e., user-item interactions in the global graph and the temporal patterns in the historical sequences, is taken into account jointly. Concretely, we propose two Global-Local (GL) GraphFormer models for utilizing both the global graph and local temporal information. One GL-GraphFormer equips the Transformer-based model with both first- and second-order graph information through two specifically designed encodings. The other GL-GraphFormer transfers higher-order graph information into the local Transformer with pretrained Graph Neural Networks (GNNs). Extensive experiments on several real-world datasets demonstrate that i) our proposed GL-GraphFormers can bring substantial improvement over baseline methods, and ii) the benefits of different orders of global graph information vary with the dataset sparsity.

Guided Spatio-Temporal Learning Method for 4K Video Super-Resolution

  • Qin Jiang
  • Qinglin Wang
  • Jie Liu

4K Video Super-Resolution (VSR) presents a challenging task in video processing, as most existing VSR models have high computational complexity, limiting their application to high-resolution videos, particularly for 4K resolution videos. To address this issue, we propose a novel Guided Spatio-Temporal Video Super-Resolution network (GST-VSR) designed to perform 4K VSR on a single GPU. The proposed method comprises two key components: the Spatio-Temporal Alignment Network (STAN) and the Super-resolution Reconstruction Network (SRN), which work together to enhance the quality of the output frames. The STAN is responsible for extracting highly relevant features in frames and aligning the reference frame with the neighboring frames at the feature level to maintain temporal consistency. The SRN fuses high-quality features into the final high-resolution frames. Unlike existing methods, our proposed approach does not require explicit optical flow estimation, making it more efficient and less computationally demanding. To facilitate the training and testing of the compared models, we have established a new dataset, Pixabay-Set, consisting of 145 videos suitable for the 4K VSR task. Experimental results on the test dataset show that the proposed method achieves competitive performance compared to state-of-the-art models. In summary, our proposed GST-VSR network provides an effective solution to the challenging task of 4K VSR.

NeRF-IS: Explicit Neural Radiance Fields in Semantic Space

  • Jiansong Sha
  • Haoyu Zhang
  • Yuchen Pan
  • Guang Kou
  • Xiaodong Yi

Implicit Neural Radiance Field (NeRF) techniques have been widely applied and shown promising results for scene decomposition learning and rendering. Existing methods typically require encoding spatial and semantic coordinates separately, followed by deep neural networks (MLPs) to obtain representations of the entire scene and individual objects respectively. However, these implicit neural field methods mix scene data and differentiable rendering together, which results in expensive computation, low interpretability, and limited scalability. In this article, we propose NeRF-IS (Explicit Neural Radiance Fields in Semantic Space), a novel 4D neural radiance field architecture that integrates 3D space and semantic space modeling and can perform both scene-level and object-level modeling. Specifically, we design a hybrid method of explicit spatial modeling and implicit feature representation, which enhances the model’s ability in scene semantic editing and realistic rendering. For efficient training of NeRF-IS, we apply low-rank tensor decomposition to compress the model and speed up the training. We also introduce an importance sampling algorithm that uses a volume density prediction network to provide more accurate samples for the whole system with a coarse-to-fine strategy. Extensive experiments demonstrate that our system not only achieves competitive performance for scene-level representation and rendering of static scenes, but also enables object-level rendering and editing.

NeRF-SDP: Efficient Generalizable Neural Radiance Field with Scene Depth Perception

  • Qiuwen Wang
  • Shuai Guo
  • Haoning Wu
  • Rong Xie
  • Li Song
  • Wenjun Zhang

In recent years, neural radiance fields have exhibited impressive performance in novel view synthesis. However, exploiting complex network structures to achieve generalizable NeRF usually results in inefficient rendering. Existing methods for accelerating rendering directly employ simpler inference networks or fewer sampling points, leading to unsatisfactory synthesis quality. To address the challenge of balancing rendering speed and quality in generalizable NeRF, we propose a novel framework, NeRF-SDP, which achieves both efficiency and high fidelity by introducing scene depth perception. We incorporate more scene information into the radiance field by using our proposed geometry feature extraction and depth-encoded ray transformer to improve the model’s inference capabilities with sparse points. With the aid of scene depth perception, NeRF-SDP can better understand the scene’s structure, thus better reconstructing the objects’ edges with significantly fewer artifacts. Experimental results demonstrate that NeRF-SDP achieves comparable synthesis quality to state-of-the-art methods while significantly improving rendering efficiency. Furthermore, ablation studies confirm that the depth-encoded ray transformer enhances the model’s robustness to varying numbers of sampling points.

Adaptive Fusion for Visual Question Answering: Integrating Multi-Label Classification and Similarity Matching

  • Zhengtao Yu
  • Jia Zhao
  • Huiling Wang
  • Chenliang Guo
  • Tong Zhou
  • Chongxiang Sun

Visual Question Answering (VQA) is an important multimodal task in which models are required to answer questions based on visual cues. However, most visual question-answering models suffer from the language prior problem, which is caused by data bias. Specifically, VQA models tend to output high-frequency answers to answer questions while ignoring the information contained in the images. Many approaches have emerged to solve the language prior problem. However, previous approaches could only improve the performance of easy classes, and there is no means to solve hard classes effectively. In this paper, we will utilize more semantic information to guide the model for learning and better handle hard questions. Specifically, in addition to the classification task, we map the image question pairs and the answers to the same dimensional space and construct a similarity metric between the two to get the answers’ similarity-matching output. Moreover, we learn a set of parameters to fuse the classification output with the answers’ similarity-matching output, and finally, we use the fused output for prediction. We use answer weighting for each output to mitigate the language priors in computing the loss function. Moreover, we use answer masks for the classification outputs. Experimental results demonstrate the effectiveness of our method, which achieves a state-of-the-art performance of 62.20% on VQA-CP v2.

Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach

  • Dongyang Yu
  • Yunshi Xie
  • Wangpeng An
  • Zhang Li
  • Yufeng Yao

We introduce a novel one-stage end-to-end multi-person 2D pose estimation algorithm, known as Joint Coordinate Regression and Association (JCRA), that produces human pose joints and associations without requiring any post-processing. The proposed algorithm is fast, accurate, effective, and simple. The one-stage end-to-end network architecture significantly improves the inference speed of JCRA. Meanwhile, we devise a symmetric network structure for both the encoder and decoder, which ensures high accuracy in identifying keypoints. It follows an architecture that directly outputs part positions via a transformer network, resulting in a significant improvement in performance. Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency. Moreover, JCRA achieves 69.2 mAP and is 78% faster at inference than previous state-of-the-art bottom-up algorithms.

Exploring Feature Fusion from A Contrastive Multi-Modality Learner for Liver Cancer Diagnosis

  • Yang Fan Chiang
  • Pei-Xuan Li
  • Ding-You Wu
  • Hsun-Ping Hsieh
  • Ching-Chung Ko

Self-supervised contrastive learning has achieved promising results in computer vision, and it has recently also received attention in the medical domain. In practice, medical data is hard to collect and even harder to annotate, but leveraging multi-modality medical images to make up for small datasets has proved to be helpful. In this work, we focus on mining multi-modality Magnetic Resonance (MR) images to learn multi-modality contrastive representations. We first present multi-modality data augmentation (MDA) to adapt contrastive learning to multi-modality learning. Then, the proposed cross-modality group convolution (CGC) is used for multi-modality features in the downstream fine-tuning task. Specifically, in the pre-training stage, considering the different behaviors of each MRI modality with the same anatomic structure, yet without designing a handcrafted pretext task, we select two augmented MR images from a patient as a positive pair, and then directly maximize the similarity between positive pairs using Simple Siamese networks. To further exploit multi-modality representation, we combine 3D and 2D group convolution with a channel shuffle operation to efficiently incorporate different modalities of image features. We evaluate our proposed methods on liver MR images collected from a well-known hospital in Taiwan. Experiments show that our framework improves significantly over previous methods.
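
The positive-pair objective of Simple Siamese (SimSiam) networks referred to above can be written down compactly. The sketch below assumes the standard symmetric negative-cosine-similarity loss of SimSiam, where `p1`, `p2` are predictor outputs and `z1`, `z2` are encoder projections of the two augmented views; plain Python lists stand in for tensors, and the abstract does not state whether the paper modifies this loss.

```python
import math


def neg_cosine(p, z):
    """Negative cosine similarity between two non-zero vectors.

    In an autodiff framework, z would be treated as a constant
    (stop-gradient), which is the key ingredient of SimSiam.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm


def simsiam_loss(p1, z1, p2, z2):
    """Symmetric loss over the two views of one positive pair."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)


if __name__ == "__main__":
    # Perfectly aligned views reach the minimum of the loss.
    print(simsiam_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))
```

Minimizing this quantity maximizes the similarity between the two augmented views, so no negative pairs are needed.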

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

  • Xinshun Wang
  • Qiongjie Cui
  • Chen Chen
  • Shen Zhao
  • Mengyuan Liu

Existing Graph Convolutional Networks for human motion prediction largely adopt a one-step scheme, which outputs the prediction directly from the history input and fails to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets representative of each transition. Each snippet can be reconstructed from its starting and ending poses, referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks that are easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we propose to first predict only the transitional poses. Then we use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally, we refine them to produce the final prediction output. To implement the network, we propose a novel unified graph modeling, which allows for direct and effective feature propagation compared to existing approaches that rely on separate space-time modeling. Extensive experiments on the Human3.6M, CMU Mocap, and 3DPW datasets verify the effectiveness of our method, which achieves state-of-the-art performance.

Semantic-Aware Dynamic Feature Selection and Fusion for Object Detection in UAV Videos

  • Jianping Zhong
  • Zhaobo Qi
  • Weigang Zhang
  • Qingming Huang

Keypoint-based detectors perform well in surveillance videos but face challenges in detecting objects in UAV videos due to missed corners and mismatches. To address this, we propose a semantic-aware module with a feature fusion sub-module and a feature selection sub-module. The feature fusion module adaptively combines low-level and high-level features, enhancing corner recall. The feature selection module determines spatial location importance, improving discriminative capabilities and reducing background interference, resulting in better precision. Experiments on the UAVDT benchmark show our method achieves competitive results. Notably, our method improves corner recall by 4.0% and reduces the mismatch rate by 2.9% compared to the baseline. Code is available at

Graph-Guided MLP-Mixer for Skeleton-Based Human Motion Prediction

  • Xinshun Wang
  • Qiongjie Cui
  • Chen Chen
  • Shen Zhao
  • Mengyuan Liu

In recent years, Graph Convolutional Networks (GCNs) have been widely used in human motion prediction, but their performance remains unsatisfactory. Recently, MLP-Mixer, initially developed for vision tasks, has been adapted to human motion prediction as a promising alternative to GCNs, achieving both better performance and better efficiency. Unlike GCNs, which can explicitly capture the bone-joint structure of the human skeleton by representing it as a graph with edges and nodes, MLP-Mixer relies on fully connected layers and thus cannot explicitly model this graph-like structure. To overcome this limitation, we propose Graph-Guided Mixer, a novel approach that equips the original MLP-Mixer architecture with the capability to model graph structure. By incorporating graph guidance, our Graph-Guided Mixer can effectively capture and utilize the specific connectivity patterns within the graph representation of the human skeleton. In this paper, we first uncover a theoretical connection between MLP-Mixer and GCN that is unexplored in existing research. Building on this connection, we then present Graph-Guided Mixer and explain how the original MLP-Mixer architecture is reinvented to incorporate guidance from graph structure. Finally, an extensive evaluation on the Human3.6M, AMASS, and 3DPW datasets shows that our method achieves state-of-the-art performance.

Self-supervised anomaly detection of medical images based on dual-module discrepancy

  • Yuqing Song
  • Jinyong Cheng

Medical image anomaly detection plays a very important role in modern health care, helping to improve the quality and efficiency of medical services and promote the development of human health. Annotating anomalous images is costly, and most existing methods do not fully utilize information from unlabeled images. We therefore propose a new reconstruction network and loss function that can better utilize unlabeled and normal images for anomaly identification. The framework used in this paper consists of two modules, each consisting of three reconstruction networks with the same architecture but different inputs. One module is trained only on normal images and is called the normal module (NM). The other module is trained on both normal images and unlabeled images, and is called the unknown module (UM). Furthermore, the internal differences of the normal module and the differences between the two modules are used as two powerful anomaly scores, and these two anomaly scores are refined to indicate anomalies. Experiments on four medical datasets show the state-of-the-art performance of the proposed approach.

Cross-modal Image-Recipe Retrieval via Multimodal Fusion

  • Lijie Li
  • Caiyue Hu
  • Haitao Zhang
  • Akshita Maradapu Vera Venkata sai

Cross-modal image-recipe retrieval aims to capture the correlation between food images and recipes. While existing methods have demonstrated good performance on retrieval tasks, they often overlook two crucial aspects: (1) the capture of fine-grained recipe information and (2) the consideration of correlations between embeddings from different modalities. We introduce the Multimodal Fusion Retrieval Framework (MFRF) to address these issues. The proposed framework utilizes a deep learning-based encoder to process recipe and image data effectively, incorporates a fusion network to learn cross-modal semantic alignment, and ultimately achieves image-recipe retrieval. MFRF comprises three integral modules. The recipe preprocessing module utilizes various levels of Transformer to extract essential features such as the title and ingredients from the recipe. Additionally, it employs LSTM based on BERT to establish contextual relationships and dependencies among sentences in the recipe instructions. The multimodal fusion module incorporates visual-linguistic contrastive losses to align the representations of both images and recipes. Moreover, it leverages cross-modal attention mechanisms to facilitate effective interaction between the two modalities. Lastly, the cross-modal retrieval module employs a triplet loss function to enable cross-modal retrieval of image-recipe pairs. Experimental evaluations conducted on the widely used Recipe1M benchmark dataset demonstrate the effectiveness of the proposed MFRF, achieving substantial performance improvements on both the 1k and 10k test sets. Specifically, the results indicate an increase of +9.9% (64.8 R@1) and +8.4% (33.7 R@1), respectively.
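
The margin-based ranking loss used by the retrieval module above can be illustrated with a generic triplet loss. This is a sketch under assumptions: the anchor is an image embedding, the positive is its matching recipe embedding, the negative is a non-matching recipe embedding, distances are Euclidean, and the `margin` value is hypothetical. The abstract does not specify the distance or margin used in MFRF.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss that pushes the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    image = [0.0, 0.0]
    matching_recipe = [0.0, 1.0]
    other_recipe = [3.0, 4.0]
    # The matching recipe is already much closer, so the loss vanishes.
    print(triplet_loss(image, matching_recipe, other_recipe))
```

The loss is zero once the matching pair is closer than the non-matching pair by at least the margin, so training effort concentrates on triplets that are still ranked incorrectly or too narrowly.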

Class-aware Convolution and Attentive Aggregation for Image Classification

  • Zitan Chen
  • Zhuang Qi
  • Xiangxian Li
  • Yuqing Wang
  • Lei Meng
  • Xiangxu Meng

Deep learning has been proven to be effective in image classification tasks. However, existing methods may face difficulties in distinguishing complex images due to the distraction caused by diverse image content. To overcome this challenge, we propose a class-aware convolution and attentive aggregation (CA-Net) framework that improves the effectiveness of representation learning and reduces the influence of irrelevant background. CA-Net includes three main modules: the discrete representation learning (DRL) module that uses a group learning method to learn discriminative representations, the class-aware score of discrete representation (CSDR) module that infers class-aware scores to generate weights for representation learners, and the class-aware representation fusion (CRF) module that aggregates class-aware representations using the class-aware scores as a guide. Our experimental results on three benchmark datasets show that CA-Net improves the performance of state-of-the-art backbones and enhances feature extraction robustness.

Relevance and Irrelevance Considered Subspace Mapping Neural Networks for Remote Sensing Text-Image Retrieval

  • Xiu Li
  • Chengyu Zheng
  • Jie Nie
  • Ruoyu Zhang
  • Xinyue Liang
  • Zhiqiang Wei

Remote sensing cross-modal image-text retrieval has attracted increasing attention due to its important roles in multiple domains. Existing methods perform salient modeling for features that have high relevance between different modalities. However, most works consider only the relevance between modalities and ignore the irrelevance, resulting in incomplete modeling of the two. In this paper, we propose Relevance and Irrelevance Considered Subspace Mapping Neural Networks (RIR-SMNNs) to simultaneously consider the relevance and irrelevance between different modalities. Specifically, we first utilize Multiscale Image Feature Extraction (MIFE) and Multiscale Text Feature Extraction (MTFE) to extract multiscale features of the image and text. Then, we apply the Local Space Building (LSB) module, which constructs local spaces that realize scale alignment. Finally, we perform the Relevance and Irrelevance Local Space Mapping (RIRLSM) to consider the relevance and irrelevance of different modalities in multiple spaces. Experimental results on several remote sensing datasets demonstrate that our model outperforms the state-of-the-art approaches.

NuclSeg: nuclei segmentation using semi-supervised stain deconvolution

  • Haixin Wang
  • Jian Yang
  • Ryohei Katayama
  • Michiya Matsusaki
  • Tomoyuki Miyao
  • Jinjia Zhou

Recently, deep learning-inferred stain deconvolution/separation-based nuclei segmentation works demonstrated significant results by translating low-cost and prevalent immunohistochemical (IHC) slides to more expensive-yet-informative multiplex immunofluorescence (mpIF) images. However, when the input stain style is changed to Hematoxylin and Eosin stain (H&E), which is one of the principal tissue stains used in histology, the stain deconvolution/separation-based works cannot achieve satisfactory performance because the features of the input stain image are greatly changed. To solve this problem, we integrate stain transfer (H&E->IHC) before stain deconvolution (IHC->mpIF) to revise the image style. Moreover, a new semi-supervised learning strategy that combines supervised and unsupervised learning processes is employed to diversify training data content for stain deconvolution. First, in the supervised learning process, generative adversarial network (GAN) based image-to-image mapping (stain deconvolution) and inverse mapping models are trained on the dataset of co-registered IHC staining and mpIF staining of the same slides to respectively convert IHC to mpIF (I2m), and mpIF to IHC (m2I). After that, in the unsupervised learning process, high-quality IHC images from the m2I model are selected according to the Discriminator score on another unpaired dataset, and these IHC images are then paired with the input mpIF as the unsupervised training data for further improving the I2m model. This semi-supervised scheme balances the supervised and unsupervised errors while optimizing, to limit the effect of imperfect pseudo inputs but still enhance stain deconvolution. Furthermore, image enhancement is applied after the stain deconvolution model to obtain high-quality segmentation masks. We thoroughly evaluate our method on publicly available benchmark datasets. The results show the proposed model obtains significant improvement compared to the SOTAs.

Block based Adaptive Compressive Sensing with Sampling Rate Control

  • Kosuke Iwama
  • Ryugo Morita
  • Jinjia Zhou

Compressive sensing (CS), acquiring and reconstructing signals below the Nyquist rate, has great potential in image and video acquisition to exploit data redundancy and greatly reduce the amount of sampled data. To further reduce the sampled data while keeping the video quality, this paper explores the temporal redundancy in video CS and proposes a block based adaptive compressive sensing framework with a sampling rate (SR) control strategy. To avoid redundant compression of non-moving regions, we first incorporate moving block detection between consecutive frames, and only transmit the measurements of moving blocks. The non-moving regions are reconstructed from the previous frame. In addition, we propose a block storage system and a dynamic threshold to achieve adaptive SR allocation to each frame based on the area of moving regions and target SR for controlling the average SR within the target SR. Finally, to reduce blocking artifacts and improve reconstruction quality, we adopt a cooperative reconstruction of the moving and non-moving blocks by referring to the measurements of the non-moving blocks from the previous frame. Extensive experiments have demonstrated that this work is able to control SR and obtain better performance than existing works.
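
Moving block detection between consecutive frames can be illustrated with a small sketch. It assumes a mean-absolute-difference criterion with a fixed threshold; the abstract describes a dynamic threshold and a block storage system, which are not modeled here, and the names `moving_blocks`, `block`, and `threshold` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]


def moving_blocks(
    prev: Frame, curr: Frame, block: int = 2, threshold: float = 5.0
) -> List[Tuple[int, int]]:
    """Return (block_row, block_col) indices whose mean absolute difference exceeds threshold."""
    height, width = len(curr), len(curr[0])
    moving = []
    for by in range(0, height, block):
        for bx in range(0, width, block):
            diffs = [
                abs(curr[y][x] - prev[y][x])
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ]
            if sum(diffs) / len(diffs) > threshold:
                moving.append((by // block, bx // block))
    return moving


# 4x4 toy frames: only the bottom-right 2x2 block changes.
prev_frame = [[10.0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][2] = 90.0
curr_frame[3][3] = 90.0

detected = moving_blocks(prev_frame, curr_frame)
total_blocks = (4 // 2) * (4 // 2)
# Fraction of blocks whose measurements would need to be transmitted.
sampling_fraction = len(detected) / total_blocks
print(detected, sampling_fraction)
```

In this toy case only one of four blocks is flagged, so measurements for a quarter of the frame would be sent and the rest would be reconstructed from the previous frame.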

Adapting Hierarchical Transformer for Scene-Level Sketch-Based Image Retrieval

  • Jie Yang
  • Aihua Ke
  • Bo Cai

Sketch-based image retrieval (SBIR) is an essential application of sketches. Research on object-level SBIR is relatively mature, but the study of more complex scene-level SBIR is still in its early stages. In order to advance this research, we investigate previous works and identify two main shortcomings: (1) insufficient utilization of multi-scale features from sketches and images, and (2) lack of effective modules to eliminate the substantial domain gap between them. To address these issues, we propose SketchRetriever, a hierarchical Transformer-based scene-level SBIR model. In our model, the hierarchical Transformer and compressors are capable of efficiently capturing feature maps at various granularities and compressing them into corresponding feature vectors, and the modality-specific Adapters can project the feature embeddings of sketches and images into the same feature space, thereby closing the domain gap between them. We adopt the adapter-tuning strategy, which not only considerably reduces the number of tunable parameters but also effectively avoids overfitting. Extensive experiments demonstrate that SketchRetriever significantly outperforms state-of-the-art methods on two benchmark datasets with lower fine-tuning overhead.

Domain-Adaptive Mean Teacher for Category-Level Object Pose Estimation

  • I-Ju Hsieh
  • Yo-Chung Lau
  • Peng-Yuan Kao
  • Shih-Ping Hung
  • Yi-Ping Hung

Category-level object pose estimation aims at predicting 6-DoF object poses for previously unseen objects. Current methods mostly rely on ground-truth labels such as object poses and CAD models. However, annotating these labels manually is time-consuming and error-prone in real-world scenarios. Hence, we propose a novel method to solve unsupervised domain adaptation (UDA) for category-level object pose estimation. We adopt a teacher-student framework to utilize both labeled synthetic data and unlabeled real-world data. The student and the teacher are trained to make consistent predictions under different perturbations. Furthermore, we introduce domain adversarial training to bridge the domain gap between synthetic and real-world data. To prevent false feature alignment between domains, we adopt multiple discriminators instead of a single one and perform category-aware alignments. Extensive experiments show that our method achieves state-of-the-art performance on the REAL275 dataset. Through ablation studies, we also demonstrate that our method is not restricted to a certain network architecture and can serve as a general UDA method for category-level object pose estimation.
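
In the usual mean teacher setup, the teacher's weights are an exponential moving average (EMA) of the student's weights. The abstract does not spell out the update rule, so the sketch below is an assumption based on the "Mean Teacher" name in the title; `ema_update`, the parameter dictionary, and the decay value are hypothetical.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float], student: Dict[str, float], decay: float = 0.99
) -> Dict[str, float]:
    """Return teacher weights moved toward the student weights by an EMA step."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy scalar "weights" standing in for full network parameters.
teacher_weights = {"w": 1.0, "b": 0.0}
student_weights = {"w": 2.0, "b": 1.0}
teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
print(teacher_weights)
```

A decay close to 1 makes the teacher change slowly, which is what lets it provide stable targets for the consistency loss on unlabeled real-world data.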

Feature Adaptation with CLIP for Few-shot Classification

  • Guangxing Wu
  • Junxi Chen
  • Wentao Zhang
  • Ruixuan Wang

Large Vision-Language models such as CLIP have demonstrated impressive capabilities in zero-shot recognition. To apply CLIP to few-shot classification tasks, several methods have been proposed based on CLIP, achieving significant improvements. However, these methods either insufficiently leverage CLIP’s prior knowledge during training or neglect the impact of feature adaptation. In this paper, we propose FAR, a novel approach that balances distribution-altered Feature Adaptation with pRior knowledge of CLIP to further improve the performance of CLIP in few-shot classification tasks. Firstly, we introduce an adapter that enhances the effectiveness of CLIP adaptation by amplifying the differences between the fine-tuned CLIP features and the original CLIP features. Secondly, we leverage the prior knowledge of CLIP to mitigate the risk of overfitting. Through this framework, a good trade-off between feature adaptation and preserving prior knowledge is achieved, enabling effective utilization of both components to enhance performance on downstream tasks. We evaluate our method on over 10 datasets for classification, and our approach consistently outperforms existing methods, demonstrating its effectiveness and robustness.

Cross-modal Consistency Learning with Fine-grained Fusion Network for Multimodal Fake News Detection

  • Jun Li
  • Yi Bin
  • Jie Zou
  • Jiwei Wei
  • Guoqing Wang
  • Yang Yang

Previous studies on multimodal fake news detection have observed the mismatch between text and images in fake news and attempted to explore the consistency of multimodal news based on global features of different modalities. However, they fail to investigate this relationship between fine-grained fragments in multimodal content. To gain public trust, fake news often includes relevant parts in the text and the image, making such multimodal content appear consistent. Using global features may suppress potential inconsistencies in irrelevant parts. Therefore, in this paper, we propose a novel Consistency-learning Fine-grained Fusion Network (CFFN) that separately explores the consistency and inconsistency from high-relevant and low-relevant word-region pairs. Specifically, for a multimodal post, we divide word-region pairs into high-relevant and low-relevant parts based on their relevance scores. For the high-relevant part, we follow the cross-modal attention mechanism to explore the consistency. For the low-relevant part, we calculate inconsistency scores to capture inconsistent points. Finally, a selection module is used to choose the primary clue (consistency or inconsistency) for identifying the credibility of multimodal news. Extensive experiments on two public datasets demonstrate that our CFFN substantially outperforms all the baselines. Our code can be found at:
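
The division of word-region pairs by relevance score can be sketched as follows. The abstract does not state how the split is made, so the fixed threshold, the function name `split_by_relevance`, and the toy words and region labels are assumptions made purely for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (word, region id, relevance score)


def split_by_relevance(
    pairs: List[Pair], threshold: float = 0.5
) -> Tuple[List[Pair], List[Pair]]:
    """Split word-region pairs into (high-relevant, low-relevant) lists by score."""
    high = [p for p in pairs if p[2] >= threshold]
    low = [p for p in pairs if p[2] < threshold]
    return high, low


toy_pairs = [
    ("flood", "region_0", 0.92),
    ("street", "region_1", 0.71),
    ("shark", "region_2", 0.08),
]
high_part, low_part = split_by_relevance(toy_pairs)
print(high_part)
print(low_part)
```

Under this sketch, the high-relevant list would feed the cross-modal attention branch and the low-relevant list would feed the inconsistency scoring branch.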

I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction

  • Yusheng Huang
  • Zhouhan Lin

Multimodal information extraction, which requires aggregating representations from different modalities, is attracting growing research attention. In this paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task, which contains two modules. Firstly, the intra-sample relationship modeling module operates on a single sample and aims to learn effective representations. Embeddings from textual and visual modalities are shifted to bridge the modality gap caused by distinct pre-trained language and image models. Secondly, the inter-sample relationship modeling module considers relationships among multiple samples and focuses on capturing the interactions. An AttnMixup strategy is proposed, which not only enables collaboration among samples but also augments data to improve generalization. We conduct extensive experiments on the multimodal named entity recognition datasets Twitter-2015 and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our proposed method I2SRM achieves competitive results: 77.12% F1-score on Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.

Occlusion-Aware Manga Character Re-identification with Self-Paced Contrastive Learning

  • Ci-Yin Zhang
  • Wei-Ta Chu

Existing methods for manga character re-identification primarily rely on facial information, overlooking the unique characteristics of characters’ bodies and failing to address common challenges like occlusion by speech balloons and incomplete body parts. To tackle these issues, we propose a method called Occlusion-Aware Manga Character Re-identification (OAM-ReID) with self-paced contrastive learning, which leverages annotated body data from the Manga109 dataset for training. By synthesizing data with occluded speech balloons and incomplete bodies, we empower the framework to be aware of occlusion, so that more effective feature representations are learnt. Experimental results show that this approach outperforms the state-of-the-art person ReID method.

DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation

  • Yipeng Leng
  • Qiangjuan Huang
  • Zhiyuan Wang
  • Yangyang Liu
  • Haoyu Zhang

Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks such as text-to-image generation and image inpainting. However, compared to other generative methods like VAEs and GANs, DPMs lack a low-dimensional, interpretable, and well-decoupled latent code. Recently, diffusion autoencoders (Diff-AE) were proposed to explore the potential of DPMs for representation learning via autoencoding. Diff-AE provides an accessible latent space that exhibits remarkable interpretability, allowing us to manipulate image attributes based on latent codes from the space. However, previous works are not generic, as they only operate on a few limited attributes. To further explore the latent space of Diff-AE and achieve a generic editing pipeline, we propose a module called Group-supervised AutoEncoder (dubbed GAE) for Diff-AE to achieve better disentanglement of the latent code. Our proposed GAE is trained via an attribute-swap strategy to acquire the latent codes for multi-attribute image manipulation based on examples. We empirically demonstrate that our method enables multiple-attribute manipulation and achieves convincing sample quality and attribute alignments, while significantly reducing computational requirements compared to pixel-based approaches for representational decoupling.

A Lightweight and Efficient Model for Audio Anti-Spoofing

  • Qiaowei Ma
  • Jinghui Zhong
  • Yitao Yang
  • Weiheng Liu
  • Ying Gao
  • Wing Ng

With the rapid development of speech conversion and speech synthesis algorithms, automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. In recent years, researchers have proposed anti-spoofing systems based on hand-crafted features. However, using hand-crafted features rather than the raw waveform loses implicit information for audio anti-spoofing. Inspired by the promising performance of ConvNeXt in classification tasks, we reference the network architecture design of ConvNeXt and propose a Lightweight and Efficient Model for Audio Anti-Spoofing (LEMAAS). With no preceding feature extraction process, we employ raw waveforms as direct inputs to our proposed model. By integrating the channel attention module and using the focal loss function, the proposed model can focus on the most informative feature representations of speech and on the difficult samples that are hard to classify. Experimental results show that our proposed system achieves an equal error rate of 0.64% and a min-tDCF of 0.0187 on the ASVspoof 2019 LA evaluation dataset, which outperforms the state-of-the-art systems. Moreover, even when trained only on the ASVspoof 2019 LA dataset, the model still achieves equal error rates of 0.86% and 1.18% on the ASVspoof 2015 development dataset and evaluation dataset, respectively. This demonstrates that our model has achieved promising generalization performance during cross-dataset testing.
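
The way focal loss shifts attention toward hard samples can be seen in a small sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The abstract does not give the hyperparameters used, so `gamma = 2.0` and `alpha = 0.25` are common defaults assumed here, and `binary_focal_loss` is a hypothetical helper name.

```python
import math


def binary_focal_loss(
    p: float, label: int, gamma: float = 2.0, alpha: float = 0.25
) -> float:
    """Binary focal loss for a predicted positive-class probability p and a 0/1 label."""
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample contributes far less than a misclassified one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(easy, hard)
```

With `gamma = 0` and `alpha = 1` the expression reduces to ordinary cross-entropy for the positive class, so the modulating factor `(1 - p_t) ** gamma` is what down-weights easy samples.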

SOFTCUTMIX: Data Augmentation and Algorithmic Enhancements for Cross-Modality Person Re-Identification

  • Yuxiang Wan
  • Banghai Wang
  • Lunke Fei

One of the primary challenges in achieving Infrared-Visible Person Re-Identification (IV Re-ID) is the significant modality difference between visible (VIS) and infrared (IR) images. To address this challenge, we propose a new data augmentation method, SOFTCUTMIX, and introduce a new algorithm called SOFTCUTMIX Auxiliary Modality (SCAM). The SOFTCUTMIX augmentation strategy randomly crops and blends portions of two images with random weights, and meanwhile blends their non-cropped portions with other random weights. The SCAM algorithm generates mixed-modality images by blending visible light and infrared images, and these serve as an auxiliary modality to reduce the inherent modality differences. We also design a Channel Random Selection (CRS) to adjust the channels of the three-channel visible light image to reduce differences with the single-channel infrared image. Furthermore, we propose a Weighted Regularization Center Triplet Loss (WRCT) and combine it with the Weighted Regularization Triplet Loss (WRT). This approach reduces intra-class variations and increases inter-class separability, thereby enhancing the discriminative power of the learned features. Experimental results on the SYSU-MM01 and RegDB datasets demonstrate that our algorithm significantly outperforms the state-of-the-art method.
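
One reading of the crop-and-blend description is sketched below: a random box is blended with one weight and everything outside the box with another. The box sampling scheme, the uniform weight distribution, and the name `soft_cutmix` are assumptions for illustration, and grayscale nested lists stand in for real VIS and IR images.

```python
import random
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (y0, x0, y1, x1), end-exclusive


def soft_cutmix(
    img_a: Image, img_b: Image, rng: random.Random
) -> Tuple[Image, Box, float, float]:
    """Blend two same-sized images with one weight inside a random box and another outside."""
    height, width = len(img_a), len(img_a[0])
    y0 = rng.randrange(height)
    x0 = rng.randrange(width)
    y1 = rng.randrange(y0 + 1, height + 1)
    x1 = rng.randrange(x0 + 1, width + 1)
    w_in, w_out = rng.random(), rng.random()  # independent random blend weights
    mixed = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = y0 <= y < y1 and x0 <= x < x1
            w = w_in if inside else w_out
            row.append(w * img_a[y][x] + (1.0 - w) * img_b[y][x])
        mixed.append(row)
    return mixed, (y0, x0, y1, x1), w_in, w_out


# Toy inputs: an all-ones "visible" image and an all-zeros "infrared" image.
visible = [[1.0] * 6 for _ in range(4)]
infrared = [[0.0] * 6 for _ in range(4)]
mixed_img, box, w_inside, w_outside = soft_cutmix(visible, infrared, random.Random(0))
print(box, w_inside, w_outside)
```

Because the toy images are constant, each output pixel equals the blend weight that was applied to it, which makes the two regions easy to inspect.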

Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios

  • Zehan Tan
  • Weidong Yang
  • Zhiwei Wang

This work introduces a pioneering, engineerable approach to 3D visual localization (3DVG). Current challenges for 2D Visual Grounding (2DVG) and 3DVG are summarized: Absence of Depth Information in 2DVG, Memory and Computational Demands of Global Point Clouds, Limitations in Dynamic Scenarios, and Limited Understanding of Spatial Localization Reference Frames. Our solution proposes a Re_3DVG method for fragmented point cloud scenarios. Utilizing instance segmentation and transformer models, our approach offers a potent mechanism for establishing robust correspondences between text queries and object instances within the shared visible range. The introduction of a FragCloud3DRef dataset, grounded in ScanNet and supplemented with RGB data, object segmentation, and textual descriptions, fortifies the effectiveness of our proposed model. Experimental results show that our model outperforms conventional 3DVG and 2DVG models, establishing a formidable benchmark for future research within this discipline. The code source and dataset are open at

Speech Spoofing Detection Based on Graph Attention Networks with Spectral and Temporal Information

  • Peng Zhang
  • Yida Chen
  • Meijuan Li
  • Hui Zhao
  • Jianqiang Zhang
  • Fuqiang Wang
  • Xiaoming Wu

Automatic speaker verification (ASV) systems are vulnerable to synthetic speech attacks. Synthetic algorithms usually introduce artifacts in specific sub-bands or time segments. However, under unknown spoofing attacks, it is challenging to choose the right domain for effective detection. In this paper, we propose a speech spoofing detection method based on graph attention networks with spectral and temporal information. First, high-level features of raw audio are extracted using SENet channel attention to enhance the spatial correlation between speech frames. Then, spectral graph and temporal graph are constructed for the high-level features using graph attention networks. Finally, we design a new heterogeneous multi-domain co-graph attention module to process the information from different domains for effective speech spoofing detection. The proposed model was evaluated on the ASVspoof 2019 dataset and obtains a min t-DCF of 0.0264 and an EER of 0.94%, exhibiting competitive performance. Experiments also show its effectiveness when detecting unknown types of attacks.

Achieving Privacy-Preserving Multi-View Consistency with Advanced 3D-Aware Face De-identification

  • Jingyi Cao
  • Bo Liu
  • Yunqian Wen
  • Rong Xie
  • Li Song

The widespread application of face recognition technology has exacerbated privacy threats. Face de-identification is an effective means of protecting visual privacy by concealing identity information. While deep learning-based methods have greatly improved de-identification results, most existing algorithms rely on 2D generative models that struggle to produce identity-consistent results for multiple views. In this paper, we focus on identity disentanglement within the latest 3D-aware face generation model, and propose an advanced face de-identification framework that can be applied to various scenarios. Our proposed framework disentangles identity from other facial features, modifies only the former, and generates the de-identified face using a 3D generator. This approach results in high-quality, identity-consistent de-identification that preserves other facial features. We demonstrate our approach on StyleNeRF, one of the most widely-used style-based neural radiance field models. Through extensive experiments, we demonstrate the effectiveness of our approach in achieving face de-identification both for a single image and for groups of images with the same identity. Our work is a significant step forward in the field of face de-identification, opening up new possibilities for practical applications.

Moving Inside the Box: Interacting with Interpretation of Historical Artefacts Through Tangible Augmented Reality

  • Suzanne Kobeisse
  • Lars Erik Holmquist

We present ARcheoBox, a walk-up-and-use prototype for interacting with interpretation of historical artefacts using tangible augmented reality. ARcheoBox enables users to manipulate virtual representations and interact with interpretation of historical artefacts using cylinder-shaped generic proxies. We also leverage the user interactions with interpretation using three interaction techniques “Move”, “Rotate”, and “Flip” as output modalities in AR. The prototype consists of a wooden box, a tablet display, and generic proxies, which means ARcheoBox does not require any head-mounted displays (HMDs), handheld controllers, or haptic gloves. We conducted a user study with 25 participants in which the findings demonstrate the advantages of tangible AR over more conventional interaction modalities presented in museums such as touch screens. Finally, we present a set of design recommendations for designing tangible AR that enhances the user's interaction experience with historical artefacts.

A Decoupled Cross-layer Fusion Network with Bidirectional Guidance for Detecting Small Logos

  • Songhui Zhao
  • Sujuan Hou
  • Baisong Zhang

Logo detection involves the use of machine learning algorithms to recognize and locate logos in images and videos, which has applications in a wide range of industries, including e-commerce, advertising, and entertainment. However, detecting small logos is still a challenging task due to their limited coverage of pixels and unclear details, resulting in insufficient feature information for detection. Therefore, they are often easily confused by complex backgrounds and have lower perturbation tolerance to the bounding box, making them more difficult to detect compared to medium and large-scale logos. To address this problem, we propose a Decoupled Cross-layer Fusion Network (DCFNet) that enhances the feature representation of small logo objects, resulting in excellent detection performance. Specifically, the proposed DCFNet first adopts a bidirectional cross-layer connection mechanism to capture complementary information between different layers. Next, a two-phase feature averaging and enhancement strategy is used to further enhance the features. In the detection phase, DCFNet decouples the classification and bounding box regression branches into two identical Fully Connected (FC) heads, improving the accuracy of small logo classification and localization by avoiding mutual interference between the branches. Extensive experiments conducted on three publicly available logo datasets demonstrate that DCFNet achieves state-of-the-art performance in detecting small logos.

Robust Tracking via Unifying Pretrain-Finetuning and Visual Prompt Tuning

  • Guangtong Zhang
  • Qihua Liang
  • Ning Li
  • Zhiyi Mo
  • Bineng Zhong

The finetuning paradigm has been a widely used methodology for the supervised training of top-performing trackers. However, the finetuning paradigm faces one key issue: it is unclear how best to perform the finetuning method to adapt a pretrained model to tracking tasks while alleviating the catastrophic forgetting problem. To address this problem, we propose a novel partial finetuning paradigm for visual tracking via unifying pretrain-finetuning and visual prompt tuning (named UPVPT), which can not only efficiently learn knowledge from the tracking task but also reuse the prior knowledge learned by the pre-trained model to effectively handle various challenges in tracking tasks. Firstly, to maintain the pre-trained prior knowledge, we design a Prompt-style method to freeze some parameters of the pretrained network. Then, to learn knowledge from the tracking task, we update the parameters of the prompt and MLP layers. As a result, we can not only retain useful prior knowledge of the pre-trained model by freezing the backbone network but also effectively learn target domain knowledge by updating the Prompt and MLP layers. Furthermore, the proposed UPVPT can easily be embedded into existing Transformer trackers (e.g., OSTracker and SwinTracker) by adding only a small number of model parameters (less than 1% of a Backbone network). Extensive experiments on five tracking benchmarks (i.e., UAV123, GOT-10k, LaSOT, TNL2K, and TrackingNet) demonstrate that the proposed UPVPT can improve the robustness and effectiveness of the model, especially in complex scenarios.

Efficient Hand Gesture Recognition using Multi-Task Multi-Modal Learning and Self-Distillation

  • Jie-Ying Li
  • Herman Prawiro
  • Chia-Chen Chiang
  • Hsin-Yu Chang
  • Tse-Yu Pan
  • Chih-Tsun Huang
  • Min-Chun Hu

In this paper, we propose a lightweight model for hand gesture recognition using an RGB camera. The proposed model enables recognition of first-person hand gestures using a single camera and achieves near-real-time computational performance on both high-end and low-end computing devices. The proposed framework utilizes multi-task multi-modal learning and self-distillation to deal with the challenges in hand gesture recognition. We integrate additional modalities (depth) and a future prediction mechanism to enhance the model’s ability to learn spatio-temporal information. Furthermore, we employ self-distillation to compress the model, achieving a balance between accuracy and computational efficiency. We compared the proposed hand gesture recognition model with the state-of-the-art method, and our model outperforms the SOTA by 0.88% and 3.52% on the EgoGesture and NVGesture datasets, respectively. In terms of computational efficiency, our model takes only 161 ms on average to recognize a gesture on a device with low-end GPUs (NVIDIA Jetson TX2), which is acceptable for interaction in XR applications.

Image Cropping under Design Constraints

  • Takumi Nishiyasu
  • Wataru Shimoda
  • Yoichi Sato

Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media content is often required to satisfy various constraints, such as an aspect ratio and blank regions for placing texts or objects. We call this problem image cropping under design constraints. To achieve image cropping under design constraints, we propose a score function-based approach, which computes scores indicating whether cropped results are aesthetically plausible and satisfy design constraints. We explore two derived approaches, a proposal-based approach and a heatmap-based approach, and we construct a dataset for evaluating the performance of the proposed approaches on image cropping under design constraints. In experiments, we demonstrate that the proposed approaches outperform a baseline, and we observe that the proposal-based approach is better than the heatmap-based approach under the same computation cost, but the heatmap-based approach leads to better scores when the computation cost is increased. The experimental results indicate that balancing aesthetically plausible regions and satisfying design constraints is not a trivial problem and requires sensitive balance, and both proposed approaches are reasonable alternatives.

Learning a Contextualized Multimodal Embedding for Zero-shot Cooking Video Caption Generation

  • Lin Wang
  • Hongyi Zhang
  • Xingfu Wang
  • Yan Xiong

This paper proposes CookingCLIP, which introduces the latest CLIP (Contrastive Language-Image Pre-training) embedding from the general domain into the specific domain of cooking understanding, and makes two adaptations to the original CLIP embedding for better customization to cooking understanding problems: 1) from the upstream perspective, we extend the static multi-modal CLIP embedding with a temporal dimension, to facilitate context-aware semantic understanding; 2) from the downstream perspective, we introduce the concept of zero-shot embedding to sequence-to-sequence dense prediction domains, making CLIP capable of telling not only “Which” (cross-modal recognition) but also “When” (cross-context localization). Experiments conducted on two challenging cooking caption generation benchmarks, YouCook and CrossTask, demonstrate the effectiveness of the proposed embedding.

MA-Net: Multi-Attention Network for Skeleton-Based Action Recognition

  • Jingwen Cui
  • Qian Huang
  • Chang Li
  • Yunfei Zhang

Graph Convolution Networks (GCNs) have become the mainstream framework for skeleton-based action recognition tasks. Aiming at the problem of redundant spatial-temporal feature information and neighborhood constraints in GCNs, we propose a novel method called Multi-Attention Network (MA-Net) to explore crucial skeleton information, including two main modules: Combined Attention Graph Convolution (CAGC) and Multi-layer Transposed Attention Encoding (MTAE). The CAGC utilizes multi-dimensional combination attention to capture more valuable information and enhance feature performance. The MTAE adopts self-attention to encode feature maps, effectively establishing long-range dependency and capturing global information. Centered on the attention mechanism, these two modules combine the complementary advantages of GCN (i.e., local topology and temporal dynamics) and Transformer (i.e., global context and dynamic attention). Extensive experiments on the challenging NTU-RGB+D 60 and Kinetics-Skeleton datasets demonstrate that our model performs excellently.

A Spatial-Spectral Decoupling Fusion Framework for Visible and Near-Infrared Images

  • Zhenglin Tang
  • Hai-Miao Hu

The visible and near-infrared image fusion aims to generate an image that integrates complementary information from images captured in different spectral bands. However, existing fusion methods either only focus on the fusion of spatial information, or fuse spatial and spectral information without decoupling, resulting in undesirable effects such as halo artifacts, information loss, and inferior visual quality. To address these issues, we propose a Spatial-Spectral Decoupling Fusion (SSDF) framework that can effectively fuse the spatial and spectral information of visible and near-infrared images. The SSDF framework decomposes the image pairs into two main branches: the Spatial Feature Enhancement (SFE) branch and the Spectral Characteristic Preservation (SCP) branch. The SFE branch enhances the salient details in the fused image by exploiting the contrast between spatial features and generating region-based fusion weights, while the SCP branch preserves the intrinsic spectral characteristics of the scene by fusing the reflectance characteristics of visible and near-infrared images. The final image is obtained by combining the spatial and spectral information. We conduct extensive experiments to show that our SSDF method can achieve superior fusion performance in subjective visual quality and objective metrics compared with state-of-the-art methods.

From Global to Local: An Adaptive Environmental Illumination Estimation for Non-uniform Scattering

  • Huaizhuo Liu
  • Hai-Miao Hu

The atmospheric scattering model is one of the most widely used models to describe the optical imaging process of hazy images. However, the global atmospheric light used in the traditional atmospheric scattering model has limitations in describing images with varying local environmental illumination. In this paper, by extending the global atmospheric light to the local illumination, a non-uniform scattering model is proposed, which can better describe real scenes under non-uniform environmental illumination. Based on this model, an adaptive local illumination estimation for hazy images is proposed, which can adapt to the local differences of environmental illumination. The experimental results demonstrate that the proposed algorithm can outperform the state-of-the-art algorithms in terms of not only non-uniform scattering removal but also adaptability.
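
For reference, the classical atmospheric scattering model and one plausible form of the extension can be written as below. Here I(x) is the observed hazy image, J(x) the haze-free scene radiance, t(x) the transmission, and A the global atmospheric light. The second equation, including the symbol L(x) for local illumination, is an assumed notation for "extending the global atmospheric light to the local illumination" and may differ from the formulation in the paper.

```latex
% Classical model with a single global atmospheric light A
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)

% Assumed non-uniform form: A is replaced by a spatially varying local illumination L(x)
I(x) = J(x)\,t(x) + L(x)\,\bigl(1 - t(x)\bigr)
```

When L(x) is constant over the image, the second form reduces to the first, so the global model is a special case of the local one under this notation.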

Key Parts Spatio-Temporal Learning for Video Person Re-identification

  • Wei Guo
  • Hao Wang

Person re-identification (Re-ID) is a technology to identify specific pedestrians in different scenarios. In recent years, Re-ID has been widely used in surveillance, supermarkets, and smart cities. However, there are still many challenges in this field, including complex background, pose changes, and occlusion dislocation. We propose a novel Key Parts Spatio-temporal Learning (KSTL) framework to alleviate the above problems. Specifically, we first use the mask method based on keypoint detection to locate and extract the key part features of the human body. Then, we introduce a Spatio-temporal Learning (STL) block based on key parts to realize the mutual transfer and learning of key part features across multiple frames. Finally, we fuse the learned key part features and global features as the final video representation. The proposed method can not only accurately learn the features of key parts, but also make full use of the timing information in the video, thus achieving good detection results. We conduct extensive experiments on three public benchmarks, and the results demonstrate the effectiveness and superiority of KSTL.

Mask-based Food Image Synthesis with Cross-Modal Recipe Embeddings

  • Chen Zhongtao
  • Yuma Honbu
  • Keiji Yanai

In this paper, we propose a Mask-based Recipe Embedding GAN (MRE-GAN), which enables us to generate a realistic food image based on a given mask image containing single or multiple food regions with cross-modal recipe embeddings for each food region. Thus, we can change meal shapes by modifying mask images, while by editing recipe text, we can change meal appearance. Our experimental findings confirmed that the proposed method could generate higher quality food images than the baselines, and we could change meal shapes and appearances by editing mask images and recipe texts as we liked.

AniCropify: Image Matting for Anime-Style Illustration

  • Yuki Matsuura
  • Takahiro Hayashi

Recently, deep learning-based image matting methods have emerged. However, the existing methods lack the capability to provide precise matting for anime-style illustrations because their network parameters are trained primarily on photo-realistic images. In this paper, we introduce a new anime image dataset, Chara-1M, designed for matting purposes. In addition, we propose AniCropify, a new matting method for character anime images. Focusing on the commonalities of representation between anime images and photo-realistic images, in AniCropify, an anime image is first converted into a photo-realistic image. From the converted image, a trimap is generated to identify the human regions in images. By using the trimap in the matting process, precise alpha masks of anime images can be obtained. From experiments, we confirmed that based on the quality evaluation of matting results, the proposed method received the highest rating compared to other state-of-the-art techniques.

Targeted Transferable Attack against Deep Hashing Retrieval

  • Fei Zhu
  • Wanqian Zhang
  • Dayan Wu
  • Lin Wang
  • Bo Li
  • Weiping Wang

With the extensive utilization of deep hashing, there exists a surging interest in studying adversarial attacks against it. Previous methods have demonstrated superior white-box attack performance against deep hashing. However, the more challenging and realistic targeted black-box attack has not yet been explored sufficiently, which results in an overestimation of model robustness. In this paper, we focus on targeted black-box attacks based on transferability, and propose a novel Targeted Transferable Attack method against deep hashing with a Generative Adversarial Network (TTA-GAN). Specifically, we first propose a new Iterative Anchor code Optimization (IAO) method to generate an anchor code with superior representative semantics of the target label, which can improve both targeted white-box and black-box performance. Then, we propose a generation-based method to directly generate targeted transferable adversarial examples by training a conditional generator and a discriminator. Moreover, to further promote targeted transferability, we apply multiple input transformations to the generated adversarial examples to alleviate overfitting to the source model. Finally, we extend our method to a novel model ensemble attack method, TTA-GANens, specialized for deep hashing, to preserve the representative semantics on multiple models. Extensive experiments demonstrate that our method achieves superior targeted black-box attack performance compared with state-of-the-art methods.

Prior Knowledge Guided Network for Video Anomaly Detection

  • Zhewen Deng
  • Dongyue Chen
  • Shizhuo Deng

Video Anomaly Detection (VAD) involves detecting anomalous events in videos, presenting a significant and intricate task within intelligent video surveillance. Existing studies often concentrate solely on features acquired from limited normal data, disregarding the latent prior knowledge present in extensive natural image datasets. To address this constraint, we propose a Prior Knowledge Guided Network (PKG-Net) for the VAD task. First, an auto-encoder network is incorporated into a teacher-student architecture to learn two designated proxy tasks: future frame prediction and teacher network imitation, which can provide better generalization ability on unknown samples. Second, knowledge distillation on proper feature blocks is also proposed to increase the multi-scale detection ability of the model. In addition, prediction error and teacher-student feature inconsistency are combined to evaluate anomaly scores of inference samples more comprehensively. Experimental results on three public benchmarks validate the effectiveness and accuracy of our method, which surpasses recent state-of-the-art methods.
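
As a rough illustration of how prediction error and teacher-student feature inconsistency might be combined into one score, the sketch below min-max normalizes both per-frame cues over a video and takes a weighted sum. The function names, the min-max normalization, and the mixing weight `lam` are assumptions made for illustration; the abstract does not specify the fusion rule.

```python
def min_max(values):
    """Scale a list of floats to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def anomaly_scores(pred_errors, feat_gaps, lam=0.5):
    """Combine two per-frame cues into one anomaly score per frame.

    pred_errors: per-frame future-frame prediction error (e.g. MSE).
    feat_gaps:   per-frame teacher-student feature distance.
    lam:         hypothetical mixing weight, not taken from the paper.
    """
    if len(pred_errors) != len(feat_gaps):
        raise ValueError("both cues need one value per frame")
    e = min_max(pred_errors)
    g = min_max(feat_gaps)
    return [lam * ei + (1.0 - lam) * gi for ei, gi in zip(e, g)]
```

Frames whose score is close to 1 are the ones on which both cues agree that something unusual is happening.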

Feature Enhancement and Foreground-Background Separation for Weakly Supervised Temporal Action Localization

  • Peng Liu
  • Chuanxu Wang
  • Jianwei Qin
  • Guocheng Lin

Weakly-supervised Temporal Action Localization (W-TAL) is a significant task in video understanding, intending to recognize the category and pinpoint the temporal boundaries of action segments in untrimmed videos based on video-level labels. Due to the lack of frame-level annotations, recognizing the spatiotemporal relationships among action snippets and precisely separating the foreground and background are two arduous challenges. In this paper, we propose a novel Feature Enhancement and Foreground-Background Separation (FE-FBS) method to address these problems. Specifically, we construct a spatiotemporal cooperation enhancement scheme utilizing residual graph convolution to capture the spatial and temporal dependencies between the current snippet and others, generating snippet features with spatial and temporal correlations, thereby ensuring a complete feature representation for action localization. Furthermore, we propose a two-branch explicit foreground-background joint attention mechanism to guide foreground and background modeling, combined with an inverse enhancement strategy to enhance action weight for better foreground and background distinction, thereby improving the accuracy of action localization. On the THUMOS’14 and ActivityNet v1.3 datasets, our method achieves accuracy rates of 66.7% and 38.9%, respectively, outperforming the compared methods.

Multi-view–enhanced modal fusion hashing for Unsupervised cross-modal retrieval

  • Longfei Ma
  • Honggang Zhao
  • Zheng Jiang
  • Mingyong Li

Cross-modal hashing is an important direction for multimodal data management and applications, which has recently received more and more attention. Unsupervised cross-modal retrieval does not rely on tag information and is more applicable to the real world. However, it still faces some problems. Existing methods mainly encode either local features or global features, and negative samples easily cause noise interference. To solve these problems, we propose Multi-view–enhanced modal fusion hashing for Unsupervised cross-modal retrieval (MUCH). Firstly, we propose a multi-view network. Images inherently contain richer semantics, and we employ a multi-view network to observe the image from different perspectives and obtain the overall and local features of the image. Secondly, we introduce a noise cancellation module to approximate the cross-modal data feature alignment from both intra-modal and cross-modal perspectives before generating the hash code. Finally, we construct a distribution-based similarity weighting matrix to replace the graphical similarity matrix. We also performed multi-view enhancement experiments on JDSH and CIRH, with improvements of 1% to 2% over DAEH on all three datasets.

Hierarchical Multi-Scale Adaptive Conv-LSTM Network for Human Action Recognition Based on Wearable Sensors

  • Weiliang Xie
  • Qian Huang
  • Chang Li
  • Yanfang Wang
  • Yanwei Liu

Recently, human action recognition has been widely used in the fields of health monitoring, human-robot interaction, medical treatment, and sports. Due to the availability of various wearable devices on the market, we can easily access sensor data for human action recognition. However, it is still a challenge to capture minute action processes as well as extract spatio-temporal motion patterns from serial sensor data. Therefore, we propose a novel hierarchical multi-scale adaptive Conv-LSTM network structure called HMA Conv-LSTM. The finer-grained spatial information in the sensor signals is extracted by hierarchical multi-scale convolution. The multi-channel feature fusion through adaptive channel feature fusion retains important information and improves model efficiency. We capture temporal context information by dynamic channel selection-LSTM based on the attention mechanism. Extensive experiments on the Opportunity and PAMAP2 public datasets show that our proposed model achieves competitive performance compared to several state-of-the-art approaches.

Multi-Task Self-Blended Images for Face Forgery Detection

  • Po-Han Huang
  • Yue-Hua Han
  • Ernie Chu
  • Jun-Cheng Chen
  • Kai-Lung Hua

Deepfake detection has attracted extensive attention due to widespread forged images on social media. Recently, self-supervised learning (SSL) based Deepfake detection approaches have outperformed supervised methods in terms of model generalization. However, we notice that most SSL-based methods do not take the manipulation strength levels of synthesized forgery samples into consideration according to different synthesis parameters and result in suboptimal detection performances. To address this issue, we introduce several auxiliary losses to the state-of-the-art SSL-based method based on different synthesis sub-tasks during data generation by inferring their synthesis parameters where the ground-truth labels are obtained from the synthesis pipeline for free. With comprehensive evaluations on various benchmarks, our approach has achieved noticeable performance improvement. Specifically, for the cross-dataset evaluation, the proposed approach outperforms the state-of-the-art method in terms of AUC on various datasets with improvements of 3.4%, 1.47%, 1.56%, and 1.3% on the CDF, DFDC, DFDCP, and FFIW datasets and achieves competitive performance on the DFD dataset. This further demonstrates the effectiveness of the proposed approach in its generalization ability.

Power Efficient Mobile VTuber Live Streaming

  • Zichen Zhu
  • Stefano Petrangeli
  • Viswanathan Swaminathan
  • Sheng Wei

Virtual YouTuber (VTuber) live streaming, which renders and streams a virtual avatar of the real-person streamer on top of the live camera view, has gained significant popularity recently. Despite the engaging user experience, the intensive and power-consuming computations required by VTuber, such as facial feature extraction and avatar rendering, pose significant challenges to the constrained battery life of the mobile device. We develop a power efficient VTuber live streaming system by offloading the camera view and the computation-intensive operations from the mobile device to an edge server, which not only significantly reduces the power consumption of the mobile device but also enables larger-scale rendering of multiple avatars that are not feasible in the existing mobile VTuber systems. Furthermore, to reduce the bandwidth overhead caused by the camera view offloading, we develop an adaptive framerate control mechanism to dynamically adjust the framerate of the offloaded camera view based on the variations of inter-frame luminance. Our evaluations on the end-to-end VTuber live streaming system demonstrate significant power savings with limited bandwidth, latency, and quality overhead.
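
The idea of adjusting the offloaded framerate from inter-frame luminance variation can be sketched as follows. This is only a minimal two-level illustration: the function names, the threshold, and the two framerate values are hypothetical, and the paper's actual control mechanism may differ. Frames are assumed to be nested lists of 8-bit (R, G, B) tuples, and luma is computed with the BT.601 weights.

```python
def mean_luminance(frame):
    """Average luma of a frame given as rows of (R, G, B) tuples (0-255)."""
    total, count = 0.0, 0
    for row in frame:
        for r, g, b in row:
            total += 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 luma weights
            count += 1
    return total / count


def choose_framerate(prev_frame, cur_frame, low_fps=10, high_fps=30, threshold=5.0):
    """Pick a framerate for the offloaded camera view.

    A large change in mean luminance suggests the scene is changing, so the
    higher framerate is used; otherwise the lower one saves bandwidth.
    The threshold and both framerates are illustrative assumptions.
    """
    delta = abs(mean_luminance(cur_frame) - mean_luminance(prev_frame))
    return high_fps if delta >= threshold else low_fps
```

A static camera view therefore stays at `low_fps`, while a sudden lighting or scene change switches to `high_fps`.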

Toward Optimal Real-time Dynamic Point Cloud Streaming over Bandwidth-constrained Networks

  • Quang Long Nguyen
  • Duc Nguyen
  • Huong Thu Truong

Point cloud is the emerging format for representing real-world objects in VR/AR applications. However, real-time streaming of dynamic point clouds presents challenges due to high data rates and low latency requirements. This paper introduces a novel and bandwidth-efficient streaming approach for scenes consisting of multiple dynamic point clouds over networks with limited bandwidth. The proposed approach dynamically adjusts the Level of Detail (LoD) of individual point clouds based on network conditions and user preferences to optimize the user’s Quality of Experience (QoE). The LoD version selection problem is formulated as a QoE optimization problem, and two real-time solutions are presented for deciding the LoD version for each point cloud. Experimental results demonstrate that the proposed method outperforms the existing methods in terms of visual quality while achieving remarkably low processing time, about 0.01 ms. These findings have the potential to advance seamless user experience.
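
To make the LoD version selection problem concrete, the sketch below treats it as choosing one version per point cloud so that total bitrate stays within a bandwidth budget while summed quality is as high as possible. It uses a generic greedy heuristic (upgrade the object with the best quality gain per extra bit), which is an assumption for illustration and not necessarily either of the two solutions presented in the paper. The `(bitrate, quality)` tuples and the additive quality model are likewise hypothetical.

```python
def select_lods(clouds, budget):
    """Greedy LoD selection under a bandwidth budget.

    clouds: one list per point cloud; each entry is a (bitrate, quality)
            tuple, sorted by increasing bitrate.
    budget: total bitrate available.
    Returns the chosen version index for each point cloud.
    """
    choice = [0] * len(clouds)
    used = sum(versions[0][0] for versions in clouds)
    if used > budget:
        raise ValueError("budget cannot fit even the lowest LoD of every object")
    while True:
        best, best_ratio = None, 0.0
        for i, versions in enumerate(clouds):
            nxt = choice[i] + 1
            if nxt >= len(versions):
                continue  # already at the highest LoD
            extra_bits = versions[nxt][0] - versions[choice[i]][0]
            gain = versions[nxt][1] - versions[choice[i]][1]
            if extra_bits <= 0 or used + extra_bits > budget:
                continue  # upgrade is malformed or does not fit
            ratio = gain / extra_bits
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return choice  # no affordable upgrade improves quality
        used += clouds[best][choice[best] + 1][0] - clouds[best][choice[best]][0]
        choice[best] += 1
```

Each pass scans all objects once, so the loop cost grows with the number of objects times the number of versions.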

RecipeMeta: Metapath-enhanced Recipe Recommendation on Heterogeneous Recipe Network

  • Jialiang Shi
  • Takahiro Komamizu
  • Keisuke Doman
  • Haruya Kyutoku
  • Ichiro Ide

A recipe is a set of instructions that describes how to make food. It can guide people through food preparation, from the preparation of ingredients to the cooking process, and recipes are increasingly in demand on the Web. To help users find suitable recipes among the vast number available on the Web, we address the task of recipe recommendation. Due to multiple data types and relationships in a recipe, we can treat it as a heterogeneous network to describe its information more accurately. To effectively utilize the heterogeneous network, metapath was proposed to describe the higher-level semantic information between two entities by defining a compound path from peer entities. Therefore, we propose a metapath-enhanced recipe recommendation framework, RecipeMeta, that combines GNN (Graph Neural Network)-based representation learning and specific metapath-based information in a recipe to predict User-Recipe pairs for recommendation. Through extensive experiments, we demonstrate that the proposed model, RecipeMeta, outperforms state-of-the-art methods for recipe recommendation.

FTUnet: Feature Transferred U-Net For Single HDR Image Reconstruction

  • Shifeng Xie
  • Yi Liu
  • Wenjing Shuai

The development of display technology supports the application of High Dynamic Range (HDR)-enabled devices. In order to meet the surging demand for HDR media content, we propose a feature-transferred U-shaped network (FTUnet) to convert existing Standard Dynamic Range (SDR) images into their HDR counterparts. The proposed FTUnet is a feature transformation network that converts the encoded SDR features to HDR features. This transformation network extracts features rich in spatial information by a self-attention mechanism, in order to improve the reconstruction of the over-exposed regions and avoid unreasonable patches. Besides, we propose an Excitation-Restoration (ER) sub-network to incorporate an inter-channel attention mechanism. The ER network is used to remove redundant information between channels and preserve the key features. Therefore, the proposed FTUnet can efficiently merge feature channels, which contributes to better color accuracy in the generated HDR images. Experimental results show that our proposed FTUnet achieves state-of-the-art performance in both quantitative comparison and visual quality for single HDR image reconstruction. An ablation study is also performed to demonstrate the effectiveness of each module of the proposed FTUnet.

Research on Multi-Person Pose Estimation Based on YOLO and Decoupled Multi-Level Feature Layers Fusion

  • Bin Zheng
  • He Zhang
  • Lu Jin

Multi-person pose estimation is fundamental research in the fields of AIGC, multimedia understanding, virtual reality, human-computer interaction, etc. Existing algorithms have problems such as large computational complexity, low accuracy, and an inability to effectively predict occlusions, making them difficult to deploy on devices with limited computational resources. This paper introduces coordinate attention mechanisms within the YOLOv7tiny framework, improves the OKS loss function tailored for pose estimation tasks, and proposes the FDPose method that decouples bounding box detection and pose regression tasks by fusing different feature layers. Experimental results show that compared with the original algorithm, the standard version FDPose-n used in this paper reduces the number of parameters by 14.8% while improving AP by 9.7%. In comparison with some similarly sized coordinate regression based bottom-up algorithms, FDPose-n achieves higher AP than Associative Embedding, DeepPose, YOLOv5s-pose, YOLOv8n-pose, etc. We train on the CrowdPose dataset without any pre-trained weights.
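
For readers unfamiliar with OKS (Object Keypoint Similarity), the sketch below computes the standard COCO-style measure on which OKS-based losses are built: each visible keypoint contributes exp(-d^2 / (2 s^2 k^2)), where d is the distance between prediction and ground truth, s^2 is the object area, and k is a per-keypoint constant. This is the baseline definition only; the improved loss proposed in the paper is not described in the abstract and is not reproduced here. The function and argument names are illustrative.

```python
import math


def oks(pred, gt, visible, area, k):
    """COCO-style Object Keypoint Similarity for one person instance.

    pred, gt: lists of (x, y) keypoints.
    visible:  per-keypoint visibility flags (> 0 means labeled).
    area:     object area, used as the squared scale s^2.
    k:        per-keypoint falloff constants.
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2.0 * area * ki ** 2))
        den += 1
    return num / den if den else 0.0
```

A simple loss built on this measure is `1 - oks(...)`, which is zero for a perfect prediction and approaches one as keypoints drift away.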

Towards Representation Alignment and Uniformity in Long-tailed Classification

  • Yi Zheng
  • Zuqiang Meng

The long-tailed distribution is a commonly observed probability distribution in the real world, wherein a minority of classes possess a large number of samples while a majority of classes have only a few samples. This distribution pattern often leads to imbalanced learning, where the model’s performance becomes dominated by the head classes and the discriminative ability for tail classes deteriorates. Ideal attributes of representation learning include alignment and uniformity, which entail similar samples being close to each other and the uniform distribution of samples in the feature space to preserve maximal information. While optimizing these attributes directly on balanced datasets yields promising results, no prior efforts have focused on achieving them on long-tailed datasets. Therefore, we propose a novel learning strategy, BalAUM, which addresses this gap by explicitly controlling the optimization of uniformity and alignment, thereby improving the quality of representations. Specifically, we design a balanced alignment and uniformity loss within an AU (Alignment and Uniformity) loss framework. This loss incorporates class weights and class centers to alleviate the bias towards head classes, thus enhancing the optimization of uniformity and alignment for tail classes. Furthermore, considering the scarcity of instances in tail classes, we combine mixup with re-sampling to generate additional samples carrying tail class information, utilizing label re-weighting. This augmentation technique enhances the diversity of tail class samples, thereby improving their uniformity. Experimental results on the CIFAR10-LT, CIFAR100-LT, and ImageNet-LT datasets demonstrate that the BalAUM method achieves competitive performance.
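
The two properties can be made concrete with the commonly used unweighted alignment and uniformity losses: alignment is the mean squared distance between features of positive pairs, and uniformity is the log of the mean Gaussian potential over all feature pairs. The sketch below shows only this baseline form, assuming L2-normalized feature vectors; the class weights and class centers that BalAUM adds are not specified in the abstract and are not reproduced. Function names and the temperature `t` are illustrative.

```python
import math
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(anchors, positives):
    """Mean squared distance between each anchor and its positive sample."""
    return sum(sq_dist(a, p) for a, p in zip(anchors, positives)) / len(anchors)


def uniformity_loss(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct feature pairs.

    Lower values mean the features are spread more evenly over the sphere.
    """
    pairs = list(combinations(features, 2))
    potential = sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs)
    return math.log(potential / len(pairs))
```

Both losses are minimized: alignment reaches zero when positives coincide, while uniformity becomes more negative as features move apart.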

SFNet: Saliency fast Fourier convolutional Network for medical image segmentation

  • Shangwang Liu
  • Danyang Liu
  • Yinghai Lin
  • Ziqi Wei

Due to the limitation of local characteristics of convolution operations, the encoder of U-Net cannot effectively capture global context information; furthermore, the skip connections of U-Net fail to capture salient features in image segmentation tasks. Therefore, we put forward a Saliency fast Fourier convolutional Network (SFNet) for medical image segmentation. To begin with, we propose a SCAU attention module, which can highlight both spatial and channel attention for capturing not only global but also local information, paying more attention to the whole target regions of an image, and creating strong information association among samples to extract the characteristics of the overall dataset. Subsequently, instead of employing the convolution to set the encoder and decoder, we introduce a Fourier convolution module, FFconv, which has a non-local receptive field to fulfil cross-scale fusion in the convolution unit properly. Experimental results show that, on BUSI and Kvasir-SEG datasets, the mIOU and F1-score of our network reach 75.08% and 84.75%, respectively; our network greatly improves medical image segmentation performance.

Multi-head Siamese Prototype Learning against both Data and Label Corruption

  • Peng-Fei Zhang
  • Zi Helen Huang

The training of the Deep Neural Network (DNN) has been seriously challenged by insidious noise in the dataset, including noise in raw data and errors in annotations. Existing methods usually limit their efforts to the defense of one particular kind of noise, which would be powerless when facing the coexistence of various noise. To deal with it, we propose a novel Multi-head Siamese Prototype Learning (MSPL) method to promote discriminative features and representative prototypes by modeling invariance in samples and sieving out incorrectness in labels. More specifically, a multi-head Siamese network structure is constructed, where prototype learning with the multi-consistency constraint is performed to improve the resilience of the model to noise. Under this regime, adversarial contrastive learning is performed to train the model with the dynamically generated vicious adversarial examples, further enhancing the invariant predictive ability against data noise. At the same time, to deal with label noise, an effective multi-granularity sample selection strategy is designed to filter out noisy labels by measuring the error distribution in both global and local (i.e., class-specific) perspectives. Semi-supervised learning is accordingly conducted to train the model with the resulting labelled data (i.e., data with clean labels) and unlabelled data (i.e., data with noisy labels). Extensive experiments on benchmarks demonstrate the effectiveness of the proposed method in the extremely noisy learning environment.

Rethinking Parking Slot Detection with Rotated Bounding Box

  • Shengli Zhang
  • Shikui Wei
  • Shiyin Zhang
  • Sen Xu
  • Weiyan Xu
  • Yao Zhao

Parking slot detection is an essential yet challenging task in the field of self-driving perception. During parking, vehicles often block part of the parking slots, which occludes the corners. In addition, due to the impact of the external environment, the corners of the parking slot may be blurred. Existing parking slot detection algorithms based on parking slot markings are sensitive to the corners of the parking slots, which makes it difficult to cope with the above scenario. To address this problem, we propose a parking slot entrance line detection algorithm called RPSED, which is the first to apply rotating object detection to the parking slot entrance line. RPSED takes a different route from traditional corner detection methods by focusing on the entrance lines of parking slots to grasp the intricate geometric details inherent to parking slots, which solves the problem that existing parking slot detection algorithms cannot detect parking slots with blurred corners. To further improve the precision and recall of the model and make the model more generalizable, we propose a model ensemble strategy to match and select the results of multiple models. Moreover, we propose two manually optimized parking slot datasets, RPS2.0 and RPSV, which add more annotations with obstructed corners or obscured configurations to the datasets ps2.0 and psv, making the model evaluation more reasonable and realistic. Experimental results on the RPS2.0 and RPSV benchmarks demonstrate the superiority of our approach compared to existing state-of-the-art methods.
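
A rotated bounding box is often parameterized by its center, width, height, and rotation angle. The sketch below converts such a box into its four corner points, which is the kind of geometry a rotated detector for entrance lines would rely on. The `(cx, cy, w, h, theta)` encoding and the function name are assumptions for illustration; the abstract does not state which box encoding RPSED uses.

```python
import math


def rotated_box_corners(cx, cy, w, h, theta):
    """Return the four corners of a box rotated by theta radians.

    Corners are listed starting from the (-w/2, -h/2) corner of the
    unrotated box and proceeding around the box. Rotation is
    counter-clockwise in a standard x-right, y-up coordinate system.
    """
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate the offset, then translate to the box center.
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners
```

With `theta = 0` the result is an ordinary axis-aligned box, so the rotated form strictly generalizes the usual one.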

Generic Attention-model Explainability by Weighted Relevance Accumulation

  • Yiming Huang
  • Aozhe Jia
  • Xiaodan Zhang
  • Jiawei Zhang

Attention-based Transformer models have achieved remarkable progress in multi-modal tasks, such as visual question answering. The explainability of attention-based methods has recently attracted wide interest as it can explain the inner changes of attention tokens by accumulating relevancy across attention layers. Current methods simply update relevancy by equally accumulating the token relevancy before and after the attention processes. However, the importance of token values is usually different during relevance accumulation. In this paper, we propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks through a CLIP encoder followed by a mapper. CLIPmapper consists of self-attention, cross-attention, single-modality, and cross-modality attention; thus, it is more suitable for evaluating our generic explainability method. Extensive perturbation tests on visual question answering and image captioning tasks validate that our explainability method outperforms existing methods.

Geometric Style Transfer for Face Portraits

  • Miaomiao Dai
  • Hao Yin
  • Ran Yi
  • Lizhuang Ma

Geometric style transfer jointly stylizes the texture and geometry of a content image to better match a style image, which has attracted widespread attention due to its various applications. However, existing style transfer methods either primarily focus on texture and almost entirely ignore geometry, or have various drawbacks and are not suitable for face portraits. In this paper, we propose a new two-stage geometric style transfer method dedicated to face portraits, which simultaneously transfers both statistical and structural styles. Our network consists of a Geometric deformation module (G) and a Texture rendering module (T). G is trained with semantics image pairs, which has loose requirements on the training datasets. Besides, our flexible formulation also allows explicit user guidance and control of stylization tradeoffs. Experiments demonstrate that our method achieves state-of-the-art geometric style transfer for face portraits.

Optical Flow based Feature Prediction and Decomposed Context for Video Compression

  • Huashan Sun
  • Qian Huang
  • Yiming Wang
  • Xiaotong Guo
  • Ruoyu Hao

In recent years, there has been a growing interest in developing end-to-end neural video codecs. Previous works generally use a past decoded frame as reference directly, utilizing the motion information between it and the input frame to reduce temporal redundancy. However, this approach may lead to high bit rate consumption for the motion and fails to take advantage of the prior information in other reconstructed frames. In this work, we propose a learned video coding framework with an optical flow based feature prediction module and a decomposed context module. Specifically, we employ the previous optical flow to generate a warped frame, which is used along with other reconstructions for more accurate reference forecasting, thereby reducing the bit rate required for motion compression. Moreover, based on the conditional coding framework, our decomposed context module explores conditional context in past decoded frames and further reduces additional spatiotemporal correlations. Experimental results demonstrate that our approach yields better performance than previous learned video compression methods and traditional standard codecs. For example, our neural codec achieves 28.94% coding gain over HEVC in PSNR metric and about 2.00% coding gain over VVC in MS-SSIM metric.

ADNet: An Asymmetric Dual-Stream Network for RGB-T Salient Object Detection

  • Yaqun Fang
  • Ruichao Hou
  • Jia Bei
  • Tongwei Ren
  • Gangshan Wu

RGB-Thermal salient object detection (RGB-T SOD) aims to locate salient objects in images that include both RGB and thermal information. Previous approaches often suggest designing a symmetric network structure to tackle the challenge of dealing with low-quality RGB or thermal images. However, we contend that RGB and thermal modalities possess different numbers of channels and disparities in information density. In this paper, we propose a novel asymmetric dual-stream network (ADNet). Specifically, we leverage an asymmetric backbone to extract four stages of RGB features and four stages of thermal features. To enable effective interaction among low-level features in the first two stages, we introduce the Channel-Spatial Interaction (CSI) module. In the last two stages, deep features are enhanced using the Self-Attention Enhancement (SAE) module. Experimental results on the VT5000, VT1000, and VT821 datasets attest to the superior performance of our proposed ADNet compared to state-of-the-art methods.

RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network

  • Boyue Xu
  • Yi Xu
  • Ruichao Hou
  • Jia Bei
  • Tongwei Ren
  • Gangshan Wu

The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD’s capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.

Dual-domain Feature Learning and Cross Dimension Interaction Attention for Nighttime Image Dehazing

  • Yun Liang
  • Shijie Peng
  • Xinjie Xiao
  • Lianghui Li

Nighttime image dehazing is critical for many computer applications. Directly transferring daytime dehazing models to nighttime scenes often introduces haze residual, detail loss, and color distortion because of the uneven illumination from artificial lights. Therefore, we propose a nighttime dehazing method by defining the Dual-domain Feature Learning Module (DFLM) and the Feature Optimization Module (FOM). Firstly, we construct the DFLM in both frequency and spatial domains to accurately predict the image degradation caused by haze and remove most haze in nighttime hazy images. Secondly, to address the challenges of uneven illumination distribution and color interference of light sources at nighttime, we construct the FOM based on the proposed Cross Dimension Interaction Attention (CDIA), which captures the feature dependencies by crossing different dimensions including the channel-channel, height-channel and width-channel. By precisely representing illumination and color features, the FOM alleviates color distortion in nighttime dehazing. Extensive experiments on several synthetic and real-world datasets demonstrate that our method outperforms most state-of-the-art methods. Code will be available.

Improve Singing Quality Prediction Using Self-supervised Transfer Learning and Human Perception Feedback

  • Ping-Chen Chan
  • Po-Wei Chen
  • Von-Wun Soo

The scarcity of expansive datasets for singing quality assessment makes the utilization of complex deep learning methods a considerable challenge. This research presents a method to improve singing quality prediction based on feedback from subjective human perception, learned through transfer learning from self-supervised learning (SSL) speech models. In combination with the CRNN_PH model as the baseline model, the SSL models are integrated into two distinct major architectures: one directly draws features from the pre-trained SSL model (CRNN_PH+SSL), and the other employs the weighted sum (WS) of the output features from different transformer blocks in the SSL model (CRNN_PH+SSL_WS). We conducted comparative experiments on pre-trained SSL models, five on wav2vec 2.0 (W2V2) and two on HuBERT, which were trained over various datasets. It turns out that CRNN_PH+W2V2_base_WS improves the most on singing quality score prediction, aligning closely with subjective human perception in terms of correlation coefficients and MSE with respect to the ground truth.
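
The weighted-sum (WS) variant can be pictured as learning one scalar per transformer block and mixing the block outputs with those scalars. The sketch below assumes the scalars are normalized with a softmax, a common choice for layer-wise weighting of SSL features; the abstract does not say how the weights are normalized, so this detail and the function names are assumptions.

```python
import math


def softmax(raw):
    """Numerically stable softmax over a list of floats."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, raw_weights):
    """Mix per-block feature vectors into one vector.

    layer_outputs: list of feature vectors, one per transformer block,
                   all of the same length.
    raw_weights:   one learnable scalar per block (before softmax).
    """
    weights = softmax(raw_weights)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[j] for w, layer in zip(weights, layer_outputs))
        for j in range(dim)
    ]
```

With equal raw weights the result is the plain average of the blocks; training can then shift weight toward the blocks most useful for the task.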

End-to-End Variable-Rate Image Compression with Bi-Resolution Spatial-Channel Context Aggregation

  • Xiaotong Guo
  • Qian Huang
  • Yiming Wang
  • Huashan Sun

Recently, neural network-based image compression techniques have demonstrated remarkable compression performance. The use of context-adaptive entropy models greatly enhances the rate-distortion (R-D) performance by effectively capturing spatial redundancy in latent representations. However, latent representations still contain some spatial correlations (e.g., the same spatial structure), which need to be eliminated by further processing. Moreover, many compression models are single-rate models, which makes it difficult to cover a wide range of bitrates. To address these issues, we propose a novel variable-rate image compression algorithm that efficiently leverages bi-resolution spatial-channel information through learned mechanisms. In this paper, we first propose a BRP network to divide our latent representations and side information into HR and LR components, eliminating the spatial redundancy at the same location. Combining the spatial-channel context, we propose a BSC context model, including a decreasing-granularity checkerboard pattern and channel grouping based on a cosine slicing strategy. To cover a wide range of bitrates, we take a weight map as input to control bit allocation, achieving multiple compression rates. Our experimental results show that our method provides a better rate-distortion trade-off than BPG, JPEG and other recent image compression methods based on deep learning.

SASSM: Semantic Awareness and Self-Support Matching for Semi-Supervised Video Object Segmentation

  • Yun Liang
  • Ming Junhui
  • Jintu Zheng

Matching-based methods have become popular in semi-supervised video object segmentation (VOS), maintaining a memory bank to predict object masks. However, these methods encounter challenges with fast motions and appearance changes, resulting in blurred predictions and missing boundaries. To address this, we introduce an innovative network that exploits the self-feature of the query frame to improve mask prediction. We propose a semantic-aware branch (SAB) for precise semantic guidance during readout decoding and an enhanced feature memory matching module with a self-support matching (SSM) mechanism. Ablations demonstrate the strong collaboration between the semantic-aware branch and the self-support matching mechanism. Our approach achieves favourable performance on popular datasets, reaching 86.3 J&F at 26 FPS on the DAVIS 2017 validation set. Code will be available.

GTTrack: Gaussian Transformer Tracker for Visual Tracking

  • Yun Liang
  • Fumian Long
  • Qiaoqiao Li
  • Dong Wang

Recently, Transformer-based visual object tracking methods have achieved impressive advancements and significantly improved tracking performance. In these methods, the Transformer comprises two modules: self-attention and cross-attention. However, this design raises two problems. First, self-attention only considers the relative relation between elements when establishing global association, and thus cannot highlight the essential areas of the tracked target. Second, cross-attention relies only on feature similarity to locate the target, which makes interference from similar objects challenging. In this paper, we propose a new Transformer tracking method, GTTrack, by defining Gaussian Attention (GA) and an Adaptive Focusing Module (AFM). The GA introduces a Gaussian prior to generate a semantic template with robust object features, in which the Gaussian prior pays more attention to the central region of the tracked target. The AFM calculates the similarity between the current frame and the template by combining the appearance features and position features. The position features are defined with an adaptive Gaussian prior according to the target area in the previous frame. The introduction of position features enhances the contrast between the tracked target and similar objects. Extensive experiments also demonstrate that GTTrack outperforms many state-of-the-art trackers and achieves leading performance. Code will be available.

MontageNet: Annotated Dataset of Furniture Components in Real-World Images

  • Iuan Kai Fang
  • Bo Hao Zhang
  • Te Lun Liu
  • Hao Tan
  • Wei Syun Chen
  • Che-Rung Lee

Indoor understanding is currently a widely studied topic in the field of machine learning. Furniture is the most common object in indoor scenes, just as various vehicles are most commonly seen in street scenes. Any object is made up of a combination of functional components. Functional component dismantling and reassembly is an important development for industrial manufacturing to improve efficiency and reduce costs. In the context of understanding indoor scenes, we focus on building a dataset of real-world furniture images with part labels. Building a large-scale furniture dataset is very challenging. First, existing datasets contain too few real images of furniture and consist mostly of 3D model images, yet the diversity of furniture in the real world far exceeds that of 3D models, and real images help improve the calculation speed of model training. Second, most published furniture data is poorly aligned with its annotation data, and even fewer resources provide component segmentation labels on real furniture images. MontageNet has become a rich resource for part-level 3D shape analysis, semantic understanding, instance segmentation, 3D reconstruction, and other research. Our accompanying empirical studies provide an in-depth analysis of dataset characteristics and performance evaluation of several state-of-the-art methods against our benchmarks.

RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization

  • Avinash Anand
  • Raj Jaiswal
  • Mohit Gupta
  • Siddhesh S Bangar
  • Pijush Bhuyan
  • Naman Lal
  • Rajeev Singh
  • Ritika Jha
  • Rajiv Ratn Shah
  • Shin'Ichi Satoh

Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these models function. To solve this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adjust the model to the target domain. In this research, we introduce a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting spatial positions, ranges, and types of layout elements. The primary aim of this endeavor is to develop a versatile dataset capable of training models with robustness and adaptability to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the DocLayNet dataset. Our findings emphasize that models enriched with our dataset are well suited to such tasks, achieving mAP95 scores of 0.398 and 0.588 in the scientific document domain for the TABLE class.

Multi-Scale Superpoint Network for 3D Point Cloud Semantic Segmentation

  • Ft Zheng
  • Le Hui
  • Jin Xie
  • Haofeng Zhang

3D point cloud semantic segmentation is a fundamental task for 3D scene understanding. However, most existing pipelines use k-NN or ball query operations to form hard neighborhoods, which may cross different semantic objects, resulting in low-quality local features. To address this issue, we propose a multi-scale superpoint network that gradually generates multi-scale soft neighborhoods to extract geometric local features, thereby boosting 3D semantic segmentation performance. Specifically, we present a simple yet efficient superpoint merging module that merges small-scale superpoints into large-scale superpoints by considering the feature similarity of superpoints, so that we can obtain multi-scale geometric features of point clouds. We also develop a superpoint upsampling module that adopts an inverse mapping function to propagate multi-scale features from the low-resolution point cloud to the high-resolution point cloud. By integrating our multi-scale superpoint network into a simple point-based semantic segmentation network, our method obtains SOTA results on S3DIS Area 5 and 6-fold, and competitive results on ScanNet v2.

Vision-Language Navigation for Quadcopters with Conditional Transformer and Prompt-based Text Rephraser

  • Zhe Chen
  • Jiyi Li
  • Fumiyo Fukumoto
  • Peng Liu
  • Yoshimi Suzuki

Controlling drones with natural language instructions is an important topic in Vision-and-Language Navigation (VLN). However, previous models cannot effectively guide drones by integrating multimodal features, as few of them exploit the correlations between instructions and the environmental contexts or consider the model’s capacity to understand natural language. Therefore, we propose a novel language-enhanced cross-modal model that has a conditional Transformer to effectively integrate the multimodal features, i.e., the textual instructions and visual contexts. To enhance the ability of language representation, we also employ SentenceBERT. In addition, to address the issue that users could provide various textual instructions even for the same navigation task, we propose a prompt-based approach by introducing an LLM-based intermediary component (LLMIR) for rephrasing users’ instructions. We evaluate our approaches with a quadcopter simulator. Our model improves the absolute task completion rate by 1.39%. To evaluate LLMIR, we create a new test set by extracting the essential and minimal instructions from the original test set. By using the LLM, the task completion rate improves by 1.51%. It also narrows the performance gap between the new and original test sets by 34.83%.

Confidence-guided Boundary Adaption Network for Multimodal Fake News Detection

  • Lin Jiajie
  • Zhuopan Yang
  • Zhenguo Yang
  • Xiaoping Li
  • Fu Lee Wang
  • Wenyin Liu

Social media allows the public to access information conveniently, but eye-catching false messages may also spread quickly on it. In this paper, we propose a two-stage confidence-guided boundary adaption (CBA) network, consisting of a feature preprocessing (FP) module, a biased ambiguity learning (BA) module and a confidence-guided boundary adaptation (CG) module. In the first stage, the FP module obtains the textual and visual features, which are fused by computing visual-to-textual and textual-to-visual correlation coefficients with an attention mechanism. Furthermore, BA evaluates the distribution distance between fused features and single modalities to determine the weights between modalities, capturing the semantics of the key modality. In the second stage, CG leverages samples from the low-confidence interval to generate new instances using a mixup of augmentation techniques, aiming to occupy the decision space and optimize the decision boundary of the classifier. Extensive experiments on two public datasets show that our CBA model is 1.6% and 2.6% higher than the state-of-the-art methods.

Open-Vocabulary Segmentation Approach for Transformer-Based Food Nutrient Estimation

  • Satayu Parinayok
  • Yoko Yamakata
  • Kiyoharu Aizawa

Nutrition plays a vital role in overall health and well-being. With a highly accurate nutrient estimation model, we develop a tool that displays nutritional values from food images, thereby reducing the labor-intensiveness of dietary assessment. We propose a method that uses depth data with RGB images and incorporates an open-vocabulary segmentation process that separates food from non-food instances, coupled with a two-stage self-attention Transformer decoder. Our model outperforms the current state-of-the-art method, with an average percent MAE of 17.2% on Nutrition5k, an RGB-D food image dataset with calories, mass, and three macronutrients annotated. Our study also focuses on the significance of the food and background regions for calorie, mass, and nutrient estimation. We analyze the impact of non-food regions on each estimation task, with results suggesting that background information is crucial for calorie, mass, and carbohydrate estimation but not as essential for protein and fat estimation. The qualitative results also show that the model attends to regions with a high corresponding nutritional value. Implementation code and pre-trained models are provided at

Improving Class Representation for Zero-Shot Action Recognition

  • Lijuan Zhou
  • Jianing Mao

Zero-Shot Action Recognition (ZSAR) enables models to infer new action classes from previously seen data without any samples of those new classes. How an action class is represented in an understandable and processable format influences the performance in ZSAR. Semantic representations of action classes have been made in various forms, such as attributes, class labels, and text descriptions, while in video recognition, the action classes can also have visual representations in the form of images. This paper proposes a novel method that improves class representation for ZSAR. On the one hand, to improve the collection and quality of text descriptions, this paper uses ChatGPT to generate descriptions and designs conversation-based text prompts that can quickly obtain high-quality descriptions of many actions. On the other hand, to overcome the ambiguity of single-modal class representation, we propose the Image-based Description Refinement (IDR) method to obtain multimodal class representation. Specifically, action classes are represented by relevant images from the web and descriptions, and action videos are represented by spatio-temporal features and extracted objects. By training on the seen set to learn the mapping of multimodal representations for classes and videos, we can infer video classes on the unseen set from the similarity of the mapped representations. Experiments on two popular benchmarks and two elderly daily activity datasets show the effectiveness of our method. In particular, it achieves a significant improvement when fewer video samples are available.

Independent and Collaborative Demosaicking Neural Networks

  • Yan Niu
  • Lixue Zhang
  • Chenlai Li

Existing demosaicking neural models generally reconstruct the red, green and blue channels with one unified network. This has the advantage of allowing the RGB channels to share latent features. However, it has gone unnoticed that the training samples are severely unbalanced among the three channels. As a consequence, a unified model trained to fit the red or blue sample distributions is prone to under-fit the green samples. In this paper, we demonstrate the existence of such conflicts, and analyze the reason behind this phenomenon. To overcome this disadvantage, we decouple the traditional three-in-one demosaicking into three independent but collaborative sub-tasks, respecting the distribution of samples in each spectrum. We show that, with our decoupling strategy, a very simple architecture can achieve high accuracy with high efficiency. For substantiation, we construct a model that has merely 1.3M parameters in total, and compare its accuracy to state-of-the-art methods of similar or larger model size, across a wide variety of benchmark datasets. The proposed independent and collaborative architecture gains performance improvement over comparable models in both accuracy and inference time.

Learning a Robust Model with Pseudo Boundaries for Noisy Temporal Action Localization

  • Xinyi Yuan
  • Liansheng Zhuang

Temporal Action Localization (TAL) aims to locate starting and ending times of actions and recognize categories in untrimmed videos. Significant progress has been made in developing deep models for TAL. The success of previous methods relies on large-scale training data with precise boundary annotations. However, fully accurate annotations are impractical to obtain due to the ambiguities of the action boundaries and the crowd-sourcing labeling process, leading to a degradation in performance. In this work, we take the first step into learning with inaccurate boundaries in TAL tasks. Motivated by the fact that inaccurate boundary annotations harm localization precision more than classification accuracy, we propose to use classification as a guidance signal to improve localization precision. Specifically, we introduce a pseudo-boundary generation and refinement method (PbGaR). PbGaR first treats each action segment as a bag of instances to select the instances with more accurate boundaries for training. Then these boundaries are refined via two strategies for higher quality. The proposed method significantly alleviates the degraded performance of TAL models under inaccurate boundaries. Extensive experiments on two popular datasets demonstrate the effectiveness of our method.

Monocular 3D Pose Estimation of Very Small Airplane in the Air

  • Sung Kwon On
  • Songhyon Kim
  • Kwangjin Yang
  • Younggun Lee

In this paper, a novel pose estimation algorithm is proposed specifically for maneuvering airplanes in the air. The algorithm consists of two main stages. The first stage involves semantic segmentation of a monocular input image of a flying airplane, where the entire captured area serves as feature points for the airplane, which is typically small in the image. The second stage focuses on the 3D pose estimation of the segmented image using projective registration. Since airplanes have unique characteristics and there is a scarcity of airplane-specific datasets, a custom dataset is generated for the experiments. Unreal Engine 4, a 3D computer graphics game engine renowned for its realistic simulations, is employed for this purpose. Experimental results demonstrate the suitability of the algorithm for 3D pose estimation of airplanes, providing valuable information for studying autonomous control of airplanes.

Cross-Modal Retrieval for Motion and Text via DropTriple Loss

  • Sheng Yan
  • Yang Liu
  • Haoqiang Wang
  • Xin Du
  • Mengyuan Liu
  • Hong Liu

Cross-modal retrieval of image-text and video-text is a prominent research area in computer vision and natural language processing. However, there has been insufficient attention given to cross-modal retrieval between human motion and text, despite its wide-ranging applicability. To address this gap, we utilize a concise yet effective dual-unimodal transformer encoder for tackling this task. Recognizing that overlapping atomic actions in different human motion sequences can lead to semantic conflicts between samples, we explore a novel triplet loss function called DropTriple Loss. This loss function discards false negative samples from the negative sample set and focuses on mining remaining genuinely hard negative samples for triplet training, thereby reducing violations they cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both based on R@10). The source code for our approach is publicly available at
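The idea of discarding false negatives before mining hard negatives can be illustrated with a minimal sketch. The abstract does not say how false negatives are identified, so the criterion used here (a threshold on text-to-text similarity), the function name `drop_triple_loss`, and the parameters `margin` and `drop_threshold` are all assumptions made for illustration, not the authors' formulation. Only the motion-to-text direction is shown.

```python
def drop_triple_loss(sim, text_sim, margin=0.2, drop_threshold=0.9):
    """Hypothetical triplet loss with false-negative dropping.

    sim[i][j]      -- similarity between motion i and text j
                      (matched pairs lie on the diagonal).
    text_sim[i][j] -- similarity between text i and text j; ASSUMED here
                      to be the signal that flags false negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Drop negatives whose text is nearly identical to the anchor's
        # own text: they are likely false negatives.
        candidates = [
            sim[i][j]
            for j in range(n)
            if j != i and text_sim[i][j] < drop_threshold
        ]
        if not candidates:
            continue
        # Mine the hardest remaining negative for the triplet term.
        hardest = max(candidates)
        total += max(0.0, margin + hardest - positive)
    return total / n


# Toy batch: texts 0 and 1 describe almost the same motion.
SIM = [[0.9, 0.85, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9]]
TEXT_SIM = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]

with_drop = drop_triple_loss(SIM, TEXT_SIM)
without_drop = drop_triple_loss(SIM, TEXT_SIM, drop_threshold=2.0)
```

In this toy batch, the near-duplicate pair would otherwise be penalized as a hard negative; dropping it removes that spurious violation.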

An Evaluation of Decentralized Group Formation Techniques for Flying Light Specks

  • Hamed Alimohammadzadeh
  • Heather Culbertson
  • Shahram Ghandeharizadeh

Group formation is fundamental for 3D displays that use Flying Light Specks (FLSs) to illuminate shapes and provide haptic interactions. An FLS is a drone with light sources that illuminates a shape. Groups of G FLSs may implement reliability techniques to tolerate FLS failures, provide kinesthetic haptic feedback in response to a user’s touch, and facilitate a divide-and-conquer approach to challenges such as localizing FLSs to render a shape. This paper evaluates four decentralized techniques to form groups. An FLS implements a technique autonomously using asynchronous communication and without a global clock. We evaluate these techniques using synthetic point clouds with known optimal solutions and real point clouds. The results show that a technique named Random Subset (RS) is superior when constructing small groups (G ≤ 5), while a different technique named Closest Available Neighbor First (CANF) is superior when constructing large groups (G ≥ 10).

SESSION: Short Papers

Learning Surface-awareness Network for X-Ray Prohibited Item Detection

  • Ying Shen
  • Wei Li
  • Zhaoquan Yuan
  • Xiao Wu

X-ray image security detection is a crucial method used to identify various types of prohibited items in luggage. However, the unique characteristics of X-ray imaging can result in the loss of intricate surface details, leading to subpar detection of prohibited items within X-ray images. In this paper, a Surface-aware Prohibited Item X-ray Detection Network (SPIXDet) is proposed to address this issue. It incorporates two key components: the Boundary Aggregation Module (BAM) and the Global Cross-Feature Downsampling layer (GCFD). The BAM module effectively mines image edge information while minimizing the number of parameters involved. Meanwhile, the GCFD module is introduced to mitigate chaotic interference caused by undifferentiated boundary boosting. The surface-aware capability of the model can be enhanced through the BAM and GCFD modules. Furthermore, the Focal-SIoU loss function is introduced to increase positioning accuracy and optimize the model training process. To validate the effectiveness of our model, extensive experiments are conducted on the SIXray100 dataset, and the results demonstrate the advantages of SPIXDet compared to other X-ray prohibited item detection methods.

An Efficient CNN-based Prediction for Reversible Data Hiding

  • Mingjin Wu
  • Shijun Xiang

In the field of reversible data hiding (RDH), how to design an efficient image prediction method is an enduring research topic. In this paper, we propose a new CNN-based predictor consisting of an efficient image division strategy and a well-designed prediction network. The image division strategy optimizes the distribution of pixels belonging to different sets, which increases the number of available adjacent pixels in the image prediction. In addition, with the utilization of the well-designed compensation module, the prediction network performs better and consumes less memory. The experimental results demonstrate that our approach achieves better prediction performance compared with the existing predictors. Furthermore, we have developed an RDH algorithm by combining the proposed CNN-based predictor with the location-based pixel value ordering (LPVO) embedding strategy. This RDH algorithm outperforms the state-of-the-art predictor-based RDH algorithm in embedding performance.

Developing a VR-based contextualized language learning system to Enhance Junior High School Students' Pragmatic Competence

  • Kuo-Yu Liu
  • Yuanshan Chen
  • Ming-Fang Lin
  • Li-Jung Daphne Huang
  • Cheah Ping Xiang

This paper addresses the significance of pragmatic competence development for language learners in a globalized world. To improve language learning experiences for Taiwanese junior high school students, a VR-based immersive English pragmatics learning game was developed. The game focuses on expressions of requests, apologies, and compliments, aligning with the junior high school English textbooks. Initial framework completion allowed for a SUS (System Usability Scale) survey with 15 urban and 15 rural students, evaluating the system's usability. The survey showed an average SUS score of 74, indicating user satisfaction while also revealing areas for improvement. Future work involves optimizing the system and conducting practical research to assess the game's effectiveness in enhancing students' English language learning outcomes.
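For readers unfamiliar with how a SUS score such as the reported average of 74 is obtained, the standard scoring rule can be sketched as follows. The questionnaire has ten items rated 1 to 5; odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a value from 0 to 100. The function names and the sample responses below are illustrative only and are not data from the study.

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire (ten ratings, each 1-5)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1      # positively worded (odd) items
        else:
            total += 5 - rating      # negatively worded (even) items
    return total * 2.5               # rescale 0-40 to 0-100


def mean_sus(all_responses):
    """Average SUS score over several participants."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)


# Made-up responses from two hypothetical participants.
example = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
average = mean_sus(example)
```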

Reducing Objective Difficulty Without Influencing Subjective Difficulty in a Video Game

  • Shunta Sakaue
  • Taiju Kimura
  • Hiroki NISHINO

While dynamic difficulty adjustment techniques can enhance player engagement, overly apparent difficulty adjustments may have a detrimental effect on the overall gaming experience. Hence, it is desirable to make game difficulty controls as unnoticeable as possible. We frame this issue as a research question: is it possible to reduce the objective difficulty of a game without influencing the subjective difficulty that players perceive? We propose two techniques for dynamically adjusting game difficulty in a way that is imperceptible to players: Collision Detection Area Adjustment and Time Elapse Manipulation. We integrated these two techniques into a simple shoot ’em up game and evaluated their effectiveness through a user study. The results support the claim that these difficulty adjustment techniques can reduce the objective difficulty of a game without influencing the subjective difficulty that players perceive. This study could be beneficial, not just as a rare example of difficulty adjustment research that distinguishes objective difficulty from subjective difficulty, but also as a practical technique for difficulty adjustment that can avoid the negative impact on the gaming experience.

Multi-region CNN-Transformer for Micro-gesture Recognition in Face and Upper Body

  • Keita Suzuki
  • Satoshi Suzuki
  • Ryo Masumura
  • Atsushi Ando
  • Naoki Makishima

This paper presents a novel task: recognizing unintentional micro-gestures (UMGs), which are movements made by people unconsciously and unintentionally, from a video. Recognizing UMGs is crucial because they reveal a person’s underlying psychological state. Since a UMG is composed of subtle sequential movements, the recognition model must be able to capture accurate information in both the spatial and temporal directions. Therefore, we utilize a convolutional neural network (CNN) to capture information in the spatial direction and a Transformer to merge the features extracted by the CNN in the temporal direction. However, this model often misrecognizes UMGs because it cannot capture slight differences in movements, such as in the face and mouth regions. To address this issue, we propose a novel model for UMG recognition, the Multi-Region CNN-Transformer model, which takes cropped videos from multiple upper body regions as simultaneous inputs. The key advance of our method is to capture subtle changes in regions such as the upper body, face, head, and mouth for recognizing UMGs. We demonstrate the effectiveness of the proposed method through experiments using our newly created UMG dataset for this task.

VQ-VDM: Video Diffusion Models with 3D VQGAN

  • Ryota Kaji
  • Keiji Yanai

In recent years, deep generative models have achieved impressive performance, such as generating images that are indistinguishable from real images. In particular, Latent Diffusion Models, one family of image generation models, have had a significant impact on society. As a result, video generation is attracting attention as the next modality. However, video generation is more challenging than image generation because temporal consistency must be considered and computational complexity increases, since a video is a sequence of multiple frames. In this study, we propose a video generation model based on diffusion models employing 3D VQGAN, which is called VQ-VDM. The proposed model is about nine times faster than the Video Diffusion Models, which directly generate videos, since our model generates a latent representation that is decoded into a video by a VQGAN decoder. Moreover, our model can generate higher quality video than prior video generation methods, excluding the state-of-the-art method.

Facial Parameter Splicing: A Novel Approach to Efficient Talking Face Generation

  • Xianhao Chen
  • Kuan Chen
  • Yuzhe Mao
  • Linna Zhou
  • Weike You

In recent years, generating talking faces has become a popular research area due to its applications in various fields. However, most current models have high computational demands, which limits their practicality. To address this issue, some researchers have developed phoneme-face indexes to generate talking videos quickly and efficiently. However, when the training video is too short, it is not possible to create mappings for all phonemes. To overcome this limitation, we introduce a large-scale phoneme-face dictionary to complete the feature mapping, design a novel method for fast phoneme-face index search, and train a generative adversarial network (GAN) to generate video from phoneme-face sequences. Based on the large-scale dictionary and fast search algorithm, our proposed method can complete the phoneme-face mapping using less than 10 seconds of training video of the target person, and it reduces the preprocessing and training time for talking video generation.

Few-Shot Learning for Word Recognition in Handwritten Seventeenth-Century Spanish American Notary Records

  • Nouf Alrasheed
  • Shraboni Sarker
  • Viviana Grieco
  • Praveen Rao

Historical records are invaluable sources of information that provide insights into multiple aspects of past events and societies. The analysis of historical records using deep learning poses critical challenges such as the lack of sufficient labeled data and at times the poor quality of scanned images. In this paper, we propose SpanishFSL, a few-shot learning (FSL) approach for word recognition in 17th-century handwritten Spanish American notary records. SpanishFSL draws inspiration from a zero-shot learning approach developed for image classification. It leverages an autoencoder to construct class-attribute signatures to effectively bridge the gap between seen and unseen classes. This enables SpanishFSL to generalize and accurately recognize words not present in the training set. Our labeled dataset was prepared by paleography experts using a subset of the notary records drafted by two notaries. Through experimental evaluation, we observed that SpanishFSL can outperform other FSL classifiers in terms of word recognition accuracy.

Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

  • Xiaojiao Chen
  • Sheng Li
  • Jiyi Li
  • Hao Huang
  • Yang Cao
  • Liang He

Current speaker anonymization methods, especially those using self-supervised learning (SSL) models, require massive computational resources when hiding speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent end-to-end model reprogramming technology. To improve the anonymization performance, we first extract the speaker representation from large SSL models as the speaker identity. To hide the speaker’s identity, we reprogram the speaker representation by adapting the speaker to a pseudo domain. Extensive experiments are carried out on the VoicePrivacy Challenge (VPC) 2022 datasets to demonstrate the effectiveness of our proposed parameter-efficient learning anonymization methods. Additionally, while achieving comparable performance with the VPC 2022 strong baseline 1.b, our approach also consumes fewer computational resources during anonymization.

GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System

  • Xiaojiao Chen
  • Sheng Li
  • Jiyi Li
  • Yang Cao
  • Hao Huang
  • Liang He

Speaker adaptation systems face privacy concerns, because such systems are trained on private datasets and often overfit. This paper demonstrates that an attacker can extract speaker information by querying speaker-adapted speech recognition (ASR) systems. We focus on the speaker information of a transformer-based ASR and propose GhostVec, a simple and efficient attack method to extract the speaker information from an encoder-decoder-based ASR system without any external speaker verification system or natural human voice as a reference. To make our results quantitative, we pre-process GhostVec using singular value decomposition (SVD) and synthesize it into a waveform. Experimental results show that the synthesized audio of GhostVec reaches 10.83% EER and 0.47 minDCF with target speakers, which suggests the effectiveness of the proposed method. We hope the preliminary discovery in this study will catalyze future speech recognition research on privacy-preserving topics.

Towards Digital Twin of Crops for Growth Modelling using Virtual Reality

  • Karanvir Singh
  • Mukesh Saini

A major problem that a farmer faces while adopting a new crop variety on the farm is the uncertainty associated with its growth. Farmers working on real farms are not aware of the growth models, even for the existing crops. Hence, there is a need for more accessible and intuitive models. This work is a step towards the realization of another promising model, which is the digital twin of a crop. A primary requirement of the digital twin is the digital representation of the crop itself. Extending that notion, the work discusses the development of 3D assets of crops and their temporal alignment. It also describes the methodology involved in the development of a VR framework, which stores the ideal growth of a crop. This framework could be useful to farmers who want to confirm the growth of their crops. Furthermore, it also proposes a quantitative metric to evaluate the VR framework. The consistency of this proposed metric is further backed by a user study which is based on a qualitative method.

Exploring User-oriented Social Recommendation System through Granting Users Control over a Social Group

  • Jeonguk Hong
  • Gyewon Jeon
  • Sangwon Lee

The limitations of accuracy-focused recommendation systems in improving user experiences have become apparent since user preferences change over time. Despite efforts to solve this issue through the examination of social information (e.g., relational data pertaining to users), capturing temporal user preferences remains a challenge. This study proposes a novel interaction method for integrating temporal user preferences into a social recommendation system. Users can highlight their preferred users within their social interactions. Through this interaction, preferences-integrated social information is incorporated into recommendations. We conducted a user test to validate the value of our proposed interaction method and found that the proposed approach had a positive impact on users’ subjective responses. The implications showed practical and theoretical improvements in user-centered recommendations.

Music-Graph2Vec: An Efficient Method for Embedding Pitch Segment

  • Taiwei Wu
  • Jianhao Zhang
  • Lian Duan
  • Yuanzhe Cai

Learning low-dimensional continuous vector representations for short pitch segments extracted from songs has been confirmed to capture tonal features of music. This is key to melody modeling, which can be used in many music investigations, such as genre classification, emotion classification, and music retrieval. The skip-gram version of Word2Vec is a ubiquitous and widely used approach for music pitch segment embedding, but it scales poorly to large data sets due to its extremely long training time. In this paper, we propose a novel, efficient graph-based embedding method, named Music-Graph2Vec, to tackle this concern. This approach converts music files into graphs, extracts the rhythmic sequence through random walking, and trains the rhythmic embedding model using skip-gram. Experimental results demonstrate that Music-Graph2Vec outperforms Word2Vec in training rhythmic embedding: it is 55 times faster on the top-MAGD dataset (2,134.7s for Word2Vec and 38.9s for Music-Graph2Vec) while achieving the same accuracy as Word2Vec in music genre classification.
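
The graph-then-random-walk idea can be illustrated with a minimal sketch. The abstract does not specify how Music-Graph2Vec builds its graphs, so everything below is an assumption made for illustration: nodes are MIDI pitch numbers, edge weights count pitch-to-pitch transitions, and the function names `build_transition_graph`, `random_walk`, and `generate_walks` are hypothetical. The resulting walks would be the input to a skip-gram trainer, which is not shown.

```python
import random
from collections import defaultdict


def build_transition_graph(pitch_sequences):
    """Build a weighted directed graph: node = pitch, weight = transition count.

    This construction is an assumption; the paper's graph may differ.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for seq in pitch_sequences:
        for src, dst in zip(seq, seq[1:]):
            graph[src][dst] += 1
    return {node: dict(nbrs) for node, nbrs in graph.items()}


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, picking neighbours in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        nbrs = graph.get(walk[-1])
        if not nbrs:  # dead end: no outgoing edges
            break
        nodes = list(nbrs)
        weights = [nbrs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start a fixed number of walks from every node (seeded for repeatability)."""
    rng = random.Random(seed)
    walks = []
    for node in sorted(graph):
        for _ in range(walks_per_node):
            walks.append(random_walk(graph, node, length, rng))
    return walks


# Toy input: two short pitch sequences given as MIDI note numbers (made-up data).
songs = [[60, 62, 64, 62, 60], [60, 64, 67, 64, 60]]
graph = build_transition_graph(songs)
walks = generate_walks(graph, walks_per_node=2, length=6)

# Speed-up implied by the training times reported in the abstract.
speedup = 2134.7 / 38.9
```

The ratio of the two reported training times is about 54.9, consistent with the stated figure of 55 times.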

Adaptive Sampling for Computer Vision-Oriented Compressive Sensing

  • Luyang Liu
  • Hiroki Nishikawa
  • Jinjia Zhou
  • Ittetsu Taniguchi
  • Takao Onoye

Compressive sensing (CS) is renowned for its efficient signal data compression. However, due to its compressive nature, the accuracy of downstream computer vision (CV) tasks performed on reconstructed data inevitably degrades as the sampling rate decreases. This limitation significantly hinders the application of existing CS techniques. To overcome this drawback, this paper presents a novel CS technique that employs adaptive sampling rates based on saliency distribution. The goal of this work is to enhance the preservation of information necessary for classification while reducing the weight of non-essential information. Experimental results show the effectiveness of the proposed adaptive sampling technique, which outperforms existing sampling CS techniques on the STL10 and Imagenette datasets. The average classification accuracy is improved by up to 26.23% and 18.25%, respectively.
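
One way to picture saliency-driven adaptive sampling is a per-block rate allocation under a fixed average budget. The sketch below is not the authors' method. It assumes an image split into blocks with one non-negative saliency score each, and a simple linear rule that gives every block a floor rate and shares the rest of the budget in proportion to saliency. The function name `allocate_sampling_rates` and the parameters `target_rate` and `min_rate` are hypothetical.

```python
def allocate_sampling_rates(saliency, target_rate, min_rate=0.05):
    """Assign each block a CS sampling rate that grows with its saliency.

    Every block gets `min_rate`; the remaining budget is shared in proportion
    to saliency. Rates are capped at 1.0, so the mean can fall below
    `target_rate` when a cap is hit. This linear rule is an illustrative
    assumption, not the allocation used in the paper.
    """
    if not saliency:
        return []
    if not 0.0 < min_rate <= target_rate <= 1.0:
        raise ValueError("require 0 < min_rate <= target_rate <= 1")
    n = len(saliency)
    total = sum(saliency)
    if total <= 0:
        # No saliency information: fall back to uniform sampling.
        return [target_rate] * n
    extra_budget = n * (target_rate - min_rate)
    return [min(1.0, min_rate + extra_budget * s / total) for s in saliency]


# Four blocks with increasing saliency and an average budget of 25%.
rates = allocate_sampling_rates([0.1, 0.2, 0.3, 0.4], target_rate=0.25)
mean_rate = sum(rates) / len(rates)
```

With these inputs the most salient block is sampled almost three times as densely as the least salient one, while the average rate stays at the budget.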

EmAGAN: Embedded Blocks Search and Mask Attention GAN for Makeup Transfer

  • Li Yan
  • Wang Shibin

Currently, the results of makeup transfer are generally satisfactory in most scenarios. However, the transferred makeup details, such as blush and lip corners, are not accurate. To this end, we propose a variant of generative adversarial networks (GAN) based on embedded block search and mask attention, called EmAGAN, in which we use a latent space module to supplement the detail information for makeup transfer. The latent space module is a randomly initialized Embedded Block. We quantify the transferred image matrix as a discrete probability value and fill in the details of different regions with the closest image information found in the Embedded Block. In addition, we propose a novel makeup alignment module, which divides the facial feature regions of the source and reference images into blocks with a transformer and extracts the mask of the makeup area with an image segmentation approach. We then perform makeup alignment by matching the similarity relationship learned from the mask of the reference image through an attention mechanism. Extensive experiments demonstrate that our proposed method can accurately transfer makeup details. We use the Fréchet Inception Distance (FID) and Peak Signal-to-Noise Ratio (PSNR) to quantitatively evaluate the quality and similarity of the generated images. Compared with other related schemes, our method achieves better scores of 60.12 and 23.226, respectively.

Contextual Associated Triplet Queries for Panoptic Scene Graph Generation

  • Jingbin Xu
  • Junwen Chen
  • Keiji Yanai

The Panoptic Scene Graph generation (PSG) task aims to extract triplets composed of subject, object, and relation based on panoptic segmentation. Among one-stage methods, PSGTR predicts the subject, object, and relation with one query. However, the integrated query is too implicit to simultaneously ascertain pairs of instances and relations. PSGFormer learns instance and relation queries separately and establishes matches between subject-relation and object-relation pairs by employing the relation as an index. Nevertheless, this method could potentially impede the accurate determination of the optimal match. To address these issues, we propose a new one-stage method, Contextual Associated Triplet Queries (CATQ), which employs three branches to decode subject, object, and relation features separately. Additionally, we leverage instance information to guide the relation decoding process. Furthermore, we introduce the triplet context fusion block to enable the extraction of more comprehensive instance pairs and triplet relations. Our proposed method achieves 34.8 Recall@20 and 20.9 mRecall@20, surpassing the state-of-the-art baseline method by 22.5% and 26.0%, respectively, with half of the training session.

Automatic Dataset Creation from User-generated Recipes for Ingredient-centric Food Image Analysis

  • Liangyu Wang
  • Yoko Yamakata
  • Kiyoharu Aizawa

We aim to develop an application that automatically creates a nutrition facts label from food images for precise dietary control. First, we constructed a new dataset with food category labels and a list of ingredients in a nutritionally calculable format, using an image classification model and BERT on 1.6 million recipes accompanied by images. The nutritional value of a recipe can be calculated using a conversion table consisting of the food item number and unit class. Next, using deep learning techniques, we built models that estimate the list of food item numbers from food images. While the multi-task model that identifies the food category label and the ingredient list simultaneously is only effective within a limited number of recipes, the single-task model that identifies only the ingredient list achieved a Micro-F1 of 53.32% in total.

SESSION: Demo Papers

OmniScorer: Real-Time Shot Spot Analysis for Court View Basketball Videos

  • Yen-Pin Cheng
  • Tsung-Hsun Tsai
  • Tai-Chen Tsai
  • Yi-Hsuan Chiu
  • Hung-Kuo Chu
  • Min-Chun Hu

We propose a real-time shot spot analysis system specifically designed for basketball videos captured from a court view perspective, even in the presence of camera movements such as panning, zooming-in, and zooming-out. Our method consists of two stages: the first stage focuses on identifying the precise frame of the shot, while the second stage predicts the shot event category (i.e., 3-point shot, 2-point shot, or free-throw) and localizes the shot spot from a top-view perspective. Compared to existing end-to-end methods for shot event prediction, our method offers significant advantages. It effectively mitigates the overfitting problem and demonstrates superior performance in predicting 3-point shot and free-throw events. To the best of our knowledge, this work is the first real-time system capable of accurately localizing shot spots in basketball games captured by a moving camera with a court view.

TelEmoScatter: Enabling Remote Interaction and Emotional Connections in Virtual and Physical Music Performance

  • Chen-Wei Fu
  • Wei-Lun Huang
  • Pin-Xuan Liu
  • Yu-Hsuan Chen
  • Ming-Cong Su
  • Andrew Chen
  • Ping-Hsuan Han
  • Tse-Yu Pan

To enrich the emotional experiences of virtual reality (VR) online audiences in music performances, we developed TelEmoScatter, a system that facilitates remote interaction between music performers and onsite audiences. Our system also fosters emotional connections for online audiences through sound-visualization conversion, which is influenced by the state of the onsite audiences using computer vision techniques. In this work, we generate a 3D space using real-time sound-visualization techniques by converting MIDI signals from musical instruments into dynamic animations. Additionally, we employ video analysis to predict the emotions of the onsite audience, allowing seamless integration of emotional visual cues into the virtual scene. With our system, users can effortlessly immerse themselves in the emotional expressions of performers through music and experience the unique atmosphere of a live performance venue simply by wearing a VR headset.

FinGuard: A Multimodal AIGC Guardrail in Financial Scenarios

  • Wenlong Du
  • Qingquan Li
  • Jian Zhou
  • Xu Ding
  • Xuewei Wang
  • Zhongjun Zhou
  • Jin Liu

Recently, the development of foundation models has led to significant advances in the ability of artificial intelligence (AI) to generate multimodal content such as text and images. However, specialized industrial scenarios such as finance, which require high levels of security and compliance, pose challenges for the application of generative AI due to its uncontrollability. To address this issue, we propose FinGuard, a multimodal AI-generated content (AIGC) guardrail specifically designed for financial scenarios. We provide detailed definitions of the general quality, financial compliance, and security dimensions of AIGC, and implement the evaluation and inspection of multimodal AIGC including text and images. Our proposed FinGuard has been applied to a financial marketing application serving hundreds of millions of users.

Easy Travelogue: A Travelogue Editor with Automatic Image Recommendation and Insertion

  • Fan Yu
  • Huanyu Xing
  • Jia Bei
  • Tongwei Ren

Travelogues are a common media form that incorporates both text and images. Typically, they are composed after the completion of a travel period. Creating a travelogue demands substantial time and effort, particularly in the curation of suitable images from the extensive collection of photos taken during the journey to complement the text. Consequently, we have developed and implemented Easy Travelogue, a travelogue editor that utilizes visual and language models. It offers real-time image suggestions while writing the text and can automatically insert fitting images into the finished content. The editor is versatile and can be readily utilized for personal travelogues, travel blogs, and various social media platforms, facilitating users in effortlessly sharing and showcasing their travel experiences.

Directional Sound Source Representation Using Paired Microphone Array with Different Characteristics Suitable for Volumetric Video Capture

  • Shota Okubo
  • Tomoaki Konno
  • Toshiharu Horiuchi
  • Tatsuya Kobayashi

In this research, we propose a directional sound source representation technique for 3D content such as volumetric video in the metaverse and digital twins. Our proposed technique enables a novel 3D audio-visual experience derived from immersive audio presentation that expresses the radiation characteristics of a sound source. To realize such an experience, we configure the spaced placement of a paired microphone array and capture sound source signals completely, without placing obstacles in the way of volumetric video capture. Then, we synthesize the directional sound source signal using our technique, which applies signal processing to the captured sound signals based on the positional and directional information of an object relative to a user. We developed and demonstrated a VR application using this technique to evaluate how the sound changes with the movement of the object or user, in accordance with visual rendering. In our user study, we received much positive feedback on the novel audio-visual experience.

A Trajectory-based Statistics and Tactics Analysis System for Table Tennis

  • Guan-Yu Wu
  • Chun-Ho Hung
  • Hsuan-Wei Chen
  • Wei-Ta Chu

For table tennis videos, we develop a system to analyze and generate statistics based on ball trajectories. Using a real-time ball detector, the ball trajectory is constructed with a tracking-by-detection scheme, and landing points on the table are estimated. Based on the moving direction and the sequence of landing points, a three-stage analysis can be achieved. We also analyze how a point starts (serving type classification) and how a point ends (point loss classification).

A consulting system for guiding various image recognitions

  • Ryo Kawai
  • Noboru Yoshida
  • Jianquan Liu

In recent years, various image recognition tasks have been used in many real-world applications thanks to the development and open sourcing of computer vision technologies. However, user expertise is often required to select appropriate recognition engines for the analysis of given images. This limits adoption by beginners who want to apply image recognition to their real-world needs. To make this selection process easier, we propose a consulting system that can automatically suggest appropriate recognition engines for given images or videos. In addition, the system can suggest alternative editing operations, such as enlarging or shrinking, when the size or quality of an image is inappropriate for recognition. We demonstrate the effectiveness, ease of use, and user-friendliness of the proposed consulting system.

VLM-BCD: Unsupervised Building Change Detection

  • Yiyun Zhang
  • Zijian Wang

Building Change Detection (BCD) is one of the most important parts of remote sensing analysis. However, most of the existing BCD approaches require a large amount of pixel-level annotation, which limits their applicability due to intensive labour costs. To alleviate this issue, we propose a vision-language model-based framework, VLM-BCD, which performs BCD tasks without requiring any labels. Specifically, the proposed framework consists of two stages: 1) Bi-temporal building localisation by leveraging open-vocabulary DETR. 2) Unchanged mask suppressing by the Change Resolver module to detect the building change in bi-temporal satellite images. An application with an interactive dashboard is implemented to maximise the usability of the developed framework.

SESSION: Demo Papers

One-Epoch Training for Object Detection in Fisheye Images

  • Yu-Hsi Chen

This challenge is divided into two stages: qualification and final competition. Regular image data is provided, while detection must be performed on images with a fisheye effect. The approach described here begins by transforming the original images to mimic fisheye images for training. Furthermore, this challenge imposes limitations on computational resources, so striking a balance between accuracy and speed is crucial. In this paper, we show that our approach for this competition can achieve high performance with just one epoch of training. In summary, we achieved the top position among 24 participating teams in the qualification competition and secured the fourth position among the 11 teams that submitted successfully in the final competition. The corresponding source code will be available at: One-Epoch Training for Object Detection in Fisheye Images.

Adapting Object Detection to Fisheye Cameras: A Knowledge Distillation with Semi-Pseudo-Label Approach

  • Chih-Chung Hsu
  • Wen-Hai Tseng
  • Ming-Hsuan Wu
  • Chia-Ming Lee
  • Wei-Hao Huang

In this paper, we introduce a lightweight object detection system, custom-designed for fisheye cameras and optimized for quick deployment on embedded systems. Given the constraints of training solely on standard images, our methodology centers on the effective knowledge transfer to accentuate object detection in fisheye scenarios. The integration of the Parallel Residual Bi-Fusion (PRB) Feature Pyramid Network (FPN) into the state-of-the-art YOLOv7 backbone specifically addresses the challenges of detecting tiny objects often present in fisheye images.

Our unique two-phase training strategy operates as follows: Firstly, a comprehensive Teacher Model is trained on standard images, setting the stage for knowledge acquisition. Subsequently, in the second phase, this knowledge is distilled to a more compact Student Model. The twist is in using fisheye images as pseudo-information, ensuring the model’s adaptability to fisheye-centric environments. Combining knowledge distillation with semi-pseudo-label semi-supervised learning, this strategy guarantees optimal performance and embraces a lightweight design perfect for real-time applications on constrained devices. In essence, our contributions span the crafting of a specialized object detection framework for fisheye cameras, the proposition of a novel two-tiered training strategy, and the synergetic use of PRB with YOLOv7. Empirical results reinforce the efficacy of our approach, illustrating that while our model retains a compact footprint, it doesn’t compromise on performance, excelling in tasks with a comparable nature and offering swift inference.

Object Detection via Fisheye Camera

  • Yi-Zeng Hsieh
  • Hau-Ching Chen
  • Yi-Hung Yeh

During the competition, several factors that could decrease the effectiveness of the training result were quickly identified, such as the lack of distortion in the provided training data, the high similarity between multiple images, and the extreme imbalance in quantity between classes. Due to the short duration of the competition, we proposed six simple-to-conduct yet very effective methods: 1) data filtering by skipping neighboring images, 2) data filtering to reduce high-quantity classes, 3) using other datasets to replenish total quantity and abundance, 4) developing an algorithm to distort a plain image accurately, 5) pretraining on selected MS-COCO. Ablation studies were conducted to find out the effect of each part of the data, and we found that using a selected set of MS-COCO for pretraining could increase its effect. Our work is straightforward in terms of the methods used, but our evaluation shows that the result is reasonably solid.

Summary of the 2023 PAIR-LITEON Competition: Embedded AI Object Detection Model Design Contest on Fish-eye Around-view Cameras

  • Yu-Shu Ni
  • Chia-Chi Tsai
  • Jyun-Syu Lin
  • Hsien-Po Meng
  • Po-Chi Hu
  • Jiun-Shiung Chen
  • Kun-Hung Lin
  • Chih-Yuan Chuang
  • Jiun-In Guo

This competition is dedicated to achieving fisheye object detection in Asia, particularly in countries like Taiwan, while emphasizing low power consumption and simultaneously achieving a high mean average precision (mAP). This task is notably challenging as it must be accomplished in adverse driving conditions. The objects targeted for detection include cars, pedestrians, motorcycles, and bicycles. To train their models, participants utilized 89,002 annotated training images from the iVS-Dataset [1] and conducted testing on the MemryX platform [2]. To excel in this competition, participants had to master the art of transforming standard images into fisheye images. The judging process involved 6,500 test images, with 1,500 used in the preliminary competition stage and the rest reserved for the final competition stage. A total of 129 teams registered for this competition, and those with mAP scores exceeding 20% advanced to the final competition stage, for which 16 teams qualified. Out of these, 11 teams submitted their works based on the final competition accuracy, which could not be lower than 5% of the preliminary competition accuracy. Ultimately, five teams attained their final scores and competed for rankings based on paper reviews. The champion is chici_lab, securing the top position in this demanding competition. NCKU_ACVLab, the 1st Runner-up, demonstrated outstanding skills. The 2nd Runner-up, yuhsi44165, also showcased commendable performance. Special Awards recognized excellence in specific categories, with chici_lab sweeping all three accolades: the best pedestrian detection award, the best bicycle detection award, and the best motorbike detection award.