MM '23: Proceedings of the 31st ACM International Conference on Multimedia

SESSION: Keynote Talks

Internet of Video Things: Technical Challenges and Emerging Applications

  • Chang-Wen Chen

The worldwide flourishing of the Internet of Things (IoT) in the past decade has enabled numerous new applications through the internetworking of a wide variety of devices and sensors. In recent years, visual sensors have seen a considerable boom in IoT systems because they are capable of providing richer and more versatile information. Internetworking of large-scale visual sensors has been named the Internet of Video Things (IoVT). IoVT has a new array of unique characteristics in terms of sensing, transmission, storage, and analysis, all of which are fundamentally different from the conventional IoT. These new characteristics of IoVT are expected to impose significant challenges on existing technical infrastructures. In this keynote talk, an overview of recent advances on various fronts of IoVT will be presented and a broad range of technological and systematic challenges will be addressed. Several emerging IoVT applications will be discussed to illustrate the great potential of IoVT in a broad range of practical scenarios.

Multimodal AI & LLMs for Peacekeeping and Emergency Response

  • Alejandro Jaimes

When an emergency event, or an incident relevant for peacekeeping, first occurs, getting the right information as quickly as possible is critical in saving lives. When an event is ongoing, information on what is happening can be critical in making decisions to keep people safe and take control of the particular situation unfolding. In both cases, first responders and peacekeepers have to quickly make decisions that include what resources to deploy and where. Fortunately, in most emergencies, people use social media to publicly share information. At the same time, sensor data is increasingly becoming available. But a platform to detect emergency situations and deliver the right information has to deal with ingesting thousands of noisy data points per second: sifting through and identifying relevant information, from different sources, in different formats, with varying levels of detail, in real time, so that relevant individuals and teams can be alerted at the right level and at the right time. In this talk I will describe the technical challenges in processing vast amounts of heterogeneous, noisy data in real time, highlighting the importance of interdisciplinary research and a human-centered approach to address problems in peacekeeping and emergency response. I will give specific examples of how LLMs can be deployed at scale and outline relevant future research directions in Multimedia.

Transition and Adaptability: The Cornerstone of Resilience in Future Networked Multimedia Systems and Beyond

  • Ralf Steinmetz

Let us define transition as the "exchange" between two mechanisms with comparable functionality, but with different algorithms and implementation concepts, which are optimal depending on the respective conditions of the respective context. It is much more than adaptability; it covers not just the smooth automatic control of, e.g., a MAPE loop or a control loop that is in charge of maximizing the quality of service of streamed media data while errors occur.

Resilience describes the ability of a system to either absorb large changes (and crises) and recover from them in the short term or to overcome them by acquiring comparable or new basic functionality through overall system adjustments. In doing so, the system's readiness increases continuously and sustainably by learning from past changes of the context (and crises).

Just one extreme example: In a situation of severe danger due to a natural disaster, a person located in the affected area transmits to the rescue team an on-the-fly generated 360-degree panoramic point cloud of the situation. He still has sufficient energy supply and, for whatever reason, the communication facilities are still available. Due to energy shortage, a lot of other traffic, and some damage to the infrastructure, multimedia communication must be adjusted continuously to the environment and requirements. In an extreme situation, data is sent over high-latency, low-bandwidth satellite channels. Media might become a short textual description of the actual surroundings. Assume it happens without any manual interaction by the person sending this data. Multimedia and communication mechanisms must be exchanged; media must be "customized". Transitions happen to support the user in, e.g., such an extreme stress situation.

In the collaborative research project MAKI, as well as in emergenCITY, our center researching resilient infrastructures of digital cities that can withstand crises and disasters, we address some of these issues. However, in the coming years, many multimedia systems, interfaces, applications, etc. beyond multimedia networks will be affected.

SESSION: Oral Session I: Understanding Multimedia Content -- Media Interpretation

Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing

  • Hao Shen
  • Zhong-Qiu Zhao
  • Yulun Zhang
  • Zhao Zhang

Multi-stage architectures have exhibited efficacy in image dehazing; they usually decompose a challenging task into multiple more tractable sub-tasks and progressively estimate latent haze-free images. Despite the remarkable progress, existing methods still suffer from the following shortcomings: (1) limited exploration of frequency domain information; (2) insufficient information interaction; (3) severe feature redundancy. To remedy these issues, we propose a novel Mutual Information-driven Triple interaction Network (MITNet) based on spatial-frequency dual domain information and a two-stage architecture. To be specific, the first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal. The second stage, named phase-guided structure refined, is devoted to learning the transformation and refinement of the phase spectrum. To facilitate the information exchange between the two stages, an Adaptive Triple Interaction Module (ATIM) is developed to simultaneously aggregate cross-domain, cross-scale, and cross-stage features, where the fused features are further used to generate content-adaptive dynamic filters that are then applied to enhance global context representation. In addition, we impose a mutual information minimization constraint on paired scale encoder and decoder features from both stages. Such an operation can effectively reduce information redundancy and enhance cross-stage feature complementarity. Extensive experiments on multiple public datasets show that our MITNet achieves superior performance with lower model complexity. The code and models are available at
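
As a rough illustration of the amplitude and phase spectra mentioned above, the following sketch splits a signal into the two spectra and rebuilds it from them. It is a 1-D toy built on a naive DFT, not MITNet itself; the function names (`dft`, `idft`, `split_spectrum`, `rebuild`) and the sample values are assumptions made for this example only.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_spectrum(signal):
    """Return (amplitude, phase) lists, one entry per frequency bin."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def rebuild(amplitude, phase):
    """Recombine amplitude and phase, then transform back to the signal domain."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [value.real for value in idft(spectrum)]


row = [0.2, 0.5, 0.9, 0.4]  # hypothetical pixel intensities of one image row
amplitude, phase = split_spectrum(row)
restored = rebuild(amplitude, phase)  # matches `row` up to rounding error
```

Because the two spectra together carry the full signal, a method can in principle modify one of them (for example the amplitude) while leaving the other untouched, which is the kind of separation the two stages above rely on.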

Suspected Objects Matter: Rethinking Model's Prediction for One-stage Visual Grounding

  • Yang Jiao
  • Zequn Jie
  • Jingjing Chen
  • Lin Ma
  • Yu-Gang Jiang

Recently, one-stage visual grounders have attracted considerable attention due to their comparable accuracy but significantly higher efficiency than two-stage grounders. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, need not be performed among all objects, as only some of them are related to the text query and may confuse the model. We call these objects "suspected objects". However, exploring their relationships in the one-stage paradigm is non-trivial because: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) suspected objects are more confusing than others, as they may share similar semantics, be entangled with certain relationships, etc., and thereby more easily mislead the model's prediction. Toward this end, we propose a Suspected Object Transformation mechanism (SOT), which can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders to encourage the target object selection among the suspected ones. Suspected objects are dynamically discovered from a learned activation map adapted to the model's current discrimination ability during training. Afterward, on top of suspected objects, a Keyword-Aware Discrimination module (KAD) and an Exploration by Random Connection strategy (ERC) are concurrently proposed to help the model rethink its initial prediction. On the one hand, KAD leverages keywords that contribute strongly to suspected object discrimination. On the other hand, ERC allows the model to seek the correct object instead of being trapped in a situation that always exploits the current false prediction. Extensive experiments demonstrate the effectiveness of our proposed method.

Self-Relational Graph Convolution Network for Skeleton-Based Action Recognition

  • Sophyani Banaamwini Yussif
  • Ning Xie
  • Yang Yang
  • Heng Tao Shen

Using a Graph convolution network (GCN) for constructing and aggregating node features has been helpful for skeleton-based action recognition. The strength of the nodes' relation of an action sequence distinguishes it from other actions. This work proposes a novel spatial module called Multi-scale self-relational graph convolution (MS-SRGC) for dynamically modeling joint relations of action instances. Modeling the joints' relations is crucial in determining the spatial distinctiveness between skeleton sequences; hence MS-SRGC shows effectiveness for activity recognition. We also propose a Hybrid multi-scale temporal convolution network (HMS-TCN) that captures different ranges of time steps along the temporal dimension of the skeleton sequence. In addition, we propose a Spatio-temporal blackout (STB) module that randomly zeroes some continuous frames for selected strategic joint groups. We sequentially stack our spatial (MS-SRGC) and temporal (HMS-TCN) modules to form a Self-relational graph convolution network (SR-GCN) block, which we use to construct our SR-GCN model. We append our STB on top of the SR-GCN model for the randomized operation. With the effectiveness of ensemble networks, we perform extensive experiments on single and multiple ensembles. Our results beat the state-of-the-art methods on the NTU RGB-D, NTU RGB-D 120, and Northwestern-UCLA datasets.
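
The blackout idea described above (zeroing a run of consecutive frames for a chosen joint group) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' STB module: the nested-list data layout (frames, then joints, then coordinates), the function name `spatio_temporal_blackout`, and its parameters are all hypothetical.

```python
import random


def spatio_temporal_blackout(sequence, joint_group, span, rng):
    """Return (copy, start): a copy of `sequence` in which `span` consecutive
    frames, beginning at a random index `start`, are zeroed for the joints
    listed in `joint_group`. The input sequence is left unchanged."""
    num_frames = len(sequence)
    span = min(span, num_frames)
    start = rng.randrange(num_frames - span + 1)
    out = [[list(joint) for joint in frame] for frame in sequence]
    for t in range(start, start + span):
        for j in joint_group:
            out[t][j] = [0.0] * len(out[t][j])
    return out, start


# Hypothetical toy input: 6 frames, 3 joints, 2-D coordinates, all ones.
skeleton = [[[1.0, 1.0] for _ in range(3)] for _ in range(6)]
masked, start = spatio_temporal_blackout(
    skeleton, joint_group=[0, 2], span=2, rng=random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the example reproducible; how joint groups and span lengths are actually selected in the paper is not specified in the abstract.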

Exploring Correlations in Degraded Spatial Identity Features for Blind Face Restoration

  • Qian Ning
  • Fangfang Wu
  • Weisheng Dong
  • Xin Li
  • Guangming Shi

Blind face restoration aims to recover high-quality face images from low-quality ones with complex and unknown degradation. Existing approaches have achieved promising performance by leveraging pre-trained dictionaries or generative priors. However, these methods may fail to exploit the full potential of degraded inputs and facial identity features due to complex degradation. To address this issue, we propose a novel method that explores the correlation of degraded spatial identity features by learning a general representation using a memory network. Specifically, our approach enhances degraded features with more identity information by leveraging similar facial features retrieved from the memory network. We also propose a fusion approach that fuses memorized spatial features with GAN prior features via affine transformation and blending fusion to improve fidelity and realism. Additionally, the memory network is updated online in an unsupervised manner along with other modules, which obviates the requirement for pre-training. Experimental results on synthetic and popular real-world datasets demonstrate the effectiveness of our proposed method, which achieves at least comparable and often better performance than other state-of-the-art approaches.

Video-based Visible-Infrared Person Re-Identification via Style Disturbance Defense and Dual Interaction

  • Chuhao Zhou
  • Jinxing Li
  • Huafeng Li
  • Guangming Lu
  • Yong Xu
  • Min Zhang

Video-based visible-infrared person re-identification (VVI-ReID) aims to retrieve video sequences of the same pedestrian from different modalities. The key to VVI-ReID is to learn discriminative sequence-level representations that are invariant to both intra- and inter-modal discrepancies. However, most works focus only on eliminating the modality gap while ignoring the distractors within the modality. Moreover, existing sequence-level representation learning approaches are limited to a single video, failing to mine the correlations among multiple videos of the same pedestrian. In this paper, we propose a Style Augmentation, Attack and Defense network with Graph-based dual interaction (SAADG) to guarantee semantic consistency against both intra-modal discrepancies and the inter-modal gap. Specifically, we first generate diverse styles for video frames by random style variation in image space. Then, through style attack and defense, the intra- and inter-modal discrepancies are modeled as different types of style disturbance (attack), and our model learns to keep the id-related content invariant under such attack. Besides, a graph-based dual interaction module is further introduced to fully explore the cross-view and cross-modal correlations among various videos of the same identity, which are then transferred to the sequence-level representations. Extensive experiments on the public SYSU-MM01 and HITSZ-VCM datasets show that our approach achieves remarkable performance compared with state-of-the-art methods. The code is available at

PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search

  • Wenmiao Hu
  • Yichen Zhang
  • Yuxuan Liang
  • Xianjing Han
  • Yifang Yin
  • Hannes Kruppa
  • See-Kiong Ng
  • Roger Zimmermann

Satellite-based street-view information extraction by cross-view matching refers to a task that extracts the location and orientation information of a given street-view image query by using one or multiple geo-referenced satellite images. Recent work has initiated a new research direction to find accurate information within a local area covered by one satellite image centered at a location prior (e.g., from GPS). It can be used as a standalone solution or as a complementary step following a large-scale search with multiple satellite candidates. However, these existing works require an accurate initial orientation (angle) prior (e.g., from IMU) and/or do not efficiently search through all possible poses. To allow efficient search and to give accurate prediction regardless of the existence or the accuracy of the angle prior, we present PetalView extractors with multi-scale search. The PetalView extractors give semantically meaningful features that are equivalent across two drastically different views, and the multi-scale search strategy efficiently inspects the satellite image from coarse to fine granularity to provide sub-meter and sub-degree precision extraction. Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on the KITTI dataset test 1 set, raising the recall within 1 meter (r@1m) for location estimation to 68.88% and the recall within 1 degree (r@1d) to 21.10% when no angle prior is available; with an angle prior, it achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level.
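
For readers unfamiliar with the r@1m and r@1d figures quoted above, a recall-within-threshold metric can be sketched as below. This is an illustrative sketch, not the benchmark's evaluation code: the function names, the inclusive `<=` comparison, the wrap-around handling of angles, and all sample numbers are assumptions.

```python
def angular_error(predicted_deg, true_deg):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    return abs((predicted_deg - true_deg + 180.0) % 360.0 - 180.0)


def recall_within(errors, threshold):
    """Fraction of queries whose absolute error is at most `threshold`."""
    if not errors:
        raise ValueError("errors must be non-empty")
    hits = sum(1 for e in errors if abs(e) <= threshold)
    return hits / len(errors)


# Made-up per-query errors, for illustration only.
location_errors_m = [0.4, 0.9, 1.7, 3.2]
heading_pairs_deg = [(359.5, 0.2), (10.0, 350.0), (45.0, 45.3), (90.0, 120.0)]

r_at_1m = recall_within(location_errors_m, 1.0)
r_at_1d = recall_within([angular_error(p, t) for p, t in heading_pairs_deg], 1.0)
```

The wrap-around in `angular_error` matters because headings of 359.5 and 0.2 degrees differ by 0.7 degrees, not 359.3.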

Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos

  • Haorui Wang
  • Yibo Hu
  • Yangfu Zhu
  • Jinsheng Qi
  • Bin Wu

Social Relation Recognition is an important part of Video Understanding, providing insights into the information that videos convey. Most previous works mainly focused on graph generation for characters, instead of edges which are more suitable for relation modelling. Furthermore, previous methods tend to recognize social relations for single frames or short video clips within their receptive fields, neglecting the importance of continuous reasoning throughout the entire video. To tackle these challenges, we propose a novel Shifted GCN-GAT and Cumulative-Transformer framework, named SGCAT-CT. The overall architecture consists of an SGCAT module for shifted graph operations on novel relation graphs and a CT module for temporal processing with memory. SGCAT-CT conducts continuous recognition of social relations and memorizes information from as early as the beginning of a long video. Experiments conducted on several video datasets demonstrate encouraging performance on long videos. Our code will be released at

Causal Intervention for Sparse-View Gait Recognition

  • Jilong Wang
  • Saihui Hou
  • Yan Huang
  • Chunshui Cao
  • Xu Liu
  • Yongzhen Huang
  • Liang Wang

Gait recognition aims at identifying individuals by unique walking patterns at a long distance. However, prevailing methods suffer from a large degradation when applied to large-scale surveillance systems. We find that a significant cause of this issue is that previous methods heavily rely on full-view person annotations to reduce view differences by pulling the anchor closer to positive samples from different viewpoints. However, subjects in in-the-wild scenarios usually have only a limited number of sequences from different viewpoints. As a result, the available viewpoints of each subject are sparse compared to the whole dataset, and simply minimizing intra-identity differences cannot effectively reduce the view differences in the whole dataset. In this work, we formulate this overlooked problem as Sparse-View Gait Recognition and provide a comprehensive analysis of it by a Structural Causal Model for causalities among latent features, view distribution, and labels. Based on our analysis, we propose a simple yet effective method that enables networks to learn a more robust representation among different views. Specifically, our method consists of two parts: 1) an effective metric learning algorithmic implementation based on the backdoor adjustment, which improves the consistency of representations among different views; 2) an unsupervised view cluster algorithm to discover and identify the most influential view contexts. We evaluate the effectiveness of our method on popular GREW, Gait3D, CASIA-B, and OU-MVLP, showing that our method consistently outperforms baselines and achieves state-of-the-art performance. The code will be available at

MM-AU: Towards Multimodal Understanding of Advertisement Videos

  • Digbalay Bose
  • Rajat Hebbar
  • Tiantian Feng
  • Krishna Somandepalli
  • Anfeng Xu
  • Shrikanth Narayanan

Advertisement videos (ads) play an integral part in the domain of Internet e-commerce, as they amplify the reach of particular products to a broad audience or can serve as a medium to raise awareness about specific issues through concise narrative structures. The narrative structures of advertisements involve several elements like reasoning about the broad content (topic and the underlying message) and examining fine-grained details involving the transition of perceived tone due to the sequence of events and interaction among characters. In this work, to facilitate the understanding of advertisements along the three dimensions of topic categorization, perceived tone transition, and social message detection, we introduce a multimodal multilingual benchmark called MM-AU comprising 8.4K videos (147 hrs) curated from multiple web-based sources. We explore multiple zero-shot reasoning baselines through the application of large language models on the ad transcripts. Further, we demonstrate that leveraging signals from multiple modalities, including audio, video, and text, in multimodal transformer-based supervised models leads to improved performance compared to unimodal approaches.

UER: A Heuristic Bias Addressing Approach for Online Continual Learning

  • Huiwei Lin
  • Shanshan Feng
  • Baoquan Zhang
  • Hongliang Qiao
  • Xutao Li
  • Yunming Ye

Online continual learning aims to continuously train neural networks from a continuous data stream with a single pass through the data. As the most effective approach, rehearsal-based methods replay part of the previous data. Commonly used predictors in existing methods tend to generate biased dot-product logits that favor the classes of the current data, which is known as a bias issue and a phenomenon of forgetting. Many approaches have been proposed to overcome the forgetting problem by correcting the bias; however, they still need to be improved in the online setting. In this paper, we try to address the bias issue with a more straightforward and more efficient method. By decomposing the dot-product logits into an angle factor and a norm factor, we empirically find that the bias problem mainly occurs in the angle factor, which can be used to learn novel knowledge as cosine logits. In contrast, the norm factor abandoned by existing methods helps remember historical knowledge. Based on this observation, we intuitively propose to leverage the norm factor to balance the new and old knowledge for addressing the bias. To this end, we develop a heuristic approach called unbias experience replay (UER). UER learns current samples only by the angle factor and further replays previous samples by both the norm and angle factors. Extensive experiments on three datasets show that UER achieves superior performance over various state-of-the-art methods. The code is in
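
The decomposition mentioned above rests on the standard identity that a dot product equals the product of the two vector norms and the cosine of the angle between them. A minimal sketch of that identity follows; it does not reproduce UER itself. The helper names, the choice to fold both norms into one "norm factor", and the sample vectors are assumptions for illustration, and the paper may define the factors differently.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))


def decompose_logit(feature, weight):
    """Split the dot-product logit into (norm_factor, angle_factor) so that
    dot(feature, weight) == norm_factor * angle_factor.
    Assumes both vectors are non-zero."""
    norm_factor = norm(feature) * norm(weight)
    angle_factor = dot(feature, weight) / norm_factor  # cosine of the angle
    return norm_factor, angle_factor


feature = [3.0, 4.0]  # hypothetical sample embedding
weight = [1.0, 0.0]   # hypothetical class weight vector
norm_factor, angle_factor = decompose_logit(feature, weight)
cosine_logit = angle_factor              # angle-only score
full_logit = norm_factor * angle_factor  # equals dot(feature, weight)
```

In this reading, a classifier that scores with `cosine_logit` alone ignores vector magnitudes, whereas `full_logit` lets the norms influence the prediction as well.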

Clip Fusion with Bi-level Optimization for Human Mesh Reconstruction from Monocular Videos

  • Peng Wu
  • Xiankai Lu
  • Jianbing Shen
  • Yilong Yin

Human mesh reconstruction (HMR) from monocular video is a key step in many mixed reality and robotic applications. Although existing methods show promising results by capturing frames' temporal information, these methods predict human mesh with the design of implicit temporal learning modules in a sequence-to-frame manner. To mine more temporal information from the video, we present a bi-level clip inference network for HMR, which leverages both local motion and global context explicitly for dense 3D reconstruction. Specifically, we propose a novel bi-level temporal fusion strategy that takes both neighboring and long-range relations into consideration. In addition, different from traditional frame-wise operation, we investigate an alternative perspective by treating video-based HMR as clip-wise inference. We evaluate the proposed method on multiple datasets (3DPW, Human3.6M, and MPI-INF-3DHP) quantitatively and qualitatively, demonstrating a significant improvement over existing methods (in terms of PA-MPJPE, ACC-Error, etc.). Furthermore, we extend the proposed method to the more challenging Multiple Shots HMR task to demonstrate its generalizability. Some visual demos can be seen

Parsing is All You Need for Accurate Gait Recognition in the Wild

  • Jinkai Zheng
  • Xinchen Liu
  • Shuai Wang
  • Lihao Wang
  • Chenggang Yan
  • Wu Liu

Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades since they are easy to extract from video frames. Despite their success in gait recognition for in-the-lab environments, they usually fail in real-world scenarios due to their low information entropy for gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the GPS representation, we propose a novel human parsing-based gait recognition framework, named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two lightweight heads. The first head extracts global semantic features from GPSs, while the other one learns mutual information of part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. Specifically, ParsingGait achieves a 17.5% Rank-1 increase compared with the state-of-the-art silhouette-based method. In addition, by replacing silhouettes with GPSs, current gait recognition methods achieve improvements of about 12.5% to 19.2% in Rank-1 accuracy. The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait.

Multi-Scale Similarity Aggregation for Dynamic Metric Learning

  • Dingyi Zhang
  • Yingming Li
  • Zhongfei Zhang

In this paper, we propose a new multi-scale similarity aggregation method (MSA) for dynamic metric learning (DyML), which adopts a pretraining-finetuning scheme and efficiently learns the similarity relationship for each semantic level. In particular, building upon the framework of self-supervised pretraining, the output embedding layer is divided into three learners to learn the similarity relations in each level individually. Then for training these learners, the hierarchical prior information is fully considered. Specifically, in light of the class hierarchy that each class in a coarse level corresponds to a set of subclasses in a finer level, multi-proxy learning is employed to facilitate the single-level similarity learning of each learner. On the other hand, following the hierarchical consistency property, a cross-level similarity constraint is further presented to encourage the estimated similarities of the three learners to be hierarchically consistent. Extensive experiments on three DyML datasets show that MSA significantly outperforms the existing state-of-the-art methods and allows for a better generalization for different semantic scales.

RefineTAD: Learning Proposal-free Refinement for Temporal Action Detection

  • Yue Feng
  • Zhengye Zhang
  • Rong Quan
  • Limin Wang
  • Jie Qin

Temporal action detection (TAD) aims to localize the start and end frames of actions in untrimmed videos, which is a challenging task due to the similarity of adjacent frames and the ambiguity of action boundaries. Previous methods often generate coarse proposals first and then perform proposal-based refinement, which is coupled with prior action detectors and leads to proposal-oriented offsets. However, this paradigm increases the training difficulty of the TAD model and is heavily influenced by the quantity and quality of the proposals. To address the above issues, we decouple the refinement process from conventional TAD methods and propose a learnable, proposal-free refinement method for fine boundary localization, named RefineTAD. We first propose a multi-level refinement module to generate multi-scale boundary offsets, score offsets and boundary-aware probability at each time point based on the feature pyramid. Then, we propose an offset focusing strategy to progressively refine the predicted results of TAD models in a coarse-to-fine manner with our multi-scale offsets. We perform extensive experiments on three challenging datasets and demonstrate that our RefineTAD significantly improves the state-of-the-art TAD methods with minimal computational overhead.

Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

  • Zhenguang Liu
  • Xinyang Yu
  • Ruili Wang
  • Shuai Ye
  • Zhe Ma
  • Jianfeng Dong
  • Sifeng He
  • Feng Qian
  • Xiaobo Zhang
  • Roger Zimmermann
  • Lei Yang

The self-media era provides us with a tremendous number of high-quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features.

In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple sub-features, explicitly separating the feature into exclusive lower-dimensional components. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature.

Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset. Our code and model have been released at, hoping to contribute to the community.

Pseudo Object Replay and Mining for Incremental Object Detection

  • Dongbao Yang
  • Yu Zhou
  • Xiaopeng Hong
  • Aoting Zhang
  • Xin Wei
  • Linchengxi Zeng
  • Zhi Qiao
  • Weiping Wang

Incremental object detection (IOD) aims to mitigate catastrophic forgetting for object detectors when incrementally learning to detect new emerging object classes without using original training data. Most existing IOD methods benefit from the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the new training data. However, in practical scenarios, old-class objects may be absent, which is called non co-occurrence IOD. In this paper, we propose a pseudo object replay and mining method (PseudoRM) to handle the co-occurrence dependent problem, reducing the performance degradation caused by the absence of old-class objects. The new training data can be augmented by co-occurring fake (old-class) and real (new-class) objects with a patch-level data-free generation method in the pseudo object replay stage. To fully use existing training data, we propose pseudo object mining to explore false positives for transferring useful instance-level knowledge. In the incremental learning procedure, a generative distillation is introduced to distill image-level knowledge for balancing stability and plasticity. Experimental results on PASCAL VOC and COCO demonstrate that PseudoRM can effectively boost the performance on both co-occurrence and non co-occurrence scenarios without using old samples or extra wild data.

Informative Classes Matter: Towards Unsupervised Domain Adaptive Nighttime Semantic Segmentation

  • Shiqin Wang
  • Xin Xu
  • Xianzheng Ma
  • Kui Jiang
  • Zheng Wang

Unsupervised Domain Adaptive Nighttime Semantic Segmentation (UDA-NSS) aims to adapt a robust model from a labeled daytime domain to an unlabeled nighttime domain. However, current advanced segmentation methods ignore the illumination effect and class discrepancies of different semantic classes during domain adaptation, showing an uneven prediction phenomenon. This is the ignored and underexplored issue of "hard-to-adapt" classes: some classes have a large performance gap between existing UDA-NSS methods and their supervised learning counterparts, while others have a very small gap. To enable more sufficient learning of "hard-to-adapt" classes and facilitate the UDA-NSS task, we present an Online Informative Class Sampling (OICS) strategy to adaptively mine informative classes from the target nighttime domain according to the corresponding spectrogram mean and the class frequency via our Informative Mixture of Experts. Furthermore, an Informativeness-based cross-domain Mixed Sampling (InforMS) framework is designed to focus on informative classes from the target nighttime domain by assigning them higher sampling probabilities during cross-domain mixed sampling, which achieves better performance in UDA-NSS tasks. Consequently, our method outperforms state-of-the-art UDA-NSS methods by large margins on three widely used benchmarks (i.e., ACDC, Dark Zurich, and Nighttime Driving). Notably, our method achieves state-of-the-art performance with 65.1% mIoU on ACDC-night-test and 55.4% mIoU on ACDC-night-val.

View while Moving: Efficient Video Recognition in Long-untrimmed Videos

  • Ye Tian
  • Mengyu Yang
  • Lanshan Zhang
  • Zhizhen Zhang
  • Yang Liu
  • Xiaohui Xie
  • Xirong Que
  • Wendong Wang

Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm involves two visits of raw frames from coarse-grained to fine-grained during inference (which cannot be parallelized), and the captured spatiotemporal features cannot be reused in the second stage (due to varying granularity), which hinders efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm only needs to access the raw frame once. The two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, showing great performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about the unit-level and video-level temporal semantics in long-untrimmed videos respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

  • Yimin Deng
  • Huaizhen Tang
  • Xulong Zhang
  • Jianzong Wang
  • Ning Cheng
  • Jing Xiao

Voice conversion (VC), the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Much research has been devoted to better implementations of VC tasks. However, a good voice conversion model should match not only the timbre information of the target speaker but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of naturalness and similarity of converted speech.

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

  • Gege Shi
  • Xueyang Fu
  • Chengzhi Cao
  • Zheng-Jun Zha

Recognizing activities with Unmanned Aerial Vehicles (UAVs) is essential for many applications, while existing video recognition methods are mainly designed for ground cameras and do not account for UAV changing attitudes and fast motion. This creates spatial misalignment of small objects between frames, leading to inaccurate visual movement in drone videos. Additionally, camera motion relative to objects in the video causes relative movements that visually affect object motion and can result in misunderstandings of video content. To address these issues, we present a novel framework named Attentional Spatial and Adaptive Temporal Relations Modeling. First, to mitigate the spatial misalignment of small objects between frames, we design an Attentional Patch-level Spatial Enrichment (APSE) module that models dependencies among patches and enhances patch-level features. Then, we propose a Multi-scale Temporal and Spatial Mixer (MTSM) module that is capable of adapting to disturbances caused by the UAV flight and modeling various temporal clues. By integrating APSE and MTSM into a single model, our network can effectively and accurately capture spatiotemporal relations for UAV videos. Extensive experiments on several benchmarks demonstrate the superiority of our method over state-of-the-art approaches. For instance, our network achieves a classification accuracy of 68.1% with an absolute gain of 1.3% compared to FuTH-Net on the ERA dataset.

Learning Causality-inspired Representation Consistency for Video Anomaly Detection

  • Yang Liu
  • Zhaoyang Xia
  • Mengyang Zhao
  • Donglai Wei
  • Yuzheng Wang
  • Siao Liu
  • Bobo Ju
  • Gaoyun Fang
  • Jing Liu
  • Liang Song

Video anomaly detection is an essential yet challenging task in the multimedia community, with promising applications in smart cities and secure communities. Existing methods attempt to learn abstract representations of regular events with statistical dependence to model the endogenous normality, which discriminates anomalies by measuring the deviations from the learned distribution. However, conventional representation learning is only a crude description of video normality and lacks an exploration of its underlying causality. The learned statistical dependence is unreliable for diverse regular events in the real world and may cause high false alarm rates due to overgeneralization. Inspired by causal representation learning, we posit that there exists a causal variable capable of adequately representing the general patterns of regular events in which anomalies will present significant variations. Therefore, we design a causality-inspired representation consistency (CRC) framework to implicitly learn the unobservable causal variables of normality directly from available normal videos and detect abnormal events with the learned representation consistency. Extensive experiments show that the causality-inspired normality is robust to regular events with label-independent shifts, and the proposed CRC framework can quickly and accurately detect various complicated anomalies from real-world surveillance videos.

M2ATS: A Real-world Multimodal Air Traffic Situation Benchmark Dataset and Beyond

  • Dongyue Guo
  • Yi Lin
  • Xuehang You
  • Zhongping Yang
  • Jizhe Zhou
  • Bo Yang
  • Jianwei Zhang
  • Han Shi
  • Shasha Hu
  • Zheng Zhang

Air Traffic Control (ATC) is a complicated, time-evolving, and real-time procedure to direct flight operations in a safe and orderly manner. Although enormous amounts of data have been stored during air traffic operations for over 40 years, data-driven intelligent application in aviation is still an emerging task due to its safety-critical nature. With the prevalence of the Next Generation ATC system, artificial intelligence (AI)-empowered research topics are attracting increasing attention from both industry and academia, and a high-quality dataset naturally becomes the prerequisite for such practices. However, almost all ATC-related datasets are unimodal and built for certain tasks, and thus fail to comprehensively illustrate the traffic situation to further support real-world studies. To address this gap, a multimodal air traffic situation (M2ATS) dataset is constructed to advance AI-related research in the ATC domain, including airspace information, flight plan, trajectory, and speech. M2ATS covers ATC situation data for 10362 flights, involving 110000+ utterances (104 hours) with diverse golden text annotations, 16 intents, and 51 slots. Considering the real-world ATC requirements, a total of 10 multimedia-related tasks (24 baselines) are designed to validate the proposed dataset, covering automatic speech recognition, natural language processing, and spatial-temporal data processing. New ATC-related metrics corresponding to ATC applications are proposed in addition to the common metrics to evaluate task performance. Extensive experimental results demonstrate that the selected baselines can achieve the designed tasks on this new dataset, and that further investigations are required to address task and data specificities. It is believed that the proposed dataset is a new practice to advance AI applications to an industrial scene, which not only promotes ATC-related applications but also provides diverse research topics for the broader multimedia community.

Federated Learning with Label-Masking Distillation

  • Jianghu Lu
  • Shikun Li
  • Kexin Bao
  • Pengju Wang
  • Zhenxing Qian
  • Shiming Ge

Federated learning provides a privacy-preserving way to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where label distributions differ significantly across clients due to differing user behavior. When faced with such cases, most existing methods reach suboptimal solutions due to the inadequate utilization of label distribution information in clients. Motivated by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning via perceiving the various label distributions of each client. We classify the labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The process of distillation masks out the predictions of majority labels from the global model, so that it can focus more on preserving the minority label knowledge of the client. A series of experiments show that the proposed approach can achieve state-of-the-art performance in various cases. Moreover, considering the limited resources of the clients, we propose a variant FedLMD-Tf that does not require an additional teacher, which outperforms previous lightweight approaches without increasing computational costs. Our code is available at
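The masking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the count threshold used to split labels, the function names, and the choice of a KL divergence over the remaining (non-majority) classes are all assumptions made for illustration.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_labels(local_labels, num_classes, threshold):
    """Hypothetical rule: a class is 'majority' on this client if its
    local example count is at least `threshold`; otherwise 'minority'."""
    counts = Counter(local_labels)
    majority = {c for c in range(num_classes) if counts[c] >= threshold}
    minority = set(range(num_classes)) - majority
    return majority, minority


def masked_distillation_loss(student_logits, teacher_logits, majority):
    """KL(teacher || student) after masking out majority classes.

    The teacher stands in for the global model and the student for the
    client model. Both distributions are renormalized over the classes
    that remain after masking (an assumption of this sketch).
    """
    keep = [c for c in range(len(teacher_logits)) if c not in majority]
    if not keep:
        return 0.0
    t = softmax([teacher_logits[c] for c in keep])
    s = softmax([student_logits[c] for c in keep])
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)


# Example: class 0 dominates the local data, so it is masked out.
majority, minority = split_labels([0, 0, 0, 1, 2, 0], num_classes=4, threshold=2)
loss = masked_distillation_loss([5.0, 2.0, 1.0, 0.5], [-3.0, 1.0, 2.0, 0.5], majority)
```

In this sketch, disagreement between the two models on a majority class does not contribute to the loss, which mirrors the stated goal of letting distillation focus on minority-label knowledge.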

Painterly Image Harmonization using Diffusion Model

  • Lingxiao Lu
  • Jiangtong Li
  • Junyan Cao
  • Li Niu
  • Liqing Zhang

Painterly image harmonization aims to insert photographic objects into paintings and obtain artistically coherent composite images. Previous methods for this task mainly rely on inference optimization or generative adversarial networks, but they are either very time-consuming or struggle with fine control of the foreground objects (e.g., texture and content details). To address these issues, we propose a novel Painterly Harmonization stable Diffusion model (PHDiffusion), which includes a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module. Specifically, the adaptive encoder and the DEF module first stylize foreground features within each encoder. Then, the stylized foreground features from both encoders are combined to guide the harmonization process. During training, besides the noise loss in the diffusion model, we additionally employ content loss and two style losses, i.e., AdaIN style loss and contrastive style loss, aiming to balance the trade-off between style migration and content preservation. Compared with the state-of-the-art models from related fields, our PHDiffusion can stylize the foreground more sufficiently and simultaneously retain finer content. Our code and model are available at
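For readers unfamiliar with the AdaIN style loss named in this abstract, the following sketch shows the general idea on plain Python lists. It assumes the commonly used form (matching per-channel mean and standard deviation of features); the abstract does not give the exact formulation used in PHDiffusion, and all function names here are illustrative.

```python
import math


def mean_std(values, eps=1e-5):
    """Per-channel mean and standard deviation (eps avoids division by zero)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var + eps)


def adain_channel(content, style):
    """Adaptive instance normalization for one channel:
    normalize the content values, then rescale to the style statistics."""
    mu_c, sd_c = mean_std(content)
    mu_s, sd_s = mean_std(style)
    return [sd_s * (v - mu_c) / sd_c + mu_s for v in content]


def adain_style_loss(output_channels, style_channels):
    """Sum over channels of squared differences in mean and in std.
    This is one common form; the paper's variant may differ."""
    loss = 0.0
    for out, sty in zip(output_channels, style_channels):
        mu_o, sd_o = mean_std(out)
        mu_s, sd_s = mean_std(sty)
        loss += (mu_o - mu_s) ** 2 + (sd_o - sd_s) ** 2
    return loss


# Example: one channel of foreground features and one of painting features.
content = [1.0, 2.0, 3.0, 4.0]
style = [10.0, 20.0, 30.0, 40.0]
stylized = adain_channel(content, style)
before = adain_style_loss([content], [style])
after = adain_style_loss([stylized], [style])
```

After the AdaIN transfer the stylized channel shares the style statistics, so `after` is close to zero while `before` is large.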

Exploring Hyperspectral Histopathology Image Segmentation from a Deformable Perspective

  • Xingran Xie
  • Ting Jin
  • Boxiang Yun
  • Qingli Li
  • Yan Wang

Hyperspectral images (HSIs) offer great potential for computational pathology. However, limited by the spectral redundancy and the lack of spectral prior in popular 2D networks, previous HSI-based techniques do not perform well. To address these problems, we propose to segment HSIs from a deformable perspective, which processes different spectral bands independently and fuses spatiospectral features of interest via deformable attention mechanisms. In addition, we propose Deformable Self-Supervised Spectral Regression (DF-S3R), which introduces two self-supervised pre-text tasks based on the low-rank prior of HSIs, enabling the network to learn spectrum-related features. During pre-training, DF-S3R learns both spectral structures and spatial morphology, and the jointly pre-trained architectures help alleviate the transfer risk to downstream fine-tuning. Experiments show that our deformable architecture and pre-training method perform much better than other competitive methods on pathological semantic segmentation tasks, and the visualizations indicate that our method can trace the critical spectral characteristics from subtle spectral disparities. Code will be released at

Uncertainty-Aware Variate Decomposition for Self-supervised Blind Image Deblurring

  • Runhua Jiang
  • Yahong Han

Blind image deblurring remains challenging due to the ill-posed nature of the traditional blurring function. Although previous supervised methods have achieved great breakthroughs with synthetic blurry-sharp image pairs, their generalization ability to real-world blurs is limited by the discrepancy between synthetic and real blurs. To overcome this limitation, unsupervised deblurring methods have been proposed by using natural priors or generative adversarial networks. However, natural priors are vulnerable to random blur artifacts, while generators of generative adversarial networks always produce inaccurate details and unrealistic colors. Consequently, previous methods easily suffer from slow convergence and poor performance. In this work, we propose to formulate the traditional blurring function as the composition of multiple variates, thus allowing us to explicitly define characteristics of residual images between blurry and sharp images. We also propose a multi-step self-supervised deblurring framework to address the slow convergence issue. Our framework continuously decomposes and composes input images, thus utilizing the uncertainty of blur artifacts to obtain diverse pseudo blurry-sharp image pairs for self-supervised learning. This framework is more efficient than previous methods, as it does not rely on natural priors or GANs. Extensive comparisons demonstrate that the proposed framework outperforms state-of-the-art unsupervised methods on dynamic scene, human-aware centric motion, real-world, and out-of-focus deblurring datasets. The codes are available at

SESSION: Oral Session II: Understanding Multimedia Content -- Multimodal Fusion and Embedding

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

  • Chao Sun
  • Min Chen
  • Jialiang Cheng
  • Han Liang
  • Chuanbo Zhu
  • Jincai Chen

Audio and vision are important senses for high-level cognition, and their special strong correlation makes audio-visual coding a crucial factor in many multimodal tasks. However, there are two challenges in audio-visual coding. First, the heterogeneity of multimodal data often leads to misalignment of cross-modal features under the same sample, which reduces their representation quality. Second, most self-supervised learning frameworks are constructed based on instance semantics, and the generated pseudo labels introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which leverage multimodal complementary and hidden information for better representation. Additionally, we introduce a supervised cross-modal contrastive loss to minimize the distance between audio and vision features of the same instance, and use weak labels of multimodal data to eliminate the feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms the state-of-the-art results, even with limited computational resources.
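The supervised cross-modal contrastive loss mentioned in this abstract can be sketched as follows. This is a generic SupCon-style formulation in the audio-to-visual direction only, written as an assumption for illustration; the temperature value, the function names, and the exact choice of positives are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def sup_cross_modal_contrastive(audio, visual, labels, temperature=0.1):
    """For each audio anchor, visual embeddings with the same (weak) label
    are treated as positives and all others as negatives.
    The same-index pair is always a positive, so the set is never empty."""
    n = len(audio)
    total = 0.0
    for i in range(n):
        sims = [cosine(audio[i], visual[j]) / temperature for j in range(n)]
        m = max(sims)
        # log-sum-exp over all visual candidates, computed stably
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        positives = [j for j in range(n) if labels[j] == labels[i]]
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
    return total / n


# Example: two instances with different labels.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = sup_cross_modal_contrastive(audio, [[1.0, 0.0], [0.0, 1.0]], [0, 1])
misaligned = sup_cross_modal_contrastive(audio, [[0.0, 1.0], [1.0, 0.0]], [0, 1])
```

The loss is small when audio and visual features of the same class point in the same direction (`aligned`) and large when they are swapped (`misaligned`).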

Cross-Modal and Multi-Attribute Face Recognition: A Benchmark

  • Feng Lin
  • Kaiqiang Fu
  • Hao Luo
  • Ziyue Zhan
  • Zhibo Wang
  • Zhenguang Liu
  • Lorenzo Cavallaro
  • Kui Ren

Face recognition has made significant advances with the development of deep learning and has begun to be deployed in some unrestricted scenarios. Many smartphones, for example, have infrared sensors that allow them to capture clear images even in low-light conditions. Face authentication under complex environmental conditions can thus be accomplished by matching NIR-VIS face images across modalities. However, existing NIR-VIS datasets lack enough variation in face attributes and are insufficient for real-world scenarios. To address these issues, we first propose a 300-person NIR-VIS cross-modality face dataset with a variety of attributes. Based on modal information removal, we then propose a NIR-VIS cross-modal face recognition model. We can effectively extract modal information by constraining the similarity distribution of modalities and then using the orthogonal loss to remove modal information from identity features. The method achieves excellent results on our dataset and the CASIA NIR-VIS 2.0 dataset.
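As a rough illustration of how an orthogonal loss can push identity features away from modality features, the sketch below penalizes the squared cosine similarity between paired vectors. The abstract does not specify the exact form of the loss, so this formulation and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def orthogonal_loss(identity_feats, modality_feats):
    """Mean squared cosine similarity between each identity vector and the
    modality vector extracted from the same image (assumed pairing).
    The loss is 0 when every pair is orthogonal and 1 when every pair is parallel."""
    pairs = list(zip(identity_feats, modality_feats))
    return sum(cosine(i, m) ** 2 for i, m in pairs) / len(pairs)


# Example: orthogonal pairs incur no penalty, parallel pairs the maximum.
no_leak = orthogonal_loss([[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]])
full_leak = orthogonal_loss([[1.0, 1.0]], [[2.0, 2.0]])
```

Minimizing such a term encourages identity features to carry no component along the modality direction, which is the intent described in the abstract.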

A Closer Look at Classifier in Adversarial Domain Generalization

  • Ye Wang
  • Junyang Chen
  • Mengzhu Wang
  • Hao Li
  • Wei Wang
  • Houcheng Su
  • Zhihui Lai
  • Wei Wang
  • Zhenghan Chen

The task of domain generalization is to learn a classification model from multiple source domains and generalize it to unknown target domains. The key to domain generalization is learning discriminative domain-invariant features. Adversarial domain generalization is one of the primary techniques for achieving invariant representations. For example, generative adversarial networks have been widely used, but suffer from the problem of low intra-class diversity, which can lead to poor generalization ability. To address this issue, we propose a new method called auxiliary classifier in adversarial domain generalization (CloCls). CloCls improves the diversity of the source domain by introducing an auxiliary classifier. Combining typical task-related losses, e.g., cross-entropy loss for classification and adversarial loss for domain discrimination, our overall goal is to guarantee the learning of condition-invariant features for all source domains while increasing the diversity of source domains. Further, inspired by the finding that smooth optima improve generalization in supervised learning tasks such as classification, we leverage the observation that converging to a smooth minimum with respect to the task loss stabilizes adversarial training, leading to better performance on unseen target domains and effectively enhancing domain adversarial methods. We have conducted extensive image classification experiments on benchmark datasets in domain generalization, and our model exhibits sufficient generalization ability and outperforms state-of-the-art DG methods.

Mixture-of-Experts Learner for Single Long-Tailed Domain Generalization

  • Mengzhu Wang
  • Jianlong Yuan
  • Zhibin Wang

Domain generalization (DG) refers to the task of training a model on multiple source domains and testing it on a different target domain with a different distribution. In this paper, we address a more challenging and realistic scenario known as Single Long-Tailed Domain Generalization, where only one source domain is available and the minority class in this domain has an abundance of instances in other domains. To tackle this task, we propose a novel approach called Mixture-of-Experts Learner for Single Long-Tailed Domain Generalization (MoEL), which comprises two key strategies. The first strategy is a simple yet effective data augmentation technique that leverages saliency maps to identify important regions on the original images and preserves these regions during augmentation. The second strategy is a new skill-diverse expert learning approach that trains multiple experts from a single long-tailed source domain and leverages mutual learning to aggregate their learned knowledge for the unknown target domain. We evaluate our method on various benchmark datasets, including Digits-DG, CIFAR-10-C, PACS, and DomainNet, and demonstrate its superior performance compared to previous single domain generalization methods. An ablation study is also conducted to illustrate the inner workings of our approach.

Robust Spectral Embedding Completion Based Incomplete Multi-view Clustering

  • Chao Zhang
  • Jingwen Wei
  • Bo Wang
  • Zechao Li
  • Chunlin Chen
  • Huaxiong Li

Graph-based methods have been widely used in incomplete multi-view clustering (IMVC). Most recent methods try to fill the original missing samples or incomplete affinity matrices to obtain a complete similarity graph for the subsequent spectral clustering. However, recovering the original high-dimensional data or complete n × n similarity matrix is usually time-consuming and noise-sensitive. Besides, they generally separate the cluster indicator learning into an individual step, which may result in sub-optimal graphs or spectral embeddings for clustering. To address these problems, this paper proposes a robust Spectral Embedding Completion based IMVC (SEC-IMVC) method, which incorporates spectral embedding completion and discrete cluster indicator learning into a unified framework. SEC-IMVC performs completion on spectral embeddings, and the embedding noise is eliminated to reduce the negative influence of original data noise. The discrete cluster indicator matrix is seamlessly learned by using spectral rotation, and it can explore the first-order feature consistency among different views. To further improve the completion robustness, the second-order correlation consistency is also captured by pairwise relations alignment. We compare our method with some state-of-the-art approaches on several datasets, and the experimental results show the effectiveness and advantages of our method.

SA-GDA: Spectral Augmentation for Graph Domain Adaptation

  • Jinhui Pang
  • Zixuan Wang
  • Jiliang Tang
  • Mingyan Xiao
  • Nan Yin

Graph neural networks (GNNs) have achieved impressive performance on graph-related tasks. However, most GNNs are primarily studied in the single-domain setting with supervised training, which requires abundant task-specific labels and is difficult to transfer to other domains. Few works have focused on domain adaptation for graph node classification. They mainly focus on aligning the feature space of the source and target domains, without considering the feature alignment between different categories, which may lead to confusion of classification in the target domain. Moreover, due to the scarcity of labels in the target domain, we cannot directly perform effective alignment of categories from different domains, which makes the problem more challenging. In this paper, we present the Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. First, we observe that nodes with the same category in different domains exhibit similar characteristics in the spectral domain, while different classes are quite different. Following this observation, we align the category feature space of different domains in the spectral domain instead of aligning the whole feature space, and we theoretically prove the stability of the proposed SA-GDA. Then, we develop a dual graph convolutional network to jointly exploit local and global consistency for feature aggregation. Last, we utilize a domain classifier with an adversarial learning submodule to facilitate knowledge transfer between different domain graphs. Experimental results on a variety of publicly available datasets reveal the effectiveness of our SA-GDA.

CONVERT: Contrastive Graph Clustering with Reliable Augmentation

  • Xihong Yang
  • Cheng Tan
  • Yue Liu
  • Ke Liang
  • Siwei Wang
  • Sihang Zhou
  • Jun Xia
  • Stan Z. Li
  • Xinwang Liu
  • En Zhu

Contrastive graph node clustering via learnable data augmentation is an active research topic in the field of unsupervised graph learning. The existing methods learn the sampling distribution of a pre-defined augmentation to generate data-driven augmentations automatically. Although promising clustering performance has been achieved, we observe that these strategies still rely on pre-defined augmentations, so the semantics of the augmented graph can easily drift. The reliability of the augmented view semantics for contrastive learning cannot be guaranteed, thus limiting the model performance. To address these problems, we propose a novel CONtrastiVe Graph ClustEring network with Reliable AugmenTation (CONVERT). Specifically, in our method, the data augmentations are processed by the proposed reversible perturb-recover network. It distills reliable semantic information by recovering the perturbed latent embeddings. Moreover, to further guarantee the reliability of semantics, a novel semantic loss is presented to constrain the network via quantifying the perturbation and recovery. Lastly, a label-matching mechanism is designed to guide the model by clustering information through aligning the semantic labels and the selected high-confidence clustering pseudo labels. Extensive experimental results on seven datasets demonstrate the effectiveness of the proposed method. We release the code and appendix of CONVERT on GitHub.

High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization

  • Jintian Ji
  • Songhe Feng

Recently, tensor-based multi-view clustering methods have achieved promising results, primarily benefiting from their superior ability in exploring high-order consistent information among views. Despite significant progress, these methods inevitably suffer from several drawbacks: 1) Extremely high computational complexity restricts their feasibility for large-scale data sets. 2) Prevalently adopted tensor rank approximations (e.g., Tensor Nuclear Norm (TNN)) tend to under-penalize small singular values, resulting in noise residuals. 3) Tensor structure is rarely utilized for high-order complementarity investigation. In light of this, we propose High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization (CFMVC-ETR). Specifically, two sets of representation matrices are learned from original multi-view data via the matrix factorization mechanism with a group of base matrices, which are further reconstructed into the consistent tensor and the complementary tensor, respectively. Subsequently, a novel Enhanced Tensor Rank is imposed on the consistent tensor, which is a tighter approximation of the tensor rank and is more noise-robust in exploring the high-order consistency. Meanwhile, a tensor-level constraint termed Tensorial Exclusive Regularization is proposed on the complementary tensor to enhance the view-specific feature and better capture the high-order complementarity. Moreover, we adopt a concatenation-fusion approach to integrate these two parts, deriving a discriminative unified embedding for the clustering task. We solve CFMVC-ETR by an efficient algorithm with good convergence. Extensive experiments on nine challenging data sets demonstrate the superiority of the proposed method.

DealMVC: Dual Contrastive Calibration for Multi-view Clustering

  • Xihong Yang
  • Jin Jiaqi
  • Siwei Wang
  • Ke Liang
  • Yue Liu
  • Yi Wen
  • Suyuan Liu
  • Sihang Zhou
  • Xinwang Liu
  • En Zhu

Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which prevents further improvement in clustering performance. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC on GitHub.

Bidomain Modeling Paradigm for Pansharpening

  • Junming Hou
  • Qi Cao
  • Ran Ran
  • Che Liu
  • Junling Li
  • Liang-jian Deng

Pansharpening is a challenging low-level vision task whose aim is to learn the complementary representation between spectral information and spatial detail. Despite the remarkable progress, existing deep neural network (DNN) based pansharpening algorithms are still confronted with common limitations: 1) these methods rarely consider the local specificity of different spectral bands; 2) they often extract the global detail in the spatial domain, which ignores the task-related degradation, e.g., the down-sampling process of the MS image, and also suffers from a limited receptive field. In this work, we propose a novel bidomain modeling paradigm for the pansharpening problem (dubbed BiMPan), which takes into account both local spectral specificity and global spatial detail. More specifically, we first customize the specialized source-discriminative adaptive convolution (SDAConv) for every spectral band instead of sharing identical kernels across all bands as in prior works. Then, we devise a novel Fourier global modeling module (FGMM), which is capable of embracing global information while benefiting the disentanglement of image degradation. By integrating the band-aware local feature and Fourier global detail from these two functional designs, we can fuse a texture-rich yet visually pleasing high-resolution MS image. Extensive experiments demonstrate that the proposed framework achieves favorable performance against current state-of-the-art pansharpening methods. The code is available at

Learning High-frequency Feature Enhancement and Alignment for Pan-sharpening

  • Yingying Wang
  • Yunlong Lin
  • Ge Meng
  • Zhenqi Fu
  • Yuhang Dong
  • Linyu Fan
  • Hedeng Yu
  • Xinghao Ding
  • Yue Huang

Pan-sharpening aims to utilize the high-resolution panchromatic (PAN) image as guidance to super-resolve the low-resolution multispectral (MS) image. The key challenge in pan-sharpening is how to effectively and precisely inject high-frequency edges and textures from the PAN image into the low-resolution MS image. To address this issue, we propose a High-frequency Feature Enhancement and Alignment Network (HFEAN) to effectively encourage high-frequency learning. To implement it, three core designs are customized: a Fourier convolution based efficient feature enhancement module (FEM), an implicit neural alignment module (INA), and a preliminary alignment module (Pre-align). To be specific, FEM employs the fast Fourier convolution with an attention mechanism to achieve a mixed global-local receptive field on each scale of the high-frequency domain, thus yielding informative latent codes. INA leverages an implicit neural function to precisely align the latent codes from different scales in the continuous domain. In this way, the high-frequency signals at different scales are represented as functions of continuous coordinates, enabling a precise feature alignment in a resolution-free manner. Pre-align is developed to further address the inherent misalignment between PAN and MS pairs. Extensive experiments over multiple satellite datasets validate the effectiveness of the proposed network and demonstrate its favorable performance against the existing state-of-the-art methods both visually and quantitatively. Code is available at:

Distribution Consistency based Fast Anchor Imputation for Incomplete Multi-view Clustering

  • Xingfeng Li
  • Yinghui Sun
  • Quansen Sun
  • Jia Dai
  • Zhenwen Ren

In practical scenarios, partially missing multi-view data is very common, such as missing registration information in social network analysis, which gives rise to incomplete multi-view clustering (IMVC). Filling in missing data quickly and efficiently plays a vital role in improving IMVC and remains a significant challenge. Existing IMVC methods always use all observed data to fill in missing data, resulting in high complexity and poor imputation quality due to a lack of guidance from consistent distribution. To break the existing limitations, we propose a novel Distribution Consistency based Fast Anchor Imputation for Incomplete Multi-view Clustering (DCFAI-IMVC) method. Specifically, to eliminate the interference of redundant and fraudulent features in the original space, incomplete data are first projected into a consensus latent space, where we dynamically learn a small number of anchors to achieve fast and good imputation. Then, we employ global distribution information of the observed embedding representations to further ensure the consistent distribution between the learned anchors and the observed embedding representations. Ultimately, a tensor low-rank constraint is imposed on bipartite graphs to investigate the high-order correlations hidden in data. DCFAI-IMVC enjoys linear complexity in terms of sample number, which gives it great potential to handle large-scale IMVC tasks. Extensive experiments on multiple public datasets validate the effectiveness, superiority, and efficiency of our method against recent advances.

Visual Causal Scene Refinement for Video Question Answering

  • Yushen Wei
  • Yang Liu
  • Hong Yan
  • Guanbin Li
  • Liang Lin

Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). Particularly, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scene and achieving robust video question answering.

Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks

  • Hongye Liu
  • Xianhai Xie
  • Yang Gao
  • Zhou Yu

The pretrain-then-finetune paradigm has been widely used in various unimodal and multimodal tasks. However, finetuning all the parameters of a pre-trained model becomes prohibitive as the model size grows exponentially. To address this issue, the adapter mechanism, which freezes the pre-trained model and only finetunes a few extra parameters, has been introduced and delivers promising results. Most studies on adapter architectures are dedicated to unimodal or bimodal tasks, while adapter architectures for trimodal tasks have not been investigated yet. This paper introduces a novel Long Short-Term Trimodal Adapter (LSTTA) approach for video understanding tasks involving audio, visual, and language modalities. Based on the pre-trained models from the three modalities, the designed adapter module is inserted between the sequential blocks to model the dense interactions across the three modalities. Specifically, LSTTA consists of two types of complementary adapter modules, namely the long-term semantic filtering module and the short-term semantic interaction module. The long-term semantic filtering module aims to characterize the temporal importance of the video frames, and the short-term semantic interaction module models local interactions within short periods. Compared to previous state-of-the-art trimodal learning methods pre-trained on a large-scale trimodal corpus, LSTTA is more flexible and can inherit any powerful unimodal or bimodal models. Experimental results on four typical trimodal learning tasks show the effectiveness of LSTTA over existing state-of-the-art methods.
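The adapter mechanism described above (freeze the pre-trained model, train only a few extra parameters) can be sketched with a generic bottleneck adapter. This is a minimal illustration of the general idea, not the LSTTA modules; the layer sizes, the zero initialisation, and every name below are assumptions.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def make_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class BottleneckAdapter:
    """Generic adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, rng)
        # Zero-initialised up-projection, so the adapter starts as an identity map
        # and the frozen model's behaviour is unchanged before finetuning.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        delta = matvec(self.up, relu(matvec(self.down, h)))
        return [a + b for a, b in zip(h, delta)]

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)

rng = random.Random(0)
dim, bottleneck = 64, 4

frozen_W = make_matrix(dim, dim, rng)  # stands in for one frozen pre-trained layer
adapter = BottleneckAdapter(dim, bottleneck, rng)

x = [rng.uniform(-1, 1) for _ in range(dim)]
h = matvec(frozen_W, x)   # frozen forward pass
out = adapter(h)          # only the adapter weights would receive gradients

frozen_params = dim * dim
trainable_params = adapter.num_params()
print(trainable_params, frozen_params)
```

In this toy setting the adapter adds 2 x 64 x 4 trainable weights next to a 64 x 64 frozen layer, which shows why the approach scales better than full finetuning as the backbone grows.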

ReCo: A Dataset for Residential Community Layout Planning

  • Xi Chen
  • Yun Xiong
  • Siqi Wang
  • Haofen Wang
  • Tao Sheng
  • Yao Zhang
  • Yu Ye

Layout planning is centrally important in the field of architecture and urban design. Among the various basic units carrying urban functions, the residential community plays a vital part in supporting human life. Therefore, the layout planning of residential communities has always been of concern, and has attracted particular attention since the advent of deep learning, which facilitates automated layout generation and spatial pattern recognition. However, the research community generally suffers from a shortage of residential community layout benchmarks or high-quality datasets, which hampers the future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulties of large-scale real-world residential data acquisition and long-term expert screening. In order to address these issues and advance a benchmark dataset for various intelligent spatial design and analysis applications in the development of smart cities, we introduce the Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset related to real-world communities to date. The ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for residential community layout related urban design tasks, e.g., generative layout design, morphological pattern recognition and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, two Generative Adversarial Network (GAN) based generative models are further applied to the dataset. We expect the ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: and related code can be found at:

Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection

  • Runmin Cong
  • Hongyu Liu
  • Chen Zhang
  • Wei Zhang
  • Feng Zheng
  • Ran Song
  • Sam Kwong

By integrating complementary information from the RGB image and depth map, the ability of salient object detection (SOD) for complex and challenging scenes can be improved. In recent years, the important role of Convolutional Neural Networks (CNNs) in feature extraction and cross-modality interaction has been fully explored, but CNNs are still insufficient in modeling global long-range dependencies of self-modality and cross-modality. To this end, we introduce a CNN-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement (PICR-Net). On the one hand, considering the prior correlation between the RGB modality and the depth modality, an attention-triggered cross-modality point-aware interaction (CmPI) module is designed to explore the feature interaction of different modalities with positional constraints. On the other hand, in order to alleviate the block effect and detail destruction problems naturally brought by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation. Extensive experiments on five RGB-D SOD datasets show that the proposed network achieves competitive results in both quantitative and qualitative comparisons. Our code is publicly available at:

Multi-view Self-Expressive Subspace Clustering Network

  • Jinrong Cui
  • Yuting Li
  • Yulu Fu
  • Jie Wen

Advanced deep multi-view subspace clustering methods are based on the self-expressive model, which has achieved impressive performance. However, most existing works have several limitations: 1) They endure high computational complexity when learning a consistent affinity matrix, impeding their capacity to handle large-scale multi-view data; 2) The global and local structure information of multi-view data remains under-explored. To tackle these challenges, we propose a simplistic but comprehensive framework called Multi-view Self-Expressive Subspace Clustering (MSESC) network. Specifically, we design a deep metric network to replace the conventional self-expressive model, which can directly and efficiently produce the intrinsic similarity values of any instance-pairs of all views. Moreover, our method explores global and local structure information from the connectivity of instance-pairs across views and the nearest neighbors of instance-pairs within the view, respectively. By integrating global and local structure information within a unified framework, MSESC can learn a high-quality shared affinity matrix for better clustering performance. Extensive experimental results indicate the superiority of MSESC compared to several state-of-the-art methods.

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

  • Jian Huang
  • Yanli Ji
  • Yang Yang
  • Heng Tao Shen

Effective alignment and fusion of multimodal features remain a significant challenge for multimodal sentiment analysis. In various multimodal applications, the text modal exhibits a significant advantage of compact yet expressive representation ability. In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide other modalities for learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. We propose a semantic representation interactive learning module to learn concise semantic representation tokens for audio and video modalities under the guidance of the text modality, ensuring semantic alignment of representations among multiple modalities. Furthermore, we design a semantic relationship interactive learning module, which calculates a self-attention matrix for each modality and controls their consistency to enable the semantic relationship alignment for multiple modalities. Finally, we present a two-stage interactive fusion solution to bridge the modality gap for multimodal fusion and sentiment analysis. Extensive experiments are performed on the CMU-MOSEI, CMU-MOSI, and UR-FUNNY datasets, and experiment results demonstrate the effectiveness of our proposed approach.

Entropy Neural Estimation for Graph Contrastive Learning

  • Yixuan Ma
  • Xiaolin Zhang
  • Peng Zhang
  • Kun Zhan

Contrastive learning on graphs aims at extracting distinguishable high-level representations of nodes. We theoretically illustrate that the entropy of a dataset is approximated by maximizing the lower bound of the mutual information across different views of a graph, i.e., entropy is estimated by a neural network. Based on this finding, we propose a simple yet effective subset sampling strategy to contrast pairwise representations between views of a dataset. In particular, we randomly sample nodes and edges from a given graph to build the input subset for a view. Two views are fed into a parameter-shared Siamese network to extract the high-dimensional embeddings and estimate the information entropy of the entire graph. For the learning process, we propose to optimize the network using two objectives, simultaneously. Concretely, the input of the contrastive loss consists of positive and negative pairs. Our selection strategy of pairs is different from previous works and we present a novel strategy to enhance the representation ability by selecting nodes based on cross-view similarities. We enrich the diversity of the positive and negative pairs by selecting highly similar samples and totally different data with the guidance of cross-view similarity scores, respectively. We also introduce a cross-view consistency constraint on the representations generated from the different views. We conduct experiments on seven graph benchmarks, and the proposed approach achieves competitive performance compared to the current state-of-the-art methods. The source code is available at
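The idea of maximizing a lower bound on mutual information across two views is commonly realized with the InfoNCE loss, for which the mutual information is bounded below by log N minus the loss over N pairs. The sketch below assumes InfoNCE purely for illustration; the abstract does not state the exact objective used, and the embeddings, temperature, and function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; row i of view_a is the positive pair of row i of view_b."""
    n = len(view_a)
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i in range(n):
        logits = [dot(a[i], b[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

def mi_lower_bound(view_a, view_b, temperature=0.5):
    """InfoNCE bound: I(A; B) >= log N - loss."""
    return math.log(len(view_a)) - info_nce(view_a, view_b, temperature)

# Toy node embeddings from two views of the same graph (hypothetical values).
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

loss_aligned = info_nce(view_a, aligned)
loss_shuffled = info_nce(view_a, shuffled)
print(loss_aligned < loss_shuffled)
```

When the two views agree, the loss is small and the bound is tight against log N; when positives are mismatched, the loss grows and the estimated bound drops.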

Cross-modal and Cross-medium Adversarial Attack for Audio

  • Liguo Zhang
  • Zilin Tian
  • Yunfei Long
  • Sizhao Li
  • Guisheng Yin

Acoustic waves are forms of energy that propagate through various mediums. They can be represented by different modalities, such as auditory signals and visual patterns. The two modalities are often described as one-dimensional waveform in the time domain and two-dimensional spectrogram in the frequency domain. Most acoustic signal processing methods use single modal data for input and training models. This poses a challenge for black-box adversarial attacks on audio signals because the input modality is also unknown to the attacker. In fact, there currently exist no methods that explore the cross-modal transferability of adversarial perturbation. This paper investigates the cross-modal transferability from waveform to spectrogram. We argue that the data distributions in the sample space with the different modalities have mapping relations and propose a novel decision-based cross-modal and cross-medium adversarial attack method. Specifically, it generates an initial example with cross-modal attack capability by combining random natural noise, then iteratively reduces the perturbation to enhance its invisibility. It incorporates the constraints of the spectrogram sample space while iteratively optimizing adversarial perturbations for black-box audio classification models. The perturbation is imperceptible to humans, both visually and aurally. Extensive experiments demonstrate that our approach can launch attacks on classification models for sound waves and spectrograms that share the same audio signal. Furthermore, we explore the cross-medium capability of our proposed adversarial attack strategy that can target processing models for acoustic signals propagating in air and seawater. The proposed method has preeminent invisibility and generalization compared to other methods.
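The two-stage procedure in the abstract (start from a noisy example that already fools the model, then iteratively shrink the perturbation) follows the general pattern of decision-based black-box attacks. The sketch below shows that generic pattern on a toy problem only: the classifier, the noise scale, and the binary-search reduction are assumptions, and the spectrogram constraints of the proposed method are not modeled.

```python
import random

def toy_classifier(x):
    """Hypothetical black-box model: exposes only a hard label, no scores."""
    return 1 if sum(x) / len(x) > 0.0 else 0

def blend(x, x_adv, alpha):
    """Interpolate between the clean input (alpha=0) and the adversarial start (alpha=1)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(x, x_adv)]

def l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def decision_based_attack(x, classify, rng, steps=30, max_tries=1000):
    original = classify(x)
    # Stage 1: add random noise until the predicted label changes.
    for _ in range(max_tries):
        start = [v + rng.gauss(0.0, 2.0) for v in x]
        if classify(start) != original:
            break
    else:
        return None
    # Stage 2: binary-search the smallest blend factor that stays adversarial,
    # using label queries only.
    lo, hi = 0.0, 1.0  # lo: label unchanged, hi: label changed
    for _ in range(steps):
        mid = (lo + hi) / 2
        if classify(blend(x, start, mid)) != original:
            hi = mid
        else:
            lo = mid
    return blend(x, start, hi), start

rng = random.Random(0)
x = [-0.5] * 16  # toy "waveform" that the model labels as class 0
adv, start = decision_based_attack(x, toy_classifier, rng)
print(l2(x, adv) <= l2(x, start))
```

Only the predicted label is queried at every step, which is what makes the setting decision-based: the attacker needs no gradients or confidence scores from the target model.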

Unsupervised Multiplex Graph learning with Complementary and Consistent Information

  • Liang Peng
  • Xin Wang
  • Xiaofeng Zhu

Unsupervised multiplex graph learning (UMGL) has been shown to achieve significant effectiveness for different downstream tasks by exploring both complementary information and consistent information among multiple graphs. However, previous methods usually overlook the issues in practical applications, i.e., the out-of-sample issue and the noise issue. To address the above issues, in this paper, we propose an effective and efficient UMGL method to explore both complementary and consistent information. To do this, our method employs multiple MLP encoders rather than graph convolutional network (GCN) to conduct representation learning with two constraints, i.e., preserving the local graph structure among nodes to handle the out-of-sample issue, and maximizing the correlation of multiple node representations to handle the noise issue. Comprehensive experiments demonstrate that our proposed method achieves superior effectiveness and efficiency over the comparison methods and effectively tackles those two issues. Code is available at

GCL: Gradient-Guided Contrastive Learning for Medical Image Segmentation with Multi-Perspective Meta Labels

  • Yixuan Wu
  • Jintai Chen
  • Jiahuan Yan
  • Yiheng Zhu
  • Danny Z. Chen
  • Jian Wu

Since annotating medical images for segmentation tasks commonly incurs expensive costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited a great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable "semantic contradiction" when combining different meta labels. In this paper, we tackle the issue of "semantic contradiction" in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain a better high-level semantic recognition ability. Moreover, we emphasize that the fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.

Multi-Spectral Image Stitching via Spatial Graph Reasoning

  • Zhiying Jiang
  • Zengxi Zhang
  • Jinyuan Liu
  • Xin Fan
  • Risheng Liu

Multi-spectral image stitching leverages the complementarity between infrared and visible images to generate a robust and reliable wide field-of-view (FOV) scene. The primary challenge of this task is to explore the relations between multi-spectral images for aligning and integrating multi-view scenes. Capitalizing on the strengths of Graph Convolutional Networks (GCNs) in modeling feature relationships, we propose a spatial graph reasoning based multi-spectral image stitching method that effectively distills the deformation and integration of multi-spectral images across different viewpoints. To accomplish this, we embed multi-scale complementary features from the same view position into a set of nodes. The correspondence across different views is learned through powerful dense feature embeddings, where both inter- and intra-correlations are developed to exploit cross-view matching and enhance inner feature disparity. By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features, generating informative and reliable wide FOV scenes. Moreover, we release a challenging dataset named ChaMS, comprising both real-world and synthetic sets with significant parallax, providing a new option for comprehensive evaluation. Extensive experiments demonstrate that our method surpasses state-of-the-art methods.

Propagation is All You Need: A New Framework for Representation Learning and Classifier Training on Graphs

  • Jiaming Zhuo
  • Can Cui
  • Kun Fu
  • Bingxin Niu
  • Dongxiao He
  • Yuanfang Guo
  • Zhen Wang
  • Chuan Wang
  • Xiaochun Cao
  • Liang Yang

Graph Neural Networks (GNNs) have been the standard toolkit for processing non-Euclidean spatial data owing to their powerful capability in graph representation learning. Unfortunately, their training strategy for network parameters is inefficient since it is directly inherited from classic Neural Networks (NNs), ignoring the characteristics of GNNs. To alleviate this issue, experimental analyses are performed to investigate the knowledge captured in classifier parameters during network training. We conclude that the parameter features, i.e., the column vectors of the classifier parameter matrix, are cluster representations with high discriminability. After a theoretical analysis, we further conclude that the discriminability of these features is obtained from the feature propagation from nodes to parameters. Furthermore, an experiment verifies that, compared with cluster centroids, the parameter features have greater potential for augmenting the feature propagation between nodes. Accordingly, a novel GNN-specific training framework is proposed by simultaneously updating node representations and classifier parameters via a unified feature propagation scheme. Moreover, two augmentation schemes are implemented for the framework, named Full Propagation Augmentation (FPA) and Simplified Full Propagation Augmentation (SFPA). Specifically, FPA augments the feature propagation of each node with the updated classifier parameters. SFPA only augments nodes with the classifier parameters corresponding to their clusters. Theoretically, FPA is equivalent to optimizing a novel graph learning objective, which demonstrates the universality of the proposed framework to existing GNNs. Extensive experiments demonstrate the superior performance and the universality of the proposed framework.

Cross-modal Unsupervised Domain Adaptation for 3D Semantic Segmentation via Bidirectional Fusion-then-Distillation

  • Yao Wu
  • Mingwei Xing
  • Yachao Zhang
  • Yuan Xie
  • Jianping Fan
  • Zhongchao Shi
  • Yanyun Qu

Cross-modal Unsupervised Domain Adaptation (UDA) has become a research hotspot because it reduces the laborious annotation of target domain samples. Existing methods only mutually mimic the outputs of cross-modality in each domain, which enforces agreement between the class probability distributions in different domains. However, these methods ignore the complementarity brought by the modality fusion representation in cross-modal learning. In this paper, we propose a cross-modal UDA method for 3D semantic segmentation via Bidirectional Fusion-then-Distillation, named BFtD-xMUDA, which explores cross-modal fusion in UDA and realizes distribution consistency between outputs of two domains not only for 2D image and 3D point cloud but also for 2D/3D and fusion. Our method contains three significant components: Model-agnostic Feature Fusion Module (MFFM), Bidirectional Distillation (B-Distill), and Cross-modal Debiased Pseudo-Labeling (xDPL). MFFM is employed to generate cross-modal fusion features for establishing a latent space, which enforces maximum correlation and complementarity between two heterogeneous modalities. B-Distill is introduced to exploit bidirectional knowledge distillation, which includes cross-modality and cross-domain fusion distillation, thereby achieving domain-modality alignment. xDPL is designed to model the uncertainty of pseudo-labels through a self-training scheme. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several adaptation scenarios.
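The notion of two modalities mutually mimicking each other's outputs is usually expressed as a pair of KL-divergence terms between their class probability distributions. The following is a generic sketch of such a bidirectional mimicry loss, not the B-Distill component itself; the logits and function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def bidirectional_distill_loss(logits_2d, logits_3d):
    """Each branch is pulled toward the other's prediction.

    In real training each target would typically be treated as a constant
    (no gradient through it); that detail is omitted in this numeric sketch.
    """
    p2d, p3d = softmax(logits_2d), softmax(logits_3d)
    return kl(p3d, p2d) + kl(p2d, p3d)

# Per-point class logits from a 2D image branch and a 3D point-cloud branch.
logits_2d = [2.0, 0.5, -1.0]
logits_3d = [0.2, 1.5, -0.3]

agree = bidirectional_distill_loss(logits_2d, logits_2d)
disagree = bidirectional_distill_loss(logits_2d, logits_3d)
print(agree, disagree)
```

The loss vanishes when both branches predict the same distribution and grows as they diverge, which is the consistency signal that can be applied on unlabeled target-domain samples.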

SESSION: Oral Session III: Understanding Multimedia Content -- Vision and Language

Distortion-aware Transformer in 360° Salient Object Detection

  • Yinjie Zhao
  • Lichen Zhao
  • Qian Yu
  • Lu Sheng
  • Jing Zhang
  • Dong Xu

With the emergence of VR and AR, 360° data attracts increasing attention from the computer vision and multimedia communities. Typically, 360° data is projected into 2D ERP (equirectangular projection) images for feature extraction. However, existing methods cannot handle the distortions that result from the projection, hindering the development of 360-data-based tasks. Therefore, in this paper, we propose a Transformer-based model called DATFormer to address the distortion problem. We tackle this issue from two perspectives. Firstly, we introduce two distortion-adaptive modules. The first is a Distortion Mapping Module, which guides the model to pre-adapt to distorted features globally. The second module is a Distortion-Adaptive Attention Block that reduces local distortions on multi-scale features. Secondly, to exploit the unique characteristics of 360° data, we present a learnable relation matrix and use it as part of the positional embedding to further improve performance. Extensive experiments are conducted on three public datasets, and the results show that our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods. The source code is available at
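The distortion introduced by equirectangular projection has a simple geometric form: a row at latitude φ is stretched horizontally by 1/cos(φ) relative to the equator, so content near the poles is heavily over-sampled. The sketch below computes this factor per pixel row; it illustrates the projection geometry only and is not part of DATFormer. The function names and the 8-row image height are assumptions.

```python
import math

def row_latitude(row, height):
    """Latitude in radians of the centre of a pixel row in an ERP image."""
    return (0.5 - (row + 0.5) / height) * math.pi

def horizontal_stretch(row, height):
    """Factor by which ERP stretches content horizontally at this row."""
    return 1.0 / math.cos(row_latitude(row, height))

height = 8  # toy image height
stretch = [horizontal_stretch(r, height) for r in range(height)]

for r, s in enumerate(stretch):
    print(f"row {r}: latitude {math.degrees(row_latitude(r, height)):+.2f} deg, stretch x{s:.2f}")
```

Rows near the top and bottom of the image receive much larger factors than rows near the middle, which is why a model that treats all rows alike sees the same object at very different apparent widths.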

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

  • Zixiao Wang
  • Hongtao Xie
  • Yuxin Wang
  • Jianjun Xu
  • Boqiang Zhang
  • Yongdong Zhang

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

  • Bo Zou
  • Chao Yang
  • Chengbin Quan
  • Youjian Zhao

The tremendous progress of vision-to-language retrieval in recent years has been fueled by contrastive vision-language pretraining (VLP), such as CLIP. However, contrastive methods do not exhibit the same level of performance on other downstream tasks (e.g., video question answering and natural language grounding). One possible reason is that they ignore the misalignment between vision and language, especially the absence of spatial information in language. To mitigate this issue, we start from a new perspective and propose a contrastive VLP framework with spatial reconstruction on text (SpaceCLIP). Specifically, we introduce a unique reconstruction method to assign text representations into the same spatial structure as images or videos, and a pretraining objective, SpatialNCE, to reduce the computational overhead and ensure performance on downstream tasks. Empirically, we show SpaceCLIP outperforms other methods with performance gains ranging from 2.1% up to 9.0% on MSRVTT and EgoCLIP multiple-choice question answering, 2.5% up to 11.0% on EPIC-KITCHENS-100 and MSRVTT multi-instance retrieval, and 0.31% up to 7.2% on the Ego4D natural language query benchmark.

Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding

  • Xu Huang
  • Jin Liu
  • Zhizhong Zhang
  • Yuan Xie

Cross-modal recipe retrieval is an emerging visual-textual retrieval task, which aims at matching food images with the corresponding recipes. Although large-scale Vision-Language Pre-training (VLP) models have achieved impressive performance on a wide range of downstream tasks, they still perform unsatisfactorily on this cross-modal retrieval task due to the following two problems: (1) Features from food images and recipes need to be aligned, but simply fine-tuning the pre-trained VLP model's image encoder does not explicitly help with this goal. (2) The text content in the recipe is more structured than the text caption in the VLP model's pre-training corpus, which prevents the VLP model from adapting to the recipe retrieval task. In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the ability of large-scale VLP models. CIP enables us to learn the structured recipe information and therefore allows for aligning visual-textual representations without fine-tuning. Furthermore, we construct a recipe encoder termed Adaptive Recipe Merger (ARM) based on hierarchical Transformers, encouraging the model to learn more effective recipe representations. Extensive experiments on the public Recipe1M dataset demonstrate the superiority of our proposed method, which outperforms the state-of-the-art methods on the cross-modal recipe retrieval task.

Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD

  • Shuhan Kong
  • Liang Li
  • Beichen Zhang
  • Wenyu Wang
  • Bin Jiang
  • Chenggang Yan
  • Changhao Xu

Joint video moment retrieval (MR) and highlight detection (HD) aims to find relevant video moments according to the query text. Existing methods are fully supervised based on manual annotation, and their coarse multi-modal information interactions easily lose details about video and text. In addition, some tasks introduce weakly supervised learning with random masks, while the single masking forces the model to focus on masked words and ignore multi-modal contextual information. In view of this, we attempt weakly supervised joint tasks (MR+HD) and propose Dynamic Contrastive Learning with Pseudo-Sample Intervention (CPI) for better multi-modal video comprehension. First, we design pseudo-samples over random masks for a more efficient contrastive learning manner. We introduce a proportional sampling strategy for pseudo-samples to ensure the semantic difference between the pseudo-samples and the query text. This balances the over-reliance from single random mask to global text semantics and makes the model learn multimodal context from each word fairly. Second, we design dynamic intervention contrastive loss to enhance the core feature-matching ability of the model dynamically. We add pseudo-sample intervention when negative proposals are close to positive proposals. This can help the model overcome the vision confusion phenomenon and achieve semantic similarity instead of word similarity. Extensive experiments demonstrate the effectiveness of CPI and the potential of weakly supervised joint tasks.

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

  • Zheng Yuan
  • Qiao Jin
  • Chuanqi Tan
  • Zhengyun Zhao
  • Hongyi Yuan
  • Fei Huang
  • Songfang Huang

Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. The pre-trained models and codes are published at
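Retrieving similar image-text pairs with a contrastively trained encoder typically reduces to a nearest-neighbour search by cosine similarity in the shared embedding space. The sketch below shows that retrieval step in isolation, under the assumption that embeddings are already available; the corpus entries, vectors, and function names are hypothetical, and the retrieval-attention fusion described in the abstract is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, corpus, k=2):
    """Return the names of the k corpus items most similar to the query."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings of pretraining image-text pairs in the shared space.
corpus = {
    "chest_xray_a": [0.9, 0.1, 0.0],
    "chest_xray_b": [0.8, 0.2, 0.1],
    "brain_mri": [0.0, 0.1, 0.9],
}

query = [1.0, 0.1, 0.0]  # embedding of the input image for a VQA sample
neighbours = retrieve_top_k(query, corpus, k=2)
print(neighbours)
```

The retrieved neighbours would then be passed to the model alongside the original image and question, giving it extra in-domain evidence when labeled VQA data is scarce.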

RTQ: Rethinking Video-language Understanding Based on Image-text Model

  • Xiao Wang
  • Yaoyu Li
  • Tian Gan
  • Zheng Zhang
  • Jingjing Lv
  • Liqiang Nie

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

  • Shanshan Zhong
  • Zhongzhan Huang
  • Weushao Wen
  • Jinghui Qin
  • Liang Lin

Diffusion models, which have emerged to become popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, there are limitations to semantic understanding and commonsense reasoning in existing models when the input prompts are concise narratives, resulting in low-quality image generation. To improve the capacities for narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset SURD which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to the complex prompts and transfer knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation so that it can acquire the powerful semantic understanding and reasoning capabilities to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason about concise natural language without image quality degradation. Our approach can make text-to-image diffusion models easier to use with better user experience, which demonstrates our approach has the potential for further advancing the development of user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts. The code is released at

Face Encryption via Frequency-Restricted Identity-Agnostic Attacks

  • Xin Dong
  • Rui Wang
  • Siyuan Liang
  • Aishan Liu
  • Lihua Jing

Billions of people share images of their daily lives on social media every day. However, malicious collectors use deep face recognition systems to easily steal their biometric information (e.g., faces) from these images. Some studies are being conducted to generate encrypted face photos using adversarial attacks by introducing imperceptible perturbations to reduce face information leakage. However, existing studies need stronger black-box scenario feasibility and more natural visual appearances, which challenge the feasibility of privacy protection. To address these problems, we propose a frequency-restricted identity-agnostic (FRIA) framework to encrypt face images from unauthorized face recognition without access to personal information. As for the weak black-box scenario feasibility, we observe that representations of the average feature in multiple face recognition models are similar, thus we propose to utilize the average feature via the crawled dataset from the Internet as the target to guide the generation, which is also agnostic to identities of unknown face recognition systems. As for visual naturalness, the low-frequency perturbations are by nature more visually perceptible by the human vision system. Inspired by this, we restrict the perturbation in the low-frequency facial regions by discrete cosine transform to achieve the visual naturalness guarantee. Extensive experiments on several face recognition models demonstrate that our FRIA outperforms other state-of-the-art methods in generating more natural encrypted faces while attaining high black-box attack success rates of 96%. In addition, we validate the efficacy of FRIA using a real-world black-box commercial API, which reveals the potential of FRIA in practice. Our codes can be found in

Emotion-Prior Awareness Network for Emotional Video Captioning

  • Peipei Song
  • Dan Guo
  • Xun Yang
  • Shengeng Tang
  • Erkun Yang
  • Meng Wang

Emotional video captioning (EVC) is an emerging task to describe the factual content with the inherent emotion expressed in a video. It is crucial for the EVC task to effectively perceive subtle and ambiguous visual emotion cues in the stage of caption generation. However, existing captioning methods usually overlook the learning of emotions in user-generated videos, thus making the generated sentences a bit boring and soulless.

To address this issue, this paper proposes a new emotional captioning perspective in a human-like perception-priority manner, i.e., first perceiving the inherent emotion and then leveraging the perceived emotion cue to support caption generation. Specifically, we devise an Emotion-Prior Awareness Network (EPAN). It mainly benefits from a novel tree-structured emotion learning module involving both catalog-level psychological categories and lexical-level usual words to achieve the goal of explicit and fine-grained emotion perception. Besides, we develop a novel subordinate emotion masking mechanism between the catalog level and lexical level that facilitates coarse-to-fine emotion learning. Afterward, with the emotion prior, we can effectively decode the emotional caption by exploiting the complementation of visual, textual, and emotional semantics. In addition, we also introduce three simple yet effective optimization objectives, which can significantly boost the emotion learning from the perspectives of emotional captioning, hierarchical emotion classification, and emotional contrastive learning. Sufficient experimental results on three benchmark datasets clearly demonstrate the advantages of our proposed EPAN over existing SOTA methods in both semantic and emotional metrics. The extensive ablation study and visualization analysis further reveal the good interpretability of our emotional video captioning method. Code will be made available at

TE-KWS: Text-Informed Speech Enhancement for Noise-Robust Keyword Spotting

  • Dong Liu
  • Qirong Mao
  • Lijian Gao
  • Qinghua Ren
  • Zhenghan Chen
  • Ming Dong

Keyword spotting (KWS) presents a formidable challenge, particularly in high-noise environments. Traditional denoising algorithms that rely solely on speech have difficulty recovering speech that has been severely corrupted by noise. In this investigation, we develop an adaptive text-informed denoising model to bolster reliable keyword identification in the presence of considerable noise degradation. The whole proposed TE-KWS incorporates a tripartite branch structure, where the speech branch (SB) takes noisy speech as input which provides the raw speech information, the alignment branch (AB) accommodates aligned text input which facilitates accurate restoration of the corresponding speech when text with alignment is preserved, and the text branch (TB) handles unaligned text which prompts the model to autonomously learn the alignment between speech and text. To make the proposed denoising model more beneficial for KWS, following the training of the whole model, the alignment branch (AB) is frozen, and the model is fine-tuned by leveraging its speech restoration and forced alignment capabilities. Subsequently, the input for the text branch (TB) is supplanted with designated keywords, and a heavier denoising penalty is applied on the keyword period, thereby explicitly intensifying the speech restoration ability of the model for keywords. Finally, the Combined Adversarial Domain Adaptation (CADA) is implemented to enhance the robustness of KWS with regard to data pre- and post-speech enhancement (SE). Experimental results indicate that our approach not only markedly ameliorates highly corrupted speech, achieving SOTA performance for marginally corrupted speech, but also bolsters the efficacy and generalizability of prevailing mainstream KWS models.

A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval

  • Jiancheng Pan
  • Qing Ma
  • Cong Bai

This paper presents a prior instruction representation framework (PIR) for remote sensing image-text retrieval, aimed at remote sensing vision-language understanding tasks to solve the semantic noise problem. Our highlight is the proposal of a paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Concretely, two progressive attention encoder (PAE) structures, Spatial-PAE and Temporal-PAE, are proposed to perform long-range dependency modeling to enhance key feature representation. In vision representation, Vision Instruction Representation (VIR) based on Spatial-PAE exploits the prior-guided knowledge of the remote sensing scene recognition by building a belief matrix to select key features for reducing the impact of semantic noise. In text representation, Language Cycle Attention (LCA) based on Temporal-PAE uses the previous time step to cyclically activate the current time step to enhance text representation capability. A cluster-wise affiliation loss is proposed to constrain the inter-classes and to reduce the semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that using prior knowledge instruction could enhance vision and text representations and could outperform the state-of-the-art methods on two benchmark datasets, RSICD and RSITMD. Codes are available at

PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models

  • Nirmalendu Prakash
  • Han Wang
  • Nguyen Khoi Hoang
  • Ming Shan Hee
  • Roy Ka-Wei Lee

The proliferation of social media has given rise to a new form of communication: memes. Memes are multimodal and often contain a combination of text and visual elements that convey meaning, humor, and cultural significance. While meme analysis has been an active area of research, little work has been done on unsupervised multimodal topic modeling of memes, which is important for content moderation, social media analysis, and cultural studies. We propose PromptMTopic, a novel multimodal prompt-based model designed to learn topics from both text and visual modalities by leveraging the language modeling capabilities of large language models. Our model effectively extracts and clusters topics learned from memes, considering the semantic interaction between the text and visual modalities. We evaluate our proposed model through extensive experiments on three real-world meme datasets, which demonstrate its superiority over state-of-the-art topic modeling baselines in learning descriptive topics in memes. Additionally, our qualitative analysis shows that PromptMTopic can identify meaningful and culturally relevant topics from memes. Our work contributes to the understanding of the topics and themes of memes, a crucial form of communication in today's society. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression

  • Yue Lv
  • Jinxi Xiang
  • Jun Zhang
  • Wenming Yang
  • Xiao Han
  • Wei Yang

The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is non-trivial, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately 19% across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly 5% BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures. Our project is available at
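The bit-rate argument for the low-rank constraint can be illustrated with a minimal sketch. The layer shape, the rank, and the helper names below are illustrative assumptions, not the authors' implementation: an update to an m x n weight is factored as A (m x r) times B (r x n), so only r(m + n) values would need to be transmitted instead of mn.

```python
# Hypothetical sketch of a low-rank update delta_W = A @ B.
# Shapes, rank, and function names are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def low_rank_param_count(m, n, r):
    """Number of values to transmit for A (m x r) and B (r x n)."""
    return r * (m + n)

def adapt(weight, a, b):
    """Return weight + A @ B without modifying the original weight."""
    delta = matmul(a, b)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

m, n, r = 192, 192, 2
full = m * n                          # values in a dense update
low = low_rank_param_count(m, n, r)   # values in the rank-2 update

# Tiny worked example: a 2 x 2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]     # 2 x 1
B = [[0.5, -0.5]]      # 1 x 2
W_adapted = adapt(W, A, B)
```

With these assumed sizes the dense update has 36,864 values and the rank-2 update has 768, which is the kind of saving the abstract attributes to the low-rank constraint. The dynamic gating network that decides which decoder layers are adapted is not sketched here.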

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

  • Leigang Qu
  • Shengqiong Wu
  • Hao Fei
  • Liqiang Nie
  • Tat-Seng Chua

In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes the high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at

POAR: Towards Open Vocabulary Pedestrian Attribute Recognition

  • Yue Zhang
  • Suchen Wang
  • Shichao Kan
  • Zhenyu Weng
  • Yigang Cen
  • Yap-peng Tan

Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian. Recent methods often address the PAR problem by training a multi-label classifier with predefined attribute classes, but they can hardly exhaust all possible pedestrian attributes in the real world. To tackle this problem, we propose a novel Pedestrian Open-Attribute Recognition (POAR) approach by formulating the problem as a task of image-text search. Our approach employs a Transformer-based Encoder with a Masking Strategy (TEMS) to focus on the attributes of specific pedestrian parts (e.g., head, upper body, lower body, feet, etc.), and introduces a set of attribute tokens to encode the corresponding attributes into visual embeddings. Each attribute category is described as a natural language sentence and encoded by the text encoder. Then, we compute the similarity between the visual and text embeddings to find the best attribute descriptions for the input images. To handle multiple attributes of a single pedestrian, we propose a Many-To-Many Contrastive (MTMC) loss with masked tokens. In addition, we propose a Grouped Knowledge Distillation (GKD) method to minimize the disparity between visual embeddings and unseen attribute text embeddings. We evaluate our proposed method on three PAR datasets with an open-attribute setting. The results demonstrate the effectiveness of our method as a strong baseline for the POAR task. Our code is available at
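The matching step, in which visual and text embeddings are compared to pick attribute descriptions, can be sketched as follows. This is a toy illustration under stated assumptions: the two-dimensional embeddings, the part names, the sentences, and the function `best_descriptions` are all hypothetical, and a plain dot product stands in for whatever similarity the authors use.

```python
# Hypothetical sketch: pick the best attribute sentence per body part
# by comparing a visual embedding with text embeddings.
# All embeddings and names below are made-up toy values.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_descriptions(part_embeddings, text_embeddings):
    """Map each body part to the sentence with the highest similarity."""
    result = {}
    for part, visual in part_embeddings.items():
        result[part] = max(
            text_embeddings,
            key=lambda sentence: dot(visual, text_embeddings[sentence]),
        )
    return result

part_embeddings = {"head": [1.0, 0.0], "feet": [0.0, 1.0]}
text_embeddings = {
    "a person wearing a hat": [0.9, 0.1],
    "a person wearing boots": [0.2, 0.8],
}
matches = best_descriptions(part_embeddings, text_embeddings)
```

Because attributes are expressed as sentences rather than fixed class indices, new attributes can be queried by adding new sentences to `text_embeddings`, which is the open-vocabulary property the abstract describes.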

PointCRT: Detecting Backdoor in 3D Point Cloud via Corruption Robustness

  • Shengshan Hu
  • Wei Liu
  • Minghui Li
  • Yechao Zhang
  • Xiaogeng Liu
  • Xianlong Wang
  • Leo Yu Zhang
  • Junhui Hou

Backdoor attacks for point clouds have elicited mounting interest with the proliferation of deep learning. The point cloud classifiers can be vulnerable to malicious actors who seek to manipulate or fool the model with specific backdoor triggers. Detecting and rejecting backdoor samples during the inference stage can effectively alleviate backdoor attacks. Recently, some black-box test-time backdoor sample detection methods have been proposed in the 2D image domain, without any underlying assumptions about the backdoor triggers. However, upon examination, we have found that these detection techniques are not effective for 3D point clouds. As a result, there is a pressing need to bridge the gap for the development of a universal approach that is specifically designed for 3D point clouds.

In this paper, we propose the first test-time backdoor sample detection method for 3D point clouds that makes no assumptions about the backdoor triggers, called Point Clouds Corruption Robustness Test (PointCRT). Based on the fact that the corruption robustness of clean samples remains relatively stable across various backdoor models, we propose the corruption robustness score to map the features into high-dimensional space. The corruption robustness score is a vector evaluated by label consistency, whose element is the minimum severity level of corruption that changes the label prediction of the victim model. Then, the trigger is identified by detecting the abnormal corruption robustness score through a nonlinear classification. The comprehensive experiments demonstrate PointCRT deals with all cases with the average AUC over 0.934 and F1 score over 0.864, with the enhancement of 18%-28% on ModelNet40. Our codes are available at:
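The corruption robustness score described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy classifier, the two corruption functions, the severity range, and the sentinel value used when the label never changes are all assumptions.

```python
# Hypothetical sketch of a corruption robustness score: for each corruption
# type, record the minimum severity level that changes the predicted label.
# The model, corruptions, and sentinel value are illustrative assumptions.

def corruption_robustness_score(model, sample, corruptions, max_severity=5):
    clean_label = model(sample)
    score = []
    for corrupt in corruptions:
        level = max_severity + 1  # assumed sentinel: label never flipped
        for severity in range(1, max_severity + 1):
            if model(corrupt(sample, severity)) != clean_label:
                level = severity
                break
        score.append(level)
    return score

def toy_model(points):
    """Toy classifier: label 1 if the mean z coordinate exceeds 0.5."""
    mean_z = sum(p[2] for p in points) / len(points)
    return 1 if mean_z > 0.5 else 0

def shift_down(points, severity):
    """Toy corruption: lower every point by 0.2 per severity level."""
    return [(x, y, z - 0.2 * severity) for x, y, z in points]

def shrink(points, severity):
    """Toy corruption: scale the cloud by 5% less per severity level."""
    factor = 1.0 - 0.05 * severity
    return [(x * factor, y * factor, z * factor) for x, y, z in points]

sample = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
score = corruption_robustness_score(toy_model, sample, [shift_down, shrink])
```

For this toy sample the label flips at severity 3 under `shift_down` and never flips under `shrink`, giving the vector `[3, 6]`. In the method described, such vectors are then passed to a nonlinear classifier that separates clean from backdoor samples; that classifier is not sketched here.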

Blind Image Super-resolution with Rich Texture-Aware Codebook

  • Rui Qin
  • Ming Sun
  • Fangyuan Zhang
  • Xing Wen
  • Bin Wang

Blind super-resolution (BSR) methods based on high-resolution (HR) reconstruction codebooks have achieved promising results in recent years. However, we find that a codebook based on HR reconstruction may not effectively capture the complex correlations between low-resolution (LR) and HR images. In detail, multiple HR images may produce similar LR versions due to complex blind degradations, so codebooks that depend only on HR images have limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution correlation of textures between LR and HR images by exploiting the cross-resolution correspondence of textures. PTPM uses patch-wise semantic pre-training to correct the misperception of texture similarity in the high-level semantic regularization. By taking advantage of this, RTCNet effectively gets rid of the misalignment of confusing textures between HR and LR in the BSR scenarios. Experiments show that RTCNet outperforms state-of-the-art methods on various benchmarks by 0.16 to 0.46 dB.

V2Depth: Monocular Depth Estimation via Feature-Level Virtual-View Simulation and Refinement

  • Zizhang Wu
  • Zhuozheng Li
  • Zhi-Gang Fan
  • Yunzhe Wu
  • Jian Pu
  • Xianzhi Li

Due to the lack of spatial cues given merely a single image, many monocular depth estimation methods have been developed to leverage stereo or multi-view images to learn the spatial information of a scene in a self-supervised manner. However, these methods have limited performance gain since they are not able to exploit sufficient 3D geometry cues during inference, where only monocular images are available. In this work, we present V2Depth, a novel coarse-to-fine framework with Virtual View feature simulation for supervised monocular Depth estimation. Specifically, we first design a virtual-view feature simulator by leveraging the technique of novel view synthesis and contrastive learning to generate virtual view feature maps. In this way, we explicitly provide representative spatial geometry for subsequent depth estimation in both the training and inference stages. Then we introduce a 3DVA-Refiner to iteratively optimize the predicted depth map. During the optimization process, 3D-aware virtual attention is developed to capture the global spatial-context correlations to maintain the feature consistency of different views and estimation integrity of the 3D scene such as objects with occlusion relationships. Decisive improvements over state-of-the-art approaches on three benchmark datasets across all metrics demonstrate the superiority of our method.

GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos

  • Kai Chen
  • Zhipeng Wei
  • Jingjing Chen
  • Zuxuan Wu
  • Yu-Gang Jiang

Existing cross-domain transferable attacks mostly focus on exploring the adversarial transferability across homomodal domains, while the adversarial transferability across heteromodal domains, e.g., image domains to video domains, has received less attention. This paper investigates cross-modal transferable attacks from image domains to video domains with the generator-oriented approach, i.e., crafting adversarial perturbations for each frame of video clips with the perturbation generator trained in the ImageNet domain to attack target video models. To this end, we propose an effective Generative Cross-Modal Attacks (GCMA) framework to enhance adversarial transferability from image domains to video domains. To narrow the domain gap between image and video data, we first propose a random motion module that warps images with synthetic random optical flows. We then integrate the random motion module into the feature disruption loss to incorporate additional temporal cues in the training phase. Specifically, feature disruption loss minimizes the cosine similarity between intermediate features of warped benign and adversarial images. Furthermore, motivated by the positive correlation between transferability and temporal consistency of adversarial video clips, we also introduce a temporal consistency loss that maximizes the cosine similarity between intermediate features of warped adversarial images and adversarial counterparts of warped benign images. Finally, GCMA trains the perturbation generator by simultaneously optimizing feature disruption loss and temporal consistency loss. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance on Kinetics-400 and UCF-101. Our code is available at
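The two training objectives described above can be sketched with plain cosine similarities. This is a minimal illustration under stated assumptions: the feature vectors are toy values, the argument names are hypothetical, and combining the two terms as a weighted sum is an assumption, since the abstract says only that they are optimized simultaneously.

```python
# Hypothetical sketch of the two cosine-similarity objectives.
# Feature vectors, names, and the weighted-sum combination are assumptions.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def feature_disruption_loss(benign_feat, adv_feat):
    """Minimizing this pushes adversarial features away from benign ones."""
    return cosine_similarity(benign_feat, adv_feat)

def temporal_consistency_loss(warped_adv_feat, adv_of_warped_feat):
    """Negated, so minimizing the loss maximizes the similarity."""
    return -cosine_similarity(warped_adv_feat, adv_of_warped_feat)

def total_loss(benign_feat, adv_feat, warped_adv_feat, adv_of_warped_feat,
               weight=1.0):
    """Assumed combination: disruption term plus weighted consistency term."""
    return (feature_disruption_loss(benign_feat, adv_feat)
            + weight * temporal_consistency_loss(warped_adv_feat,
                                                 adv_of_warped_feat))

# Toy features: benign and adversarial are orthogonal (similarity 0),
# and the two adversarial variants coincide (similarity 1).
loss = total_loss([1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
```

In this toy case the disruption term is 0 and the consistency term is -1, so the combined loss is -1, the lowest value these two terms reach when adversarial features are orthogonal to benign ones and perfectly consistent with each other.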

AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition

  • Lianyu Hu
  • Liqing Gao
  • Zekang Liu
  • Chi-Man Pun
  • Wei Feng

Raw videos have been shown to contain considerable feature redundancy: in many cases, only a portion of the frames is enough for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). We propose a novel adaptive model (AdaBrowse) to dynamically select the most informative subsequence from input video sequences by modelling this problem as a sequential decision task. Specifically, we first utilize a lightweight network to quickly scan input videos to extract coarse features. Then these features are fed into a policy network to intelligently select a subsequence to process. The corresponding subsequence is finally inferred by a normal CSLR model for sentence prediction. As only a portion of frames are processed in this procedure, the total computation is considerably reduced. Besides temporal redundancy, we are also interested in whether the inherent spatial redundancy can be seamlessly integrated to achieve further efficiency, i.e., dynamically selecting the lowest input resolution for each sample, whose model is referred to as AdaBrowse+. Extensive experimental results on four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+ by achieving comparable accuracy with state-of-the-art methods with 1.44X throughput and 2.12X fewer FLOPs. Comparisons with other commonly-used 2D CNNs and adaptive efficient methods verify the effectiveness of AdaBrowse. Code is available at

Dynamic Triple Reweighting Network for Automatic Femoral Head Necrosis Diagnosis from Computed Tomography

  • Lingfeng Li
  • Gangming Zhao
  • Yizhou Yu
  • Jinpeng Li

Avascular necrosis of the femoral head (AVNFH) is a common orthopedic disease that seriously affects the life quality of middle-aged and elderly people. Early AVNFH is difficult to diagnose due to its complex symptoms. In recent years, some works have applied deep learning algorithms to find traces of early AVNFH in X-rays or magnetic resonance imaging (MRI). However, X-rays struggle to reveal hidden features because of tissue overlap; MRI is sensitive but requires more time for imaging and is expensive. This study aims to develop a computer-aided diagnosis system for early AVNFH based on computed tomography (CT), which provides layer-wise features and is less costly. To achieve this, a large-scale dataset for AVNFH was collected and annotated by experienced doctors. We propose the Dynamic Triple Reweighting Network (DTRNet) that integrates the AVNFH classification and weakly-supervised localization. DTRNet incorporates nested multi-instance learning as the first and second reweighting, and structure regularization as the third reweighting to identify diseases and localize the lesion region. Since nested multi-instance learning is inapplicable in situations with few positive samples in the patch set, we propose a dynamic pseudo-package module to compensate for this limitation. Experimental results show that DTRNet is superior to the baselines in AVNFH classification. In addition, it can locate lesions to provide more information for assisting clinical decisions. The desensitized data and codes have been made available at:

Category-Level Articulated Object 9D Pose Estimation via Reinforcement Learning

  • Liu Liu
  • Jianming Du
  • Hao Wu
  • Xun Yang
  • Zhenguang Liu
  • Richang Hong
  • Meng Wang

Human life is populated with articulated objects. Current category-level articulated object 9D pose estimation (ArtOPE) methods usually face the challenges of shared object representation requirement, kinematics-agnostic pose modeling and self-occlusions. In this paper, we propose a novel framework called Articulated object 9D Pose Estimation via Reinforcement Learning (ArtPERL), which formulates the category-level ArtOPE as a reinforcement learning problem. Given a point cloud or RGB-D image input, ArtPERL first retrieves the part-sensitive articulated object as reference point cloud, and then introduces a joint-centric pose modeling strategy that estimates 9D pose by fitting joint states via reinforced agent training. Finally, we further propose a pose optimization that refines the predicted 9D pose considering kinematic constraints. We evaluate our ArtPERL on various datasets ranging from synthetic point cloud to real-world multi-hinged object. Experiments demonstrate the superior performance and robustness of our ArtPERL. Our work provides a new perspective on category-level articulated object 9D pose estimation and has the potential to be applied in many fields, including robotics, augmented reality, and autonomous driving.

RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching Detection

  • Qichao Ying
  • Jiaxin Liu
  • Sheng Li
  • Haisheng Xu
  • Zhenxing Qian
  • Xinpeng Zhang

The widespread use of face retouching filters on short-video platforms has raised concerns about the authenticity of digital appearances and the impact of deceptive advertising. To address these issues, there is a pressing need to develop advanced face retouching detection techniques. However, the lack of large-scale and fine-grained face retouching datasets has been a major obstacle to progress in this field. In this paper, we introduce RetouchingFFHQ, a large-scale and fine-grained face retouching dataset that contains over half a million conditionally-retouched images. RetouchingFFHQ stands out from previous datasets due to its large scale, high quality, fine-grainedness, and customization. By including four typical types of face retouching operations and different retouching levels, we extend the binary face retouching detection into a fine-grained, multi-retouching type, and multi-retouching level estimation problem. Additionally, we propose a Multi-granularity Attention Module (MAM) as a plugin for CNN backbones for enhanced cross-scale representation learning. Extensive experiments using different baselines as well as our proposed method on RetouchingFFHQ show decent performance on face retouching detection.

Slow-Fast Time Parameter Aggregation Network for Class-Incremental Lip Reading

  • Xueyi Zhang
  • Chengwei Zhang
  • Tao Wang
  • Jun Tang
  • Songyang Lao
  • Haizhou Li

Class incremental learning, which can circumvent data privacy issues and avoid the high training costs associated with joint training, has yet to be explored in the field of lip-reading. In this paper, we introduce a benchmark for Class-Incremental Lip-Reading (CILR). To simultaneously improve the plasticity for new classes and stability for old classes in incremental learning, we propose a Slow-Fast Time Parameter Aggregation Network (TPAN) that decouples representation learning of new and old knowledge, taking into account the task characteristics of lip-reading. The TPAN comprises two dynamically evolving branches: one uses fast gradient descent and the other employs slow momentum updates to retain old knowledge while adapting to new knowledge. Additionally, to achieve efficient knowledge transfer of the incremental model, we design a Hybrid Sequence-Distribution Distillation (HSDD) strategy to transfer knowledge in temporal feature view and classification probability view. We present a comprehensive comparison of the proposed method and previous state-of-the-art class incremental learning methods on the most commonly used lip-reading datasets LRW and LRW1000. The experimental results show that the proposed method can reduce the effect of catastrophic forgetting and improve the incremental accuracy.
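The contrast between a fast gradient-descent branch and a slow momentum-updated branch can be sketched as follows. This is an illustration only: the abstract does not give the update rules, so the exponential-moving-average form of the slow update, the learning rate, the momentum value, and the function names are all assumptions.

```python
# Hypothetical sketch of a fast branch (gradient descent) and a slow branch
# (momentum update). The moving-average form of the slow update is an
# assumption; the abstract does not specify it.

def sgd_step(params, grads, lr):
    """Fast branch: one plain gradient-descent step."""
    return [p - lr * g for p, g in zip(params, grads)]

def momentum_update(slow, fast, momentum):
    """Slow branch: move a small fraction of the way toward the fast branch."""
    return [momentum * s + (1.0 - momentum) * f for s, f in zip(slow, fast)]

fast = [1.0, -2.0]
slow = [1.0, -2.0]
grads = [0.5, -1.0]

fast = sgd_step(fast, grads, lr=0.1)
slow = momentum_update(slow, fast, momentum=0.9)
```

After one step the fast parameters have moved by the full gradient step (to about 0.95 and -1.9), while the slow parameters have moved only a tenth of that distance (to about 0.995 and -1.99). This is the sense in which a slow branch can retain old knowledge while a fast branch adapts to new classes.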

Text-based Person Search without Parallel Image-Text Data

  • Yang Bai
  • Jingyao Wang
  • Min Cao
  • Chen Chen
  • Ziqiang Cao
  • Liqiang Nie
  • Min Zhang

Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description. Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data (μ-TBPS), in which only non-parallel images and texts, or even image-only data, can be adopted. Towards this end, we propose a two-stage framework, generation-then-retrieval (GTR), to first generate the corresponding pseudo text for each image and then perform the retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of the person image, which first utilizes a set of instruction prompts to activate the off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, and then converts the extracted attributes into a textual description via the finetuned large language model or the hand-crafted template. In the retrieval stage, considering that noise in the generated texts interferes with model training, we develop a confidence score-based training scheme that lets more reliable texts contribute more during training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR can achieve a promising performance without relying on parallel image-text data.

Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation

  • Jiawei Liang
  • Siyuan Liang
  • Aishan Liu
  • Ke Ma
  • Jingzhi Li
  • Xiaochun Cao

Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model. Since the teacher model perceives data in a way different from humans, existing KD methods only distill knowledge that is consistent with labels annotated by human experts while neglecting knowledge that is not consistent with human perception, which results in insufficient distillation and sub-optimal performance. In this paper, we propose inconsistent knowledge distillation (IKD), which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions. We start by considering the teacher model's counter-intuitive perceptions of frequency and non-robust features. Unlike previous works that exploit fine-grained features or introduce additional regularizations, we extract inconsistent knowledge by providing diverse input using data augmentation. Specifically, we propose a sample-specific data augmentation to transfer the teacher model's ability in capturing distinct frequency components and suggest an adversarial feature augmentation to extract the teacher model's perceptions of non-robust features in the data. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors (at most +1.0 mAP). Our codes will be made available at

CARIS: Context-Aware Referring Image Segmentation

  • Sun-Ao Liu
  • Yiheng Zhang
  • Zhaofan Qiu
  • Hongtao Xie
  • Yongdong Zhang
  • Ting Yao

Referring image segmentation aims to segment the target object described by a natural-language utterance. Recent approaches typically distinguish pixels by aligning pixel-wise visual features with linguistic features extracted from the referring description. Nevertheless, such a free-form description only specifies certain discriminative attributes of the target object or its relations to a limited number of objects, which fails to represent the rich visual context adequately. The stand-alone linguistic features are therefore unable to align with all visual concepts, resulting in inaccurate segmentation. In this paper, we propose to address this issue by incorporating rich visual context into linguistic features for sufficient vision-language alignment. Specifically, we present Context-Aware Referring Image Segmentation (CARIS), a novel architecture that enhances the contextual awareness of linguistic features via sequential vision-language attention and learnable prompts. Technically, CARIS develops a context-aware mask decoder with sequential bidirectional cross-modal attention to integrate the linguistic features with visual context, which are then aligned with pixel-wise visual features. Furthermore, two groups of learnable prompts are employed to delve into additional contextual information from the input image and facilitate the alignment with non-target pixels, respectively. Extensive experiments demonstrate that CARIS achieves new state-of-the-art performances on three public benchmarks. Code is available at

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

  • Shizhou Zhang
  • Qingchun Yang
  • De Cheng
  • Yinghui Xing
  • Guoqiang Liang
  • Peng Wang
  • Yanning Zhang

In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images with 260,559 annotated bounding boxes for 2,644 identities appearing in both UAV and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different locations, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On the basis of the G2APS benchmark dataset, we present a detailed analysis of current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code are available on

Sparse Sharing Relation Network for Panoptic Driving Perception

  • Fan Jiang
  • Zilei Wang

An efficient and accurate perception system, covering traffic object detection, drivable area segmentation, and lane detection, is critical for autonomous driving. Most previous works do not consider the spatial and semantic cues in traffic scenes. In this paper, we propose a novel multi-task learning network to exploit these priors. Specifically, to model the co-occurrence and spatial relationships of traffic objects, we propose to use a Graph Convolutional Network (GCN) block operating on the patches of feature maps. It enables adaptive discovery and incorporation of semantic and spatial relationships in the feature space. Furthermore, we propose a sub-feature sharing method to mitigate negative transfer in multi-task learning. On the basis of a fully shared base network, we split the feature space of different tasks along the channel dimension, resulting in the shared and private features for each task. It allows the network parameters to be selectively updated by different tasks during training. Experimental results on the challenging BDD100K dataset demonstrate that our proposed approach achieves consistent improvement with fewer parameters, and achieves new state-of-the-art performance in terms of accuracy and speed.

SESSION: Oral Session IV: Engaging Users with Multimedia -- Emotional and Social Signals

AcFormer: An Aligned and Compact Transformer for Multimodal Sentiment Analysis

  • Daoming Zong
  • Chaoyue Ding
  • Baoxiang Li
  • Jiakui Li
  • Ken Zheng
  • Qunyan Zhou

Multimodal Sentiment Analysis (MSA) is a popular research topic aimed at utilizing multimodal signals for understanding human emotions. The primary approach to solving this task is to develop complex fusion techniques. However, the heterogeneity and unaligned nature between modalities pose significant challenges to fusion. Additionally, existing methods lack consideration for the efficiency of modal fusion. To tackle these issues, we propose AcFormer, which contains two core ingredients: i) contrastive learning within and across modalities to explicitly align different modality streams before fusion; and ii) pivot attention for multimodal interaction/fusion. The former encourages positive triplets of image-audio-text to have similar representations in contrast to negative ones. The latter introduces attention pivots that can serve as cross-modal information bridges and limit cross-modal attention to a certain number of fusion pivot tokens. We evaluate AcFormer on multiple MSA tasks, including multimodal emotion recognition, humor detection, and sarcasm detection. Empirical evidence shows that AcFormer achieves the optimal performance with minimal computation cost compared to previous state-of-the-art methods. Our code is publicly available at

Freq-HD: An Interpretable Frequency-based High-Dynamics Affective Clip Selection Method for in-the-Wild Facial Expression Recognition in Videos

  • Zeng Tao
  • Yan Wang
  • Zhaoyu Chen
  • Boyang Wang
  • Shaoqi Yan
  • Kaixun Jiang
  • Shuyong Gao
  • Wenqiang Zhang

In-the-wild dynamic facial expression recognition (DFER) is challenging due to several high-dynamics factors such as limited dynamic expression-related frames and variable non-expression noise in facial expression sequences. To provide more expression-related clips for DFER models, we propose a novel and interpretable frequency-based method (Freq-HD) for high-dynamics affective clip selection. It can select clips containing pure expression changes from sequences and aid different DFER network structures in recognizing in-the-wild dynamic facial expressions more accurately and efficiently. We first design a novel spatial-temporal frequency analysis (STFA) module to compute the dynamics values of each clip by using sliding windows and spatial-temporal frequency analysis. Moreover, we propose a multi-band complementary selection (MBC) module to amend the inappropriate reaction of the dynamics values of different spatial frequency bands in STFA when expression-irrelevant noise occurs. Specifically, the MBC uses an ingenious mapping method to generate the inhibitory factors to complement and separate the dynamics of expressions and non-expressions in different frequency bands. The Freq-HD can select the most expression-correlated clips and their constituent frames, which could be incorporated into any existing DFER models. We extensively evaluate the Freq-HD on two in-the-wild datasets and four DFER baselines, showing that our method significantly improves the subsequent network performance while using fewer input frames and reducing computation cost. More ablation studies and visualization analysis provide further empirical evidence of the effectiveness of our method.

StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning

  • Peiguang Jing
  • Xianyi Liu
  • Ji Wang
  • Yinwei Wei
  • Liqiang Nie
  • Yuting Su

Emotion distribution learning has gained increasing attention with the tendency to express emotions through images. To handle the emotion ambiguity arising from human subjectivity, many previous methods have focused on learning appropriate representations from the holistic or significant part of images. However, they rarely consider establishing connections with the stylistic information although it can lead to a better understanding of images. In this paper, we propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL, which interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents. Specifically, we consider exploring the intra- and inter-layer correlations among GRAM-based stylistic representations, and meanwhile exploit an adversary-constrained high-order attention mechanism to capture potential interactions between subtle visual parts. In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations to benefit the final emotion distribution learning. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our proposed StyleEDL compared to state-of-the-art methods. The implementation is released at:
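
As background for the GRAM-based stylistic representations mentioned above, a Gram matrix summarizes a feature map by the inner products between its channels. The sketch below is a generic illustration and not the StyleEDL code: the name `gram_matrix`, the list-of-lists input layout, and the normalization by the number of spatial positions are assumptions.

```python
def gram_matrix(features):
    """Return the C x C matrix of channel-wise inner products.

    `features` is a list of C channels, each a flat list of N activations
    (a feature map with its spatial dimensions flattened). Dividing by N
    is an assumed normalization, chosen only for this example.
    """
    n = len(features[0])
    if any(len(channel) != n for channel in features):
        raise ValueError("all channels must have the same length")
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
            for ci in features]


# Two channels with two spatial positions each.
style = gram_matrix([[1.0, 2.0], [3.0, 4.0]])
```

Because the spatial positions are summed out, the result depends on which channels co-activate rather than on where they activate, which is why such matrices are commonly used as style descriptors.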

Variance-Aware Bi-Attention Expression Transformer for Open-Set Facial Expression Recognition in the Wild

  • Junjie Zhu
  • Bingjun Luo
  • Ao Sun
  • Jinghang Tan
  • Xibin Zhao
  • Yue Gao

Despite the great accomplishments of facial expression recognition (FER) models in closed-set scenarios, they still lack open-world robustness when it comes to handling unknown samples. To address the demands of operating in an open environment, open-set FER models should improve their performance in rejecting unknown samples while maintaining their efficiency in recognizing known expressions. With this goal in mind, we propose an open-set FER framework named Variance-Aware Bi-Attention Expression Transformer (VBExT), which enhances conventional closed-set FER models with open-world robustness for unknown samples. Specifically, to make full use of the expression representation capabilities of learned features, we introduce a bi-attention feature augmentation mechanism that learns the important regions and integrates the hierarchical features extracted by the emotional CNN backbone. We also propose a variance-aware distribution modeling method that adapts to the diverse distribution of different expression classes in the open environment, thereby enhancing the detection ability of unknown expressions. Additionally, we have constructed a Fine-Grained Light Facial Expression dataset that includes 30 different light brightnesses to better validate the efficiency of VBExT. Extensive experiments and ablation studies show that VBExT significantly improves the performance of open-set FER and achieves state-of-the-art results on CFEE (lab, basic), RAF-DB (wild, basic+compound), and FGL-FE (multiple light brightnesses, basic).

AffectFAL: Federated Active Affective Computing with Non-IID Data

  • Zixin Zhang
  • Fan Qi
  • Shuai Li
  • Changsheng Xu

Federated affective computing, which deploys traditional affective computing in a distributed framework, achieves a trade-off between privacy and utility, and offers a wide variety of applications in business and society. However, the expensive annotation cost of obtaining reliable emotion labels at the local client remains a barrier to the effective use of local emotional data. Therefore, we propose a federated active affective paradigm to improve the performance of federated affective computing with a limited annotation budget on the client. A major challenge in federated active learning is the inconsistency between the active sampling goals of global and local models, which is exacerbated when data across clients are Non-IID. To address the above challenge, we propose AffectFAL, a federated active affective computing framework. It incorporates a Preference-aware Group Aggregation module, which obtains global models representing the different emotional preferences among clients. We also devise a tailored De-biased Federated Active Sampling strategy with an improved vote entropy, facilitating class balancing of labeled samples and alleviating the inconsistency of sampling goals between the global and local models. We evaluate AffectFAL on diverse benchmarks (image, video and physiological signal) and experimental settings for affective computing. Thorough comparisons with other active sampling strategies demonstrate our method's advantages in affective computing for Non-IID federated learning.

ASTDF-Net: Attention-Based Spatial-Temporal Dual-Stream Fusion Network for EEG-Based Emotion Recognition

  • Peiliang Gong
  • Ziyu Jia
  • Pengpai Wang
  • Yueying Zhou
  • Daoqiang Zhang

Emotion recognition based on electroencephalography (EEG) has attracted significant attention and achieved considerable advances in the fields of affective computing and human-computer interaction. However, most existing studies ignore the coupling and complementarity of complex spatiotemporal patterns in EEG signals. Moreover, how to exploit and fuse crucial discriminative aspects in high redundancy and low signal-to-noise ratio EEG signals remains a great challenge for emotion recognition. In this paper, we propose a novel attention-based spatial-temporal dual-stream fusion network, named ASTDF-Net, for EEG-based emotion recognition. Specifically, ASTDF-Net comprises three main stages: first, the collaborative embedding module is designed to learn a joint latent subspace to capture the coupling of complicated spatiotemporal information in EEG signals. Second, stacked parallel spatial and temporal attention streams are employed to extract the most essential discriminative features and filter out redundant task-irrelevant factors. Finally, the hybrid attention-based feature fusion module is proposed to integrate significant features discovered from the dual-stream structure to take full advantage of the complementarity of the diverse characteristics. Extensive experiments on two publicly available emotion recognition datasets indicate that our proposed approach consistently outperforms state-of-the-art methods.

SESSION: Oral Session V: Engaging Users with Multimedia -- Multimedia Search and Recommendation

Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

  • Yishu Liu
  • Qingpeng Wu
  • Zheng Zhang
  • Jingyi Zhang
  • Guangming Lu

With its powerful representation ability and high efficiency, deep cross-modal hashing (DCMH) has become an emerging fast similarity search technique. Prior studies primarily focus on exploring pairwise similarities across modalities, but fail to comprehensively capture the multi-grained semantic correlations during intra- and inter-modal negotiation. To tackle this issue, this paper proposes a novel Multi-granularity Interactive Transformer Hashing (MITH) network, which hierarchically considers both coarse- and fine-grained similarity measurements across different modalities in one unified transformer-based framework. To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge with global-local knowledge distillation under the guidance of implicit conceptual category-level representations. Moreover, we construct a contrastive inter-modal alignment module to mine modality-independent semantic concept correspondences with instance- and token-wise contrastive learning, respectively. Such a collaborative learning paradigm can jointly alleviate the heterogeneity and semantic gaps among different modalities from a multi-granularity perspective, yielding discriminative modality-invariant hash codes. Extensive experiments on multiple representative cross-modal datasets demonstrate the consistent superiority of MITH over the existing state-of-the-art baselines. The codes are available at

Equivariant Learning for Out-of-Distribution Cold-start Recommendation

  • Wenjie Wang
  • Xinyu Lin
  • Liuhui Wang
  • Fuli Feng
  • Yinwei Wei
  • Tat-Seng Chua

Recommender systems rely on user-item interactions to learn Collaborative Filtering (CF) signals and easily under-recommend the cold-start items without historical interactions. To boost cold-start item recommendation, previous studies usually incorporate item features (e.g., micro-video content features) into CF models. They essentially align the feature representations of warm-start items with CF representations during training, and then adopt the feature representations of cold-start items to make recommendations. However, cold-start items might have feature distribution shifts from warm-start ones due to different upload times. As such, these cold-start item features fall into the underrepresented feature space, where their feature representations cannot align well with CF signals, causing poor cold-start recommendation.

To combat item feature shifts, the key lies in pushing feature representation learning to well represent the shifted item features and align with the CF representations in the underrepresented feature space. To this end, we propose an equivariant learning framework, which aims to achieve equivariant alignment between item features, feature representations, and CF representations in the underrepresented feature space. Specifically, since cold-start items are unavailable for training, we interpolate the features and CF representations of two underrepresented warm items to simulate the feature shifts. The interpolated feature representations are then regulated to achieve equivariant alignment with the interpolated features and CF representations via three alignment losses. We instantiate the proposed framework on two competitive cold-start models, and empirical results on three datasets validate that the framework significantly improves cold-start recommendation.
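
The interpolation step described above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the helper names `interpolate` and `simulate_shifted_item`, the `(features, cf_representation)` tuple layout, and the use of one shared coefficient `lam` for both vectors are assumptions.

```python
def interpolate(vec_a, vec_b, lam):
    """Convex combination lam * a + (1 - lam) * b, element by element."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]


def simulate_shifted_item(item_a, item_b, lam):
    """Build a synthetic item from two warm items.

    Each item is a hypothetical (features, cf_representation) pair. The
    same coefficient is applied to both parts so that the synthetic
    features and the synthetic CF target stay in correspondence.
    """
    features = interpolate(item_a[0], item_b[0], lam)
    cf_repr = interpolate(item_a[1], item_b[1], lam)
    return features, cf_repr


warm_a = ([0.0, 2.0], [1.0])
warm_b = ([2.0, 4.0], [3.0])
mixed_features, mixed_cf = simulate_shifted_item(warm_a, warm_b, lam=0.5)
```

In a training loop, the encoder output for `mixed_features` would then be compared against `mixed_cf` by an alignment loss; that loss is omitted here because the abstract does not specify its form.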

Target-Guided Composed Image Retrieval

  • Haokun Wen
  • Xian Zhang
  • Xuemeng Song
  • Yinwei Wei
  • Liqiang Nie

Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can retrieve the target image for a multimodal query, including a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook the conflict relationship modeling between the reference image and the modification text for improving the multimodal query composition and the adaptive matching degree modeling for promoting the ranking of the candidate images that could present different levels of matching degrees with the given query. To address these two limitations, in this work, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). In particular, TG-CIR first extracts the unified global and local attribute features for the reference/target image and the modification text with the contrastive language-image pre-training model (CLIP) as the backbone, where an orthogonal regularization is introduced to promote the independence among the attribute features. Then TG-CIR designs a target-query relationship-guided multimodal query composition module, comprising a target-free student composition branch and a target-based teacher composition branch, where the target-query relationship is injected into the teacher branch for guiding the conflict relationship modeling of the student branch. Last, apart from the conventional batch-based classification loss, TG-CIR additionally introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.

Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

  • HaoXuan Li
  • Yi Bin
  • Junrong Liao
  • Yang Yang
  • Heng Tao Shen

Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet of <anchor, positive, negative> is important for effectively training the model, e.g., hard negatives make the model learn efficiently and effectively. However, we observe that existing methods mainly employ the most similar samples as hard negatives, which may not be true negatives. In other words, the samples with high similarity but not paired with the anchor may retain positive semantic associations, and we call them false negatives. Repelling these false negatives in triplet loss would mislead the semantic representation learning and result in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which could alleviate the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately via their similarities with the anchor, based on the features extracted from image and text encoders. Then we calculate the false negative probability of a given sample based on its similarity with the anchor and the above distributions via Bayes' rule, which is employed as the sampling weight during the negative sampling process. Since there may not exist any false negative in a small batch, we design a memory module with momentum to retain a large negative buffer and implement our negative sampling strategy spanning over the buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights for the simple negatives with a cut-down strategy. Extensive experiments are conducted on Flickr30K and MS-COCO, and the results demonstrate the superiority of our proposed false negative elimination strategy. The code is available at
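
The Bayes' rule step can be made concrete with a small sketch. It is not the authors' FNE code: modeling the two similarity distributions as Gaussians, the parameter names, and the default prior of 0.5 are assumptions chosen to keep the example self-contained.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def false_negative_probability(sim, pos_mean, pos_std, neg_mean, neg_std,
                               prior_pos=0.5):
    """Posterior probability that an unpaired sample is actually a positive.

    Applies Bayes' rule to the similarity `sim` between the sample and the
    anchor, given assumed Gaussian models of positive and negative
    similarities.
    """
    like_pos = gaussian_pdf(sim, pos_mean, pos_std) * prior_pos
    like_neg = gaussian_pdf(sim, neg_mean, neg_std) * (1.0 - prior_pos)
    evidence = like_pos + like_neg
    if evidence == 0.0:  # both densities underflowed; no information
        return 0.0
    return like_pos / evidence


# A sample halfway between the two modes is equally likely to be either.
p_mid = false_negative_probability(0.5, 0.8, 0.1, 0.2, 0.1)
p_high = false_negative_probability(0.7, 0.8, 0.1, 0.2, 0.1)
```

Samples whose similarity sits near the positive mode receive a posterior close to 1, which marks them as likely false negatives when negatives are drawn.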

A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation

  • Xin Zhou
  • Zhiqi Shen

Multimodal recommender systems utilizing multimodal features (e.g., images and textual descriptions) typically show better recommendation accuracy than general recommendation models based solely on user-item interactions. Generally, prior work fuses multimodal features into item ID embeddings to enrich item representations, thus failing to capture the latent semantic item-item structures. In this context, LATTICE proposes to learn the latent structure between items explicitly and achieves state-of-the-art performance for multimodal recommendations. However, we argue the latent graph structure learning of LATTICE is both inefficient and unnecessary. Experimentally, we demonstrate that freezing its item-item structure before training can also achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. Theoretically, we examine the design of FREEDOM through a graph spectral perspective and demonstrate that it possesses a tighter upper bound on the graph spectrum. In denoising the user-item interaction graph, we devise a degree-sensitive edge pruning method, which rejects possibly noisy edges with a high probability when sampling the graph. We evaluate the proposed model on three real-world datasets and show that FREEDOM can significantly outperform the strongest baselines. Compared with LATTICE, FREEDOM achieves an average improvement of 19.07% in recommendation accuracy while reducing its memory cost by up to 6x on large graphs. The source code is available at:
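
One way to picture degree-sensitive edge pruning is to sample a subset of user-item edges with weights that shrink as the endpoint degrees grow. The sketch below is an assumption-laden illustration, not the FREEDOM source: the weight 1 / sqrt(deg(user) * deg(item)), the function names, and the seeded sampler are all choices made for this example.

```python
import random
from collections import Counter


def edge_keep_weights(edges):
    """Weight each (user, item) edge by 1 / sqrt(deg(user) * deg(item))."""
    user_deg = Counter(user for user, _ in edges)
    item_deg = Counter(item for _, item in edges)
    return [1.0 / (user_deg[u] * item_deg[i]) ** 0.5 for u, i in edges]


def sample_edges(edges, keep, seed=0):
    """Keep `keep` edges, sampled without replacement by weight.

    Uses the Efraimidis-Spirakis trick: draw u ~ Uniform(0, 1) per edge
    and keep the edges with the largest keys u ** (1 / weight).
    """
    rng = random.Random(seed)
    weights = edge_keep_weights(edges)
    keyed = [(rng.random() ** (1.0 / w), edge)
             for w, edge in zip(weights, edges)]
    keyed.sort(reverse=True)
    return [edge for _, edge in keyed[:keep]]


interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i3")]
weights = edge_keep_weights(interactions)
kept = sample_edges(interactions, keep=2)
```

Under this weighting, an edge between a very active user and a very popular item is the most likely to be dropped, while an edge between low-degree nodes is usually kept.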

ProtoHPE: Prototype-guided High-frequency Patch Enhancement for Visible-Infrared Person Re-identification

  • Guiwei Zhang
  • Yongfei Zhang
  • Zichang Tan

Visible-Infrared person re-identification is challenging due to the large modality gap. To bridge the gap, most studies heavily rely on the correlation of visible-infrared holistic person images, which may perform poorly under severe distribution shifts. In contrast, we find that some cross-modal correlated high-frequency components contain discriminative visual patterns and are less affected by variations such as wavelength, pose, and background clutter than holistic images. Therefore, we are motivated to bridge the modality gap based on such high-frequency components, and propose Prototype-guided High-frequency Patch Enhancement (ProtoHPE) with two core designs. First, to enhance the representation ability of cross-modal correlated high-frequency components, we split patches with such components by Wavelet Transform and exponential moving average Vision Transformer (ViT), then empower ViT to take the split patches as auxiliary input. Second, to obtain semantically compact and discriminative high-frequency representations of the same identity, we propose Multimodal Prototypical Contrast. To be specific, it hierarchically captures comprehensive semantics of different modal instances, facilitating the aggregation of high-frequency representations belonging to the same identity. With it, ViT can capture key high-frequency components during inference without relying on ProtoHPE, thus bringing no extra complexity. Extensive experiments validate the effectiveness of ProtoHPE.

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

  • Wei Ji
  • Xiangyan Liu
  • An Zhang
  • Yinwei Wei
  • Yongxin Ni
  • Xiang Wang

Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models that solely rely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines the disparities in information fusion at different stages. Empirical experiments are conducted to demonstrate the need to design a framework suitable for collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature interaction and mutual learning among multi-source input (ID, text, and image), while avoiding conflicts among different features during training, thereby improving recommendation accuracy. To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Second, we employ an online distillation training strategy in the prediction optimization stage to make multi-source data learn from each other and improve prediction robustness. Experimental results on a stream media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, which yield an approximately 10% improvement in performance compared to baseline models. Our code will be released at:

Zero-shot Micro-video Classification with Neural Variational Inference in Graph Prototype Network

  • Junyang Chen
  • Jialong Wang
  • Zhijiang Dai
  • Huisi Wu
  • Mengzhu Wang
  • Qin Zhang
  • Huan Wang

Micro-video classification plays a central role in online content recommendation platforms, such as Kwai and TikTok. Existing works on video classification largely exploit the interactions between users and items as well as the item labels to provide quality recommendation services. However, scarce or even no labeled data of emerging videos is a great challenge for existing classification methods. In this paper, we propose a zero-shot micro-video classification model (NVIGPN) by exploiting the hidden topics behind items to guide the representation learning in user-item interactions. Specifically, we study this zero-shot classification in two stages: (1) exploiting generalized semantic hidden-topic descriptions for transferable knowledge learning, and (2) designing a graph-based learning model for guiding the minor seen class information to the unseen ones. Through mining the transferable knowledge between the hidden topics and the small number of the seen classes, NVIGPN can achieve state-of-the-art performance in predicting the unseen classes of micro-videos. We conduct extensive experiments to demonstrate the effectiveness of our method.

Joint Searching and Grounding: Multi-Granularity Video Content Retrieval

  • Zhiguo Chen
  • Xun Jiang
  • Xing Xu
  • Zuo Cao
  • Yijun Mo
  • Heng Tao Shen

Text-based video retrieval (TVR) is a well-studied task aimed at retrieving relevant videos from a large collection in response to a given text query. Most existing TVR works assume that videos are already trimmed and fully relevant to the query, thus ignoring that most videos in real-world scenarios are untrimmed and contain a large amount of irrelevant content. Moreover, as users' queries are only relevant to video events rather than complete videos, it is also more practical to provide specific video events rather than a list of untrimmed videos. In this paper, we introduce a challenging but more realistic task called Multi-Granularity Video Content Retrieval (MGVCR), which involves retrieving both video files and specific video content with their temporal locations. This task presents significant challenges since it requires identifying and ranking the partial relevance between long videos and text queries without temporal alignment supervision between the query and relevant moments. To this end, we propose a novel unified framework, termed Joint Searching and Grounding (JSG). It consists of two branches: (1) a glance branch that coarsely aligns the query and moment proposals using inter-video contrastive learning, and (2) a gaze branch that finely aligns the two modalities using both inter- and intra-video contrastive learning. Based on the glance-to-gaze design, our JSG method learns two separate joint embedding spaces for moments and text queries using a hybrid synergistic contrastive learning strategy. Extensive experiments on three public benchmarks, i.e., Charades-STA, DiDeMo, and ActivityNet-Captions, demonstrate the superior performance of our JSG method on both video-level retrieval and event-level retrieval subtasks. Our open-source implementation code is available at

Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems

  • Yuyuan Li
  • Chaochao Chen
  • Xiaolin Zheng
  • Yizhao Zhang
  • Zhongxuan Han
  • Dan Meng
  • Jun Wang

With growing privacy concerns in recommender systems, recommendation unlearning, i.e., forgetting the impact of specific learned targets, is receiving increasing attention. Existing studies predominantly use training data, i.e., model inputs, as the unlearning target. However, we find that attackers can extract private information, e.g., gender, race, and age, from a trained model even if that information has not been explicitly encountered during training. We refer to this unseen information as an attribute and treat it as the unlearning target. To protect the sensitive attributes of users, Attribute Unlearning (AU) aims to degrade attacking performance and make target attributes indistinguishable. In this paper, we focus on a strict but practical setting of AU, namely Post-Training Attribute Unlearning (PoT-AU), where unlearning can only be performed after the training of the recommendation model is completed. To address the PoT-AU problem in recommender systems, we design a two-component loss function that consists of i) a distinguishability loss, which makes attribute labels indistinguishable to attackers, and ii) a regularization loss, which prevents drastic changes in the model that would harm recommendation performance. Specifically, we investigate two types of distinguishability measurements, i.e., user-to-user and distribution-to-distribution. We use the stochastic gradient descent algorithm to optimize our proposed loss. Extensive experiments on three real-world datasets demonstrate the effectiveness of our proposed methods.
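
The two-component loss described above can be sketched as follows. This is a simplified illustration rather than the authors' formulation: it assumes a binary attribute, uses the squared distance between group-mean embeddings as a stand-in for a distribution-to-distribution measure, and uses the mean squared change in user embeddings as the regularizer. All function names and the `weight` parameter are hypothetical.

```python
def group_mean(embeddings, labels, target):
    """Mean embedding of the users whose attribute label equals `target`."""
    rows = [e for e, label in zip(embeddings, labels) if label == target]
    if not rows:
        raise ValueError("no users with label %r" % (target,))
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def distinguishability_loss(embeddings, labels):
    """Squared distance between the two group means (assumed stand-in)."""
    mean_0 = group_mean(embeddings, labels, 0)
    mean_1 = group_mean(embeddings, labels, 1)
    return sum((a - b) ** 2 for a, b in zip(mean_0, mean_1))


def regularization_loss(embeddings, original):
    """Mean squared change relative to the trained (original) embeddings."""
    total = 0.0
    for new, old in zip(embeddings, original):
        total += sum((a - b) ** 2 for a, b in zip(new, old))
    return total / len(embeddings)


def pot_au_loss(embeddings, original, labels, weight=1.0):
    """Distinguishability term plus a weighted regularization term."""
    return (distinguishability_loss(embeddings, labels)
            + weight * regularization_loss(embeddings, original))


# Two users with different attribute labels and 2-D embeddings.
original = [[0.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
before = pot_au_loss(original, original, labels)               # groups far apart
after = pot_au_loss([[1.0, 0.0], [1.0, 0.0]], original, labels)  # groups merged
```

The example shows the tension the regularizer is meant to control: merging the two groups removes the distinguishability term but incurs a cost for moving away from the trained embeddings.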

Prior-Guided Accuracy-Bias Tradeoff Learning for CTR Prediction in Multimedia Recommendation

  • Dugang Liu
  • Yang Qiao
  • Xing Tang
  • Liang Chen
  • Xiuqiang He
  • Zhong Ming

Although debiasing in multimedia recommendation has shown promising results, most existing work relies on the ability of the model itself to fully disentangle the biased and unbiased information and arbitrarily removes all biases. However, in many business scenarios, it is usually possible to use expert knowledge to extract a subset of features associated with the biases, i.e., the confounding proxy features. Therefore, in this paper, we propose a novel debiasing framework with confounding proxy priors for accuracy-bias tradeoff learning in multimedia recommendation, or CP2Rec for short, in which these confounding proxy features, derived from expert experience, are integrated into the model as prior knowledge corresponding to the biases. Specifically, guided by these priors, we use a bias disentangling module with orthogonal constraints to force the model to avoid encoding biased information in the feature embeddings. We then introduce an auxiliary unbiased loss to synergize with the original biased loss in an accuracy-bias tradeoff module, aiming to recover the beneficial bias information from the purified feature embeddings and achieve a more reasonable accuracy-bias tradeoff in recommendation. Finally, we conduct extensive experiments on a public dataset and a product dataset to verify the effectiveness of CP2Rec. In addition, CP2Rec is also deployed on a large-scale financial multimedia recommendation platform in China and achieves a sustained performance gain.

GoRec: A Generative Cold-start Recommendation Framework

  • Haoyue Bai
  • Min Hou
  • Le Wu
  • Yonghui Yang
  • Kun Zhang
  • Richang Hong
  • Meng Wang

Multimedia-based recommendation models learn user and item preference representations by fusing both the user-item collaborative signals and the multimedia content signals. In real scenarios, cold items appear in the test stage without any user interaction record. Cold item recommendation is challenging because the training items and test items have different data distributions. Since these hybrid preference representations contain auxiliary collaborative signals, current solutions design alignment functions to transfer the learned hybrid preference representations to cold items. Despite their effectiveness, we argue that they are still limited, as these models rely heavily on carefully hand-designed alignment functions, which are easily influenced by the limited item records and noise in the training data.

To tackle the above limitations, we propose a Generative cold-start Recommendation (GoRec) framework for multimedia-based new item recommendation. Specifically, we design a Conditional Variational AutoEncoder (CVAE) based method that first estimates the underlying distribution of each warm item conditioned on the multimedia content representation. Then, we propose a uniformity-enhanced optimization objective to ensure the latent space of CVAE is more distinguishable and informative. In the inference stage, a generative approach is designed to obtain warm-up new item representations from the latent distribution. Please note that GoRec is applicable to arbitrary recommendation backbones. Extensive experiments on three real datasets and various recommendation backbones verify the superiority of our proposed framework. The code is available at

Prototype-guided Knowledge Transfer for Federated Unsupervised Cross-modal Hashing

  • Jingzhi Li
  • Fengling Li
  • Lei Zhu
  • Hui Cui
  • Jingjing Li

Although deep cross-modal hashing methods have recently shown superior performance in cross-modal retrieval, there is a concern about potential data privacy leakage when training the models. Federated learning adopts a distributed machine learning strategy, which can collaboratively train models without leaking local private data. It is a promising technique to support privacy-preserving cross-modal hashing. However, existing federated learning-based cross-modal retrieval methods usually rely on a large number of semantic annotations, which limits the scalability of the retrieval models. Furthermore, they mostly update the global models by aggregating local model parameters, ignoring the differences in the quantity and category of multi-modal data from multiple clients. To address these issues, we propose a Prototype Transfer-based Federated Unsupervised Cross-modal Hashing (PT-FUCH) method for solving the privacy leakage problem in cross-modal retrieval model learning. PT-FUCH protects local private data by exploring unified global prototypes for different clients, without relying on any semantic annotations. Global prototypes are used to guide the local cross-modal hash learning and promote the alignment of the feature space, thereby alleviating the model bias caused by the difference in the distribution of local multi-modal data and improving the retrieval accuracy. Additionally, we design an adaptive cross-modal knowledge distillation to transfer valuable semantic knowledge from modal-specific global models to local prototype learning processes, reducing the risk of overfitting. Experimental results on three benchmark cross-modal retrieval datasets validate that our PT-FUCH method can achieve outstanding retrieval performance when trained under distributed privacy-preserving mode. The source codes of our method are available at

SESSION: Oral Session VI: Engaging Users with Multimedia -- Interactions and Quality of Experience

EAT: An Enhancer for Aesthetics-Oriented Transformers

  • Shuai He
  • Anlong Ming
  • Shuntian Zheng
  • Haobin Zhong
  • Huadong Ma

Transformers have shown great potential in various vision tasks, but none of them have surpassed the best CNN model on image aesthetics assessment (IAA) tasks. IAA is a challenging task in multimedia systems that requires attention to both foreground and background, as well as robustness to noisy and redundant labels. The global and dense attention mechanism of Transformers, designed for saliency-oriented tasks, may miss important aesthetic information in the background, increase the computational cost and slow down the convergence on IAA tasks. To address these issues, we propose an Enhancer for Aesthetics-Oriented Transformers (EAT). EAT uses a deformable, sparse and data-dependent attention mechanism that learns where to focus and how to refine attention by offsets. EAT also guides the offsets to balance the attention between foreground and background according to dedicated rules. Our EAT-enhanced Transformers outperform the previous methods on four representative datasets with fewer training epochs. Code is available in

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons

  • Sicheng Yang
  • Zilin Wang
  • Zhiyong Wu
  • Minglei Li
  • Zhensong Zhang
  • Qiaochu Huang
  • Lei Hao
  • Songcen Xu
  • Xiaofei Wu
  • Changpeng Yang
  • Zonghong Dai

The automatic co-speech gesture generation draws much attention in computer animation. Previous works designed network structures on individual datasets, which resulted in a lack of data volume and generalizability across different motion capture standards. In addition, it is a challenging task due to the weak correlation between speech and gestures. To address these problems, we present UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons. Specifically, we first present a retargeting network to learn latent homeomorphic graphs for different motion capture standards, unifying the representations of various gestures while extending the dataset. We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention to generate better speech-matched and realistic gestures. To further align speech and gesture and increase diversity, we incorporate reinforcement learning on the discrete gesture units with a learned reward function. Extensive experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach

  • Haoning Wu
  • Erli Zhang
  • Liang Liao
  • Chaofeng Chen
  • Jingwen Hou
  • Annan Wang
  • Wenxiu Sun
  • Qiong Yan
  • Weisi Lin

The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging as it can be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors is still obscure, hindering VQA methods from more concrete quality evaluations (e.g. sharpness of a video). To solve this problem, we collect over two million opinions on 4,543 in-the-wild videos on 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g. motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences on semantic contents and aesthetic issues (e.g. composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to choose among a positive, a negative, and a neutral label for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to more comprehensively analyze their strengths and weaknesses. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture important quality issues as observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions, and superb generalization ability on existing datasets. Code and data available at

Sketch Input Method Editor: A Comprehensive Dataset and Methodology for Systematic Input Recognition

  • Guangming Zhu
  • Siyuan Wang
  • Qing Cheng
  • Kelong Wu
  • Hao Li
  • Liang Zhang

With the recent surge in the use of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. While previous research has focused on tasks such as recognition, retrieval, and generation of familiar everyday objects, this study aims to create a Sketch Input Method Editor (SketchIME) specifically designed for a professional Command, Control, Communications, Computer, and Intelligence (C4I) system. Within this system, sketches are utilized as low-fidelity prototypes for recommending standardized symbols in the creation of comprehensive situation maps. This paper also presents a systematic dataset comprising 374 specialized sketch types, and proposes a simultaneous recognition and segmentation architecture with multilevel supervision between recognition and segmentation to improve performance and enhance interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly enhanced. Results from experiments conducted on both the proposed dataset and the SPG dataset illustrate the superior performance of the proposed architecture. Our dataset and code are publicly available at

StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability

  • Tengchuan Kou
  • Xiaohong Liu
  • Wei Sun
  • Jun Jia
  • Xiongkuo Min
  • Guangtao Zhai
  • Ning Liu

Video shakiness is an unpleasant distortion of User Generated Content (UGC) videos, which is usually caused by holding the camera unsteadily. In recent years, many video stabilization algorithms have been proposed, yet no specific and accurate metric enables a comprehensive evaluation of video stability. Indeed, most existing quality assessment models evaluate video quality as a whole without specifically taking the subjective experience of video stability into consideration. Therefore, these models cannot measure video stability explicitly and precisely when severe shakes are present. In addition, there is no publicly available large-scale video database that includes videos with various degrees of shakiness and the corresponding subjective scores, which hinders the development of Video Quality Assessment for Stability (VQA-S). To this end, we build a new database named StableDB that contains 1,952 diversely-shaky UGC videos, where each video has a Mean Opinion Score (MOS) on the degree of video stability rated by 34 subjects. Moreover, we elaborately design a novel VQA-S model named StableVQA, which consists of three feature extractors to acquire the optical flow, semantic, and blur features respectively, and a regression layer to predict the final stability score. Extensive experiments demonstrate that StableVQA achieves a higher correlation with subjective opinions than the existing VQA-S models and generic VQA models. The database and codes are available at

Spatial-angular Quality-aware Representation Learning for Blind Light Field Image Quality Assessment

  • Jianjun Xiang
  • Yuanjie Dang
  • Peng Chen
  • Ronghua Liang
  • Ruohong Huan
  • Zhengyu Zhang

Blind light field image quality assessment (BLFIQA) remains a challenging task in deep learning due to the unique spatial-angular structure of light field images (LFIs) and the lack of large-scale labeled data for training. In this work, we propose a novel BLFIQA method using spatial-angular quality-aware representation learning in a self-supervised learning manner. Visual content and distortion type are important factors affecting the perceived quality of LFIs. In our observation, the band-pass transform maps of LFIs with the same distortion type exhibit similar Gaussian distributions. Thus, we learn spatial-angular quality-aware representations by minimizing the distance in the embedding space between the luminance map and the band-pass transform map of the same LFI. To implement spatial-angular quality-aware representations of LFI, we also build a large-scale unlabeled dataset containing 40k distorted LFIs with different distortion types and visual content. Further, we propose a fusion-separation-fusion network (FSFNet) to extract features for representing the intrinsic spatial-angular structure of the LFI. After pre-training on the unlabeled dataset using the proposed self-supervised learning, the FSFNet is employed for downstream BLFIQA tasks and achieves good performance. Experimental results show that our proposed method outperforms seventeen state-of-the-art models on the Win5-LID, NBU-LF1.0 and LFDD datasets, and achieves 3.78%, 6.61% and 4.06% SRCC improvements, respectively. The code and dataset will be publicly available in
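
The SRCC figures quoted above refer to the Spearman rank-order correlation coefficient between predicted quality scores and subjective scores. As a minimal sketch of how that metric is computed (the standard definition, with ties given average ranks; the function names and sample values are illustrative and not taken from the paper):

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted, subjective):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(subjective):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(predicted), average_ranks(subjective))


# Hypothetical predicted scores and subjective scores for four images.
score = srcc([0.10, 0.40, 0.35, 0.80], [1.0, 3.0, 2.0, 4.0])
```

Because only the ordering matters, a model whose scores are on a different scale from the subjective scores can still reach an SRCC of 1 as long as it ranks the images identically.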

Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement

  • Yunlong Dong
  • Xiaohong Liu
  • Yixuan Gao
  • Xunchu Zhou
  • Tao Tan
  • Guangtao Zhai

Recently, User Generated Content (UGC) videos have become ubiquitous in our daily lives. However, due to the limitations of photographic equipment and techniques, UGC videos often contain various degradations, among which one of the most visually unfavorable is underexposure. Therefore, corresponding video enhancement algorithms such as Low-Light Video Enhancement (LLVE) have been proposed to deal with this specific degradation. However, unlike video enhancement algorithms, almost all existing Video Quality Assessment (VQA) models are general-purpose rather than specialized, measuring the quality of a video from a comprehensive perspective. To the best of our knowledge, there is no VQA model specially designed for videos enhanced by LLVE algorithms. To this end, we first construct a Low-Light Video Enhancement Quality Assessment (LLVE-QA) dataset in which 254 original low-light videos are collected and then enhanced by leveraging 8 LLVE algorithms to obtain 2,060 videos in total. Moreover, we propose a quality assessment model specialized in LLVE, named Light-VQA. More concretely, since brightness and noise have the greatest impact on low-light enhanced VQA, we handcraft corresponding features and integrate them with deep-learning-based semantic features as the overall spatial information. As for temporal information, in addition to deep-learning-based motion features, we also investigate the handcrafted brightness consistency among video frames, and the overall temporal information is their concatenation. Subsequently, spatial and temporal information is fused to obtain the quality-aware representation of a video. Extensive experimental results show that our Light-VQA achieves the best performance against the current State-Of-The-Art (SOTA) on LLVE-QA and a public dataset. Dataset and Codes can be found at

Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment

  • Kun Yuan
  • Zishang Kong
  • Chuanchuan Zheng
  • Ming Sun
  • Xing Wen

Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted rising attention with the rapid development of streaming media platforms, such as Facebook, TikTok, Kwai, and so on. Compared with other sequence-based visual tasks (e.g., action recognition), VQA faces two under-estimated challenges that remain unresolved in User Generated Content (UGC) videos. First, it is not rare that a few frames containing serious distortions (e.g., blocking, blurriness) determine the perceptual quality of the whole video, while other sequence-based tasks require more frames of equal importance for representations. Second, the perceptual quality of a video exhibits a multi-distortion distribution, due to the differences in the duration and probability of occurrence for various distortions. In order to solve the above challenges, we propose Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently. Methodologically, a Sparse Temporal Attention (STA) is proposed to sample keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from O(T^2) to O(T log T). Structurally, a Multi-Pathway Temporal Network (MPTN) utilizes multiple STA modules with different degrees of sparsity in parallel, capturing co-existing distortions in a video. Experimentally, VQT demonstrates superior performance to many state-of-the-art methods on three public no-reference VQA datasets. Furthermore, VQT shows better performance on four full-reference VQA datasets against widely-adopted industrial algorithms (e.g., VMAF and AVQT).
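
To give a sense of scale for the reduction from quadratic to T log T complexity, the sketch below compares the two growth rates for a few frame counts. It is illustrative arithmetic only: constant factors are ignored, the logarithm base (2) is an assumption, and the function names are hypothetical rather than part of VQT.

```python
import math


def dense_attention_cost(num_frames):
    """Pairwise frame interactions under dense attention: T * T."""
    return num_frames * num_frames


def sparse_attention_cost(num_frames):
    """Growth rate T * log2(T); the base-2 logarithm is an assumption."""
    return num_frames * math.log2(num_frames)


# Ratio of dense to sparse cost for a few hypothetical clip lengths.
for frames in (64, 256, 1024):
    ratio = dense_attention_cost(frames) / sparse_attention_cost(frames)
    print(frames, dense_attention_cost(frames), sparse_attention_cost(frames), ratio)
```

Under these assumptions the gap widens with clip length, which is why the saving matters most for long videos.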

Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction

  • Kaiyuan Hu
  • Haowen Yang
  • Yili Jin
  • Junhua Liu
  • Yongting Chen
  • Miao Zhang
  • Fangxin Wang

Volumetric video has emerged as an attractive new video paradigm in recent years since it provides an immersive and interactive 3D viewing experience with six degrees of freedom (DoF). Unlike traditional 2D or panoramic videos, volumetric videos require dense point clouds, voxels, meshes, or huge neural models to depict volumetric scenes, which results in a prohibitively high bandwidth burden for video delivery. Users' behavior analysis, especially the viewport and gaze analysis, then plays a significant role in prioritizing the content streaming within users' viewport and degrading the remaining content to maximize user QoE with limited bandwidth. Although understanding user behavior is crucial, to the best of our knowledge, there are no available 3D volumetric video viewing datasets containing fine-grained user interactivity features, not to mention further analysis and behavior prediction.

In this paper, we release, for the first time, a large-scale volumetric video viewing behavior dataset covering multiple dimensions and diverse conditions. We conduct an in-depth analysis to understand user behaviors when viewing volumetric videos. Interesting findings on user viewport, gaze, and motion preference related to different videos and users are revealed. We finally design a transformer-based viewport prediction model that fuses the features of both gaze and motion and achieves high accuracy under various conditions. Our prediction model is expected to further benefit volumetric video streaming optimization.

Our dataset, along with its corresponding visualization tools and prediction models, is accessible at

AesCLIP: Multi-Attribute Contrastive Learning for Image Aesthetics Assessment

  • Xiangfei Sheng
  • Leida Li
  • Pengfei Chen
  • Jinjian Wu
  • Weisheng Dong
  • Yuzhe Yang
  • Liwu Xu
  • Yaqian Li
  • Guangming Shi

Image aesthetics assessment (IAA) aims at predicting the aesthetic quality of images. Recently, large pre-trained vision-language models, like CLIP, have shown impressive performance on various visual tasks. When it comes to IAA, a straightforward way is to finetune the CLIP image encoder using aesthetic images. However, this can only achieve limited success without considering the uniqueness of multimodal data in the aesthetics domain. People usually assess image aesthetics according to fine-grained visual attributes, e.g., color, light and composition. However, how to learn aesthetics-aware attributes from the CLIP-based semantic space has not been addressed before. With this motivation, this paper presents a CLIP-based multi-attribute contrastive learning framework for IAA, dubbed AesCLIP. Specifically, AesCLIP consists of two major components, i.e., aesthetic attribute-based comment classification and attribute-aware learning. The former classifies the aesthetic comments into different attribute categories. Then the latter learns an aesthetic attribute-aware representation by contrastive learning, aiming to mitigate the domain shift from the general visual domain to the aesthetics domain. Extensive experiments have been conducted using the pre-trained AesCLIP on four popular IAA databases, and the results demonstrate the advantage of AesCLIP over state-of-the-art methods. The source code will be public at

SESSION: Oral Session VII: Engaging Users with Multimedia -- Metaverse, Art and Culture

Feeling Present! From Physical to Virtual Cinematography Lighting Education with Metashadow

  • Zheng Wei
  • Xian Xu
  • Lik-Hang Lee
  • Wai Tong
  • Huamin Qu
  • Pan Hui

The high cost and limited availability of soundstages for cinematography lighting education pose significant challenges for art institutions. Traditional teaching methods, combining basic lighting equipment operation with slide lectures, often yield unsatisfactory results, hindering students' mastery of cinematography lighting techniques. Therefore, we propose Metashadow, a virtual reality (VR) cinematography lighting education system demonstrating the feasibility of learning in a virtual soundstage. Based on the presence theory, Metashadow features high-fidelity lighting devices that enable users to adjust multiple parameters, providing a quantifiable learning approach. We evaluated Metashadow with 24 participants and found that it provides better learning outcomes than traditional teaching methods regarding presence, collaboration, usability, realism, creativity, and flexibility. Six experts also praised the Metashadow's expressiveness and its learning outcomes. Our study demonstrates the potential of VR technology to enhance cinematography lighting education while imposing a smaller cost burden and space requirement.

Automatic Generation of Commercial Scenes

  • Shao-Kui Zhang
  • Jia-Hong Liu
  • Yike Li
  • Tianyi Xiong
  • Ke-Xin Ren
  • Hongbo Fu
  • Song-Hai Zhang

Commercial scenes such as markets and shops are common in both virtual scenes and real-world interior designs. However, existing literature on interior scene synthesis mainly focuses on formulating and optimizing residential scenes such as bedrooms, living rooms, etc. Existing literature typically presents a set of relations among objects. It recognizes each furniture object as the smallest unit while optimizing a residential room. However, object relations become less critical in commercial scenes since shelves are often placed next to each other, so pre-calculated relations of objects are less needed. Instead, interior designers resort to evaluating how groups of objects perform in commercial scenes, i.e., the smallest unit to be evaluated is a group of objects. This paper presents a system that automatically synthesizes market-like commercial scenes in virtual environments. Following the rules of commercial layout design, we parameterize groups of objects as "patterns" contributing to a scene. Each pattern directly yields a human-centric routine locally, provides potential connectivity with other routines, and derives the arrangements of objects concerning itself according to the assigned parameters. In order to optimize a scene, the patterns are iteratively multiplexed to insert new routines or modify existing ones under a set of constraints derived from commercial layout designs. Through extensive experiments, we demonstrate the ability of our framework to generate plausible and practical commercial scenes.

Control3D: Towards Controllable Text-to-3D Generation

  • Yang Chen
  • Yingwei Pan
  • Yehao Li
  • Ting Yao
  • Tao Mei

Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively controlling and shaping the synthetic 3D content according to users' desired specifications (e.g., a sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the image rendered from the synthetic 3D scene. The estimated sketch of each sampled view is further enforced to be geometrically consistent with the given sketch, pursuing more controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.

Reconnecting the Broken Civilization: Patchwork Integration of Fragments from Ancient Manuscripts

  • Yuqing Zhang
  • Zhou Fang
  • Xinyu Yang
  • Shengyu Zhang
  • Baoyi He
  • Huaiyong Dou
  • Junchi Yan
  • Yongquan Zhang
  • Fei Wu

The rich tapestry of human history is often painstakingly pieced together from ancient manuscripts, serving as resilient time capsules of cultural heritage, societal norms, religious tenets, and quotidian life. Unfortunately, the ravages of time, careless preservation, and varied forms of degradation frequently leave us with fragmented relics. The traditional process of reconstructing these fragments is an arduous task, demanding exhaustive manual intervention and a global collaboration among archaeologists. This paper presents a transformative approach to this challenge, harnessing multi-media techniques to restore the connectable fragments of the invaluable Dunhuang scrolls. We curate a unique multimodal dataset of the fragmented Dunhuang manuscripts and architect an innovative three-tiered pipeline to reconstruct these historical scrolls. Our initial stage uses a text-based localization strategy, filtering fragment pairs through text comparison. We then employ a novel self-supervised contour-based pairwise matching framework to overcome the hurdle of limited labeled pairing samples. This process is powered by data augmentation techniques and a Siamese network which determines the most compatible matches. The final stage in our pipeline globally reconstructs the selected fragment pairs with hierarchical clustering, bringing us closer to the original grandeur of the Dunhuang scrolls. Our empirical evaluations reveal that this pipeline exhibits a remarkable success rate in fragment assembly. By addressing this cross-disciplinary challenge, our dataset and pipeline not only contribute to the field of multi-media artificial intelligence but also hold profound implications for sociocultural studies and future explorations into the understanding of ancient cultural history.

SESSION: Oral Session VIII: Engaging Users with Multimedia -- Multimedia Applications

Cal-SFDA: Source-Free Domain-adaptive Semantic Segmentation with Differentiable Expected Calibration Error

  • Zixin Wang
  • Yadan Luo
  • Zhi Chen
  • Sen Wang
  • Zi Huang

The prevalence of domain adaptive semantic segmentation has prompted concerns regarding source domain data leakage, where private information from the source domain could inadvertently be exposed in the target domain. To circumvent the requirement for source data, source-free domain adaptation has emerged as a viable solution that leverages self-training methods to pseudo-label high-confidence regions and adapt the model to the target data. However, the confidence scores obtained are often highly biased due to overconfidence and class-imbalance issues, which render both model selection and optimization problematic. In this paper, we propose a novel calibration-guided source-free domain adaptive semantic segmentation (Cal-SFDA) framework. The core idea is to estimate the expected calibration error (ECE) from the segmentation predictions, serving as a strong indicator of the model's generalization capability to the unlabeled target domain. The estimated ECE scores, in turn, assist the model training and fair selection in both source training and target adaptation stages. During model pre-training on the source domain, we ensure the differentiability of the ECE objective by leveraging the LogSumExp trick and using ECE scores to select the best source checkpoints for adaptation. To enable ECE estimation on the target domain without requiring labels, we train a value net for ECE estimation and apply statistic warm-up on its BatchNorm layers for stability. The estimated ECE scores assist in determining the reliability of prediction and enable class-balanced pseudo-labeling by positively guiding the adaptation progress and inhibiting potential error accumulation. Extensive experiments on two widely-used synthetic-to-real transfer tasks show that the proposed approach surpasses previous state-of-the-art by up to 5.25% of mIoU with fair model selection criteria.
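
As a rough illustration of the quantity this abstract builds on, the sketch below computes the standard binned expected calibration error from per-pixel (or per-sample) confidences and correctness flags. It is the usual non-differentiable estimator, not the differentiable LogSumExp variant or the value net described above; the function name `expected_calibration_error` and the choice of 10 equal-width bins are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy input: four predictions at 0.9 confidence, three of them right.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means the confidence scores track accuracy more closely, which is why such a score can plausibly serve as a label-dependent checkpoint selection signal on the source domain.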

Frequency Perception Network for Camouflaged Object Detection

  • Runmin Cong
  • Mengyao Sun
  • Sanyi Zhang
  • Xiaofei Zhou
  • Wei Zhang
  • Yao Zhao

Camouflaged object detection (COD) aims to accurately detect objects hidden in the surrounding environment. However, existing COD methods mainly locate camouflaged objects in the RGB domain, and their potential has not been fully exploited in many challenging scenarios. Considering that the features of the camouflaged object and the background are more discriminative in the frequency domain, we propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage. With the multi-level features extracted by the backbone, we design a flexible frequency perception module based on octave convolution for coarse positioning. Then, we design the correction fusion module to integrate the high-level features step by step through prior-guided correction and cross-layer feature channel association, and finally combine them with the shallow features to achieve detailed correction of the camouflaged objects. Compared with existing models, our proposed method achieves competitive performance on three popular benchmark datasets, both qualitatively and quantitatively. The code will be released at

SepMark: Deep Separable Watermarking for Unified Source Tracing and Deepfake Detection

  • Xiaoshuai Wu
  • Xin Liao
  • Bo Ou

Malicious Deepfakes have led to a sharp conflict over distinguishing between genuine and forged faces. Although many countermeasures have been developed to detect Deepfakes ex-post, passive forensics has not considered any preventive measures for the pristine face before foreseeable manipulations. To complete this forensics ecosystem, we put forward a proactive solution dubbed SepMark, which provides a unified framework for source tracing and Deepfake detection. SepMark originates from encoder-decoder-based deep watermarking but has two separable decoders. As the first deep separable watermarking scheme, SepMark brings a new paradigm to the established study of deep watermarking, where a single encoder embeds one watermark elegantly, while two decoders can extract the watermark separately at different levels of robustness. The robust decoder, termed Tracer, resists various distortions and may have an overly high level of robustness, allowing the watermark to survive both before and after Deepfake. The semi-robust one, termed Detector, is selectively sensitive to malicious distortions, making the watermark disappear after Deepfake. Only SepMark, comprising both Tracer and Detector, can reliably trace the trusted source of the marked face and detect whether it has been altered since being marked; neither of the two alone can achieve this. Extensive experiments demonstrate the effectiveness of the proposed SepMark on typical Deepfakes, including face swapping, expression reenactment, and attribute editing. Code will be available at

SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection

  • Runmin Cong
  • Yuchen Guan
  • Jinpeng Chen
  • Wei Zhang
  • Yao Zhao
  • Sam Kwong

Despite significant progress in shadow detection, current methods still struggle with the adverse impact of background color, which may lead to errors when shadows are present on complex backgrounds. Drawing inspiration from the human visual system, we treat the input shadow image as a composition of a background layer and a shadow layer, and design a Style-guided Dual-layer Disentanglement Network (SDDNet) to model these layers independently. To achieve this, we devise a Feature Separation and Recombination (FSR) module that decomposes multi-level features into shadow-related and background-related components by offering specialized supervision for each component, while preserving information integrity and avoiding redundancy through the reconstruction constraint. Moreover, we propose a Shadow Style Filter (SSF) module to guide the feature disentanglement by focusing on style differentiation and uniformization. With these two modules and our overall pipeline, our model effectively minimizes the detrimental effects of background color, yielding superior performance on three public datasets with a real-time inference speed of 32 FPS. Our code is publicly available at:

High-Order Tensor Recovery Coupling Multilayer Subspace Priori with Application in Video Restoration

  • Hao Tan
  • Weichao Kong
  • Feng Zhang
  • Wenjin Qin
  • Jianjun Wang

In the real world, a large amount of high-order tensor data (order>3) exists, such as color videos, multispectral videos, and light-field images. However, these data often face challenges in transportation, storage, and susceptibility to damage. Meanwhile, most existing tensor-based information processing methods only concentrate on third-order tensors, which may not meet the complex requirements of high-dimensional data processing. In this paper, to better address the high-order tensor recovery issue, we propose a novel method that couples multilayer subspace priors with high-order tensor recovery techniques for tensor completion and robust tensor principal component analysis. Moreover, we provide theoretical guarantees for our approach's recovery and demonstrate that it achieves comparable performance under weaker incoherent conditions. Additionally, we develop two efficient and interpretable algorithms based on the alternating direction method of multipliers (ADMM) to solve our model. Owing to the adaptability of subspace prior information, our method demonstrates superior performance in recovering various types of data, including color videos and multispectral videos, compared with various advanced algorithms currently available.

Digging into Depth Priors for Outdoor Neural Radiance Fields

  • Chen Wang
  • Jiadai Sun
  • Lina Liu
  • Chenming Wu
  • Zhelun Shen
  • Dayan Wu
  • Yuchao Dai
  • Liangjun Zhang

Neural Radiance Fields (NeRFs) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse viewpoints setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate the issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated. Moreover, the relative merits of different ways of using the depth priors also remain unexplored. In this paper, we provide a comprehensive study and evaluation of employing depth priors in outdoor neural radiance fields, covering common depth sensing technologies and most ways of applying them. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly-used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can potentially benefit practitioners and researchers in training their NeRF models with depth priors. Project page:

ECENet: Explainable and Context-Enhanced Network for Multi-modal Fact Verification

  • Fanrui Zhang
  • Jiawei Liu
  • Qiang Zhang
  • Esther Sun
  • Jingyi Xie
  • Zheng-Jun Zha

Recently, falsified claims incorporating both text and images have been disseminated more effectively than those containing text alone, raising significant concerns for multi-modal fact verification. Existing research makes contributions to multi-modal feature extraction and interaction, but fails to fully utilize and enhance the valuable and intricate semantic relationships between distinct features. Moreover, most detectors merely provide a single outcome judgment and lack an inference process or explanation. Taking these factors into account, we propose a novel Explainable and Context-Enhanced Network (ECENet) for multi-modal fact verification, making the first attempt to integrate multi-clue feature extraction, multi-level feature reasoning, and justification (explanation) generation within a unified framework. Specifically, we propose an Improved Coarse- and Fine-grained Attention Network, equipped with two types of level-grained attention mechanisms, to facilitate a comprehensive understanding of contextual information. Furthermore, we propose a novel justification generation module via deep reinforcement learning that does not require additional labels. In this module, a sentence extractor agent measures the importance between the query claim and all document sentences at each time step, selecting a suitable amount of high-scoring sentences to be rewritten as the explanation of the model. Extensive experiments demonstrate the effectiveness of the proposed method.

Client-Adaptive Cross-Model Reconstruction Network for Modality-Incomplete Multimodal Federated Learning

  • Baochen Xiong
  • Xiaoshan Yang
  • Yaguang Song
  • Yaowei Wang
  • Changsheng Xu

Multimodal federated learning (MFL) is an emerging field that allows many distributed clients, each with multimodal data, to work together to train models targeting multimodal tasks without sharing local data. However, existing methods assume that all modalities for each sample are complete, which limits their practicality. In this paper, we propose a Client-Adaptive Cross-Modal Reconstruction Network (CACMRN) to solve the modality-incomplete multimodal federated learning (MI-MFL) problem. Compared with the data available to existing centralized methods for reconstructing missing modalities, the local client data in federated learning is typically much smaller, which makes it challenging to train a reliable reconstruction model that can accurately predict missing data. We propose a cross-modal reconstruction transformer, which can prevent the model from overfitting on the local client by exploring instance-instance relationships within the local client and utilizing normalized self-attention to conduct data-dependent partial updating. Using federated optimization with alternating local updating and global aggregation, our method can not only collaboratively utilize the distributed data on different local clients to learn the cross-modal reconstruction transformer, but also prevent the reconstruction model from overfitting the data on the local client. Extensive experimental results on three datasets demonstrate the effectiveness of our method.

AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation

  • Jinpeng Lin
  • Min Zhou
  • Ye Ma
  • Yifan Gao
  • Chenxi Fei
  • Yangjian Chen
  • Zhang Yu
  • Tiezheng Ge

Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.

Filling in the Blank: Rationale-Augmented Prompt Tuning for TextVQA

  • Gangyan Zeng
  • Yuan Zhang
  • Yu Zhou
  • Bo Fang
  • Guoqing Zhao
  • Xin Wei
  • Weiping Wang

Recently, generative Text-based visual question answering (TextVQA) methods, which are often based on language models, have exhibited impressive results and drawn increasing attention. However, due to the inconsistencies in both input forms and optimization objectives, the power of pretrained language models is not fully explored, resulting in the need for large amounts of training data. In this work, we rethink the characteristics of the TextVQA task and find that scene text is indeed a special kind of language embedded in images. To this end, we propose a text-centered generative framework FITB (stands for Filling In The Blank), in which multimodal information is mainly represented in textual form and rationale-augmented prompting is involved. Specifically, an infilling-based prompt strategy is utilized to formulate TextVQA as a novel problem of filling in the blank with proper scene text according to the language context. Furthermore, aiming to prevent the model from language bias overfitting, we design a rough answer grounding module to provide visual rationales for promoting multimodal reasoning. Extensive experiments verify the superiority of FITB in both fully-supervised and zero-shot/few-shot settings. Notably, even with a saving of about 64M data, FITB surpasses the state-of-the-art method by 3.00% and 1.99% on TextVQA and ST-VQA datasets, respectively.

End-to-end XY Separation for Single Image Blind Deblurring

  • Liuhan Chen
  • Yirou Wang
  • Yongyong Chen

Single image blind deblurring, which exploits only a blurry observation to reconstruct the sharp image, is a popular yet challenging low-level vision task. Current state-of-the-art deblurring networks mainly follow the coarse-to-fine strategy for architecture design and utilize U-net or its variant, XYDeblur, as the basic units. However, the one-encoder-one-decoder and the recently proposed one-encoder-two-decoder structures of basic units both fail to comprehensively take advantage of the directional separability of 2D deblurring, which increases the learning content of networks, thus leading to performance degradation. To thoroughly decouple the deblurring into two spatially orthogonal parts, we propose a novel substitution for U-net and its variant, called XYU-net. Specifically, it consists of two structurally identical U-nets, named XU-net and YU-net. They share orthogonal parameters by rotating kernels and focus on restoring a 2D blurry image in two spatially orthogonal directions respectively, which not only improves efficiency but also keeps the number of parameters unchanged. To further reduce the graphics memory demand of XYU-net, we transfer some non-linear transform modules (NLTM) from the outside of the network to its inside and propose the modified version, called MXYU-net. Experimental results on three large blurry image datasets demonstrate the efficiency of XYU-net and MXYU-net compared with U-net and XYDeblur, both as standalone models and as basic units of advanced U-net-based deblurring networks.
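
The idea of sharing parameters between two orthogonal branches by rotating kernels can be pictured with a tiny sketch: one stored kernel serves the horizontal branch, and its 90-degree rotation serves the vertical branch, so no extra weights are needed. The helper `rotate_kernel_90` and the plain list-of-rows kernel format are assumptions for illustration only, and this is not the XYU-net implementation.

```python
def rotate_kernel_90(kernel):
    """Rotate a 2D kernel (a list of rows) by 90 degrees counter-clockwise."""
    # Transpose with zip, then reverse the row order.
    return [list(row) for row in zip(*kernel)][::-1]


# A hypothetical 1x3 horizontal kernel, as an X-direction branch might hold.
x_kernel = [[1.0, 0.0, -1.0]]

# The Y-direction branch reuses the same three weights, laid out as a 3x1 column.
y_kernel = rotate_kernel_90(x_kernel)
```

Because `y_kernel` is derived from `x_kernel` rather than stored separately, the two branches together hold the same number of weights as one branch.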

SD-Net: Spatially-Disentangled Point Cloud Completion Network

  • Junxian Chen
  • Ying Liu
  • Yiqi Liang
  • Dandan Long
  • Xiaolin He
  • Ruihui Li

Point clouds obtained from 3D scanning are typically incomplete, noisy, and sparse. Previous completion methods aim to generate complete point clouds, while taking into account the densification of point clouds, filling small holes, and proximity-to-surface, all through a single network. After revisiting the task, we propose SD-Net, which disentangles the task based on the spatial characteristics of point clouds and formulates two sub-networks, a Dense Refiner and a Missing Generator. Given a partial input, the Dense Refiner produces a dense and clean point cloud, as a more reliable partial surface, which assists the Missing Generator to better infer the remaining point cloud structure. To promote the alignment and interaction across these two modules, we propose a Cross Fusion Unit with designed Non-Symmetrical Cross Transformers to capture geometric relationships between partial and missing regions, contributing to a complete, dense and well-aligned output. Extensive quantitative and qualitative results demonstrate that our method outperforms the state-of-the-art methods.

Latent-space Unfolding for MRI Reconstruction

  • Jiawei Jiang
  • Yuchao Feng
  • Jiacheng Chen
  • Dongyan Guo
  • Jianwei Zheng

To circumvent the problems caused by prolonged acquisition periods, compressed sensing MRI is widely used to accelerate the recovery of high-quality images from under-sampled k-space data. Most current solutions are dedicated to solving this issue by pursuing certain prior properties, yet the treatments are all enforced in the original space, resulting in limited feature information. To improve performance while guaranteeing running efficiency, in this work we propose a latent-space unfolding network (LsUNet). Specifically, by an elaborately designed reversible network, the inputs are first mapped to a channel-lifted latent space, which taps the potential of capturing spatial-invariant features sufficiently. Within the latent space, we then unfold an accelerated optimization algorithm to iterate an efficient and feasible solution, in which a parallel dual-domain update is equipped for better feature fusion. Finally, an inverse embedding transformation of the recovered high-dimensional representation is applied to achieve the expected estimation. LsUNet enjoys high interpretability due to the physically induced modules, which not only facilitates an intuitive understanding of the internal operating mechanism but also endows it with high generalization ability. Comprehensive experiments on different datasets and various sampling rates/patterns demonstrate the advantages of our proposal over the latest methods both visually and numerically.

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

  • Hongpeng Lin
  • Ludan Ruan
  • Wenke Xia
  • Peiyu Liu
  • Jingyuan Wen
  • Yixin Xu
  • Di Hu
  • Ruihua Song
  • Wayne Xin Zhao
  • Qin Jin
  • Zhiwu Lu

To facilitate research on intelligent and human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty in capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall. Furthermore, no existing model can solve all the above challenges well. There is still considerable room for future improvements, even for LLMs with visual extensions. Our dataset is available at

IGG: Improved Graph Generation for Domain Adaptive Object Detection

  • Pengteng Li
  • Ying He
  • F. Richard Yu
  • Pinhao Song
  • Dongfu Yin
  • Guang Zhou

Domain Adaptive Object Detection (DAOD) transfers an object detector from a labeled source domain to a novel unlabeled target domain. Recent works bridge the domain gap by aligning cross-domain pixel-pairs in the non-Euclidean graphical space and minimizing the domain discrepancy for adapting semantic distribution. Despite their success, these methods model graphs roughly with coarse semantic sampling, because they ignore non-informative noise and fail to concentrate on precise semantics alignment. Besides, the coarse graph generation inevitably contains abnormal nodes. These challenges result in biased domain adaptation. Therefore, we propose an Improved Graph Generation (IGG) framework which conducts high-quality graph generation for DAOD. Specifically, we design an Intensive Node Refinement (INR) module that reconstructs the noisy sampled nodes with a memory bank, and contrastively regularizes the noisy features. For better semantics alignment, we decouple the domain-specific style and category-invariant content encoded in graph covariance and selectively eliminate only the domain-specific style. Then, a Precision Graph Optimization (PGO) adaptor is proposed which utilizes variational inference to down-weight abnormal nodes. Comprehensive experiments on three adaptation benchmarks demonstrate that IGG achieves state-of-the-art results in unsupervised domain adaptation.

Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID

  • De Cheng
  • Lingfeng He
  • Nannan Wang
  • Shizhou Zhang
  • Zhen Wang
  • Xinbo Gao

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to match pedestrian images of the same identity from different modalities without annotations. Existing works mainly focus on alleviating the modality gap by aligning instance-level features of the unlabeled samples. However, the relationships between cross-modality clusters are not well explored. To this end, we propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters. Specifically, we design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM) algorithm through optimizing the maximum matching problem in a bipartite graph. Then, the matched pairwise clusters utilize shared visible and infrared pseudo-labels during the model training. Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at a cluster-level. Meanwhile, the cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the large modality discrepancy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art approaches by a large margin of 8.76% mAP on average.
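
To make the "maximum matching problem in a bipartite graph" concrete, here is a minimal sketch of the classic augmenting-path algorithm for one-to-one maximum bipartite matching. It is a textbook routine, not the many-to-many MBCCM algorithm described above; the framing of left nodes as visible clusters and right nodes as infrared clusters, and the example edges, are hypothetical.

```python
def maximum_bipartite_matching(adjacency):
    """adjacency[u] lists the right-side nodes that left node u may be matched to.

    Returns a dict mapping each matched left node to its right node.
    """
    match_of_right = {}  # right node -> left node currently matched to it

    def try_augment(u, visited):
        for v in adjacency[u]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if v not in match_of_right or try_augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    for u in range(len(adjacency)):
        try_augment(u, set())
    return {u: v for v, u in match_of_right.items()}


# Hypothetical example: 3 visible clusters, edges to sufficiently similar infrared clusters.
candidate_edges = [[0, 1], [0], [1, 2]]
matching = maximum_bipartite_matching(candidate_edges)
```

In a pipeline of this kind, matched cluster pairs would then share one pseudo-label; how edges are scored and how many-to-many matches are allowed are design choices this sketch leaves out.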

Faster Video Moment Retrieval with Point-Level Supervision

  • Xun Jiang
  • Zailei Zhou
  • Xing Xu
  • Yang Yang
  • Guoqing Wang
  • Heng Tao Shen

Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries. Existing VMR methods suffer from two defects: (1) massive expensive temporal annotations are required to obtain satisfying performance; (2) complicated cross-modal interaction modules are deployed, which lead to high computational cost and low efficiency for the retrieval process. To address these issues, we propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR), which balances the retrieval accuracy, efficiency, and annotation cost for VMR. Specifically, our proposed CFMR method learns from point-level supervision where each annotation is a single frame randomly located within the target moment. Such a labeling strategy is 6 times cheaper than the conventional annotation of event boundaries. Furthermore, we also design a concept-based multimodal alignment mechanism to bypass the usage of cross-modal interaction modules during the inference process, remarkably improving retrieval efficiency. The experimental results on three widely used VMR benchmarks demonstrate that our proposed CFMR method achieves superior comprehensive performance to current state-of-the-art methods. Moreover, it significantly accelerates retrieval, with more than 100 times fewer FLOPs than existing approaches with point-level supervision. Our open-source implementation is available at

IDDR-NGP: Incorporating Detectors for Distractors Removal with Instant Neural Radiance Field

  • Xianliang Huang
  • Jiajie Gou
  • Shuhang Chen
  • Zhizhou Zhong
  • Jihong Guan
  • Shuigeng Zhou

This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which could aggregate information from multi-view corrupted images. All of them can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support the research on distractors removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with the existing SOTA desnow methods and is capable of accurately removing both realistic and synthetic distractors.

G-PCC++: Enhanced Geometry-based Point Cloud Compression

  • Junzhe Zhang
  • Tong Chen
  • Dandan Ding
  • Zhan Ma

The MPEG Geometry-based Point Cloud Compression (G-PCC) standard is developed for lossy encoding of point clouds to enable immersive services over the Internet. However, lossy G-PCC introduces superimposed distortions from both geometry and attribute information, seriously deteriorating the Quality of Experience (QoE). This paper thus proposes the Enhanced G-PCC (G-PCC++) to effectively address the compression distortion and restore the quality. G-PCC++ separates the enhancement into two stages: it first enhances the geometry and then maps the decoded attribute to the enhanced geometry for refinement. As for geometry restoration, a k Nearest Neighbors (kNN)-based Linear Interpolation is first used to generate a denser geometry representation, on top of which GeoNet further generates sufficient candidates to restore geometry through probability-sorted selection. For attribute enhancement, a kNN-based Gaussian Distance Weighted Mapping is devised to re-colorize all points in the enhanced geometry tensor, which are then refined by AttNet for the final reconstruction. G-PCC++ is the first solution addressing the geometry and attribute artifacts together. Extensive experiments on several public datasets demonstrate the superiority of G-PCC++, e.g., on the solid point cloud dataset 8iVFB, G-PCC++ outperforms G-PCC by 88.24% (80.54%) BD-BR in D1 (D2) measurement of geometry and by 14.64% (13.09%) BD-BR in Y (YUV) attribute. Moreover, when considering both geometry and attribute, G-PCC++ also largely surpasses G-PCC by 25.58% BD-BR using PCQM assessment.
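
One plausible reading of a kNN-based Gaussian distance weighted mapping is sketched below: each point of the enhanced geometry takes a color that is the Gaussian-weighted average of the colors of its k nearest decoded points. The function name `knn_gaussian_recolor`, the brute-force neighbor search, and the parameters `k` and `sigma` are assumptions for illustration; the abstract does not give the exact formulation used in G-PCC++.

```python
import math


def knn_gaussian_recolor(decoded_points, decoded_colors, enhanced_points, k=3, sigma=1.0):
    """Assign each enhanced point a Gaussian-weighted mean color of its k nearest decoded points."""
    recolored = []
    for query in enhanced_points:
        # Brute-force squared distances; sort and keep the k nearest (distance, index) pairs.
        nearest = sorted(
            (sum((a - b) ** 2 for a, b in zip(query, point)), index)
            for index, point in enumerate(decoded_points)
        )[:k]
        weights = [math.exp(-d2 / (2.0 * sigma ** 2)) for d2, _ in nearest]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed: fall back to the single nearest neighbor's color.
            recolored.append(tuple(decoded_colors[nearest[0][1]]))
            continue
        channels = len(decoded_colors[0])
        recolored.append(tuple(
            sum(w * decoded_colors[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(channels)
        ))
    return recolored


# Hypothetical toy cloud: a new point halfway between a black and a gray decoded point.
colors = knn_gaussian_recolor(
    decoded_points=[(0, 0, 0), (2, 0, 0)],
    decoded_colors=[(0, 0, 0), (100, 100, 100)],
    enhanced_points=[(1, 0, 0)],
    k=2,
)
```

A smaller `sigma` makes the mapping behave more like nearest-neighbor copying, while a larger one smooths colors across neighbors.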

Gradient-Free Textual Inversion

  • Zhengcong Fei
  • Mingyuan Fan
  • Junshi Huang

Recent works on personalized text-to-image generation usually learn to bind a special token with specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to question whether we can optimize the textual inversions by only accessing the process of model inference. Requiring only forward computation to determine the textual inversion retains the benefits of less GPU memory, simple deployment, and secure access for scalable models. In this paper, we introduce a gradient-free framework to optimize the continuous textual inversion with an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion with the consideration of visual and text vocabulary information. Then, we decompose the optimization of the evolutionary strategy into dimension reduction of the search space and non-convex gradient-free optimization in the subspace, which significantly accelerates the optimization process with negligible performance loss. Experiments in several creative applications demonstrate that the performance of a text-to-image model equipped with our proposed gradient-free method is comparable to that of gradient-based counterparts, while offering support for various GPU/CPU platforms, flexible deployment, and computational efficiency.
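
The two ingredients named above, reducing the search space and then optimizing without gradients inside the subspace, can be sketched in a few lines. The example below uses a fixed random projection and a very simple elitist (1+1)-style evolution strategy; it is a stand-in for illustration, not the authors' optimizer, and `subspace_es`, `toy_loss`, the projection scaling, and all hyperparameters are assumptions. Only forward evaluations of the loss are used.

```python
import math
import random


def subspace_es(loss, dim, sub_dim=4, evaluations=2000, sigma=0.3, seed=0):
    """Minimize loss(embedding) by searching a low-dimensional vector z, with embedding = A z."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(sub_dim)
    # Fixed random projection A of shape (dim, sub_dim); it is never trained.
    projection = [[rng.gauss(0.0, scale) for _ in range(sub_dim)] for _ in range(dim)]

    def embed(z):
        return [sum(a * zi for a, zi in zip(row, z)) for row in projection]

    best_z = [0.0] * sub_dim
    best_loss = loss(embed(best_z))
    for _ in range(evaluations):
        # Mutate the current best point and keep the candidate only if it improves the loss.
        candidate = [zi + sigma * rng.gauss(0.0, 1.0) for zi in best_z]
        candidate_loss = loss(embed(candidate))
        if candidate_loss < best_loss:
            best_z, best_loss = candidate, candidate_loss
    return embed(best_z), best_loss


# Hypothetical black-box objective: squared distance to a target embedding.
target = [1.0] * 16


def toy_loss(embedding):
    return sum((e - t) ** 2 for e, t in zip(embedding, target))


embedding, final_loss = subspace_es(toy_loss, dim=16)
```

Searching 4 coordinates instead of 16 keeps each mutation cheap, at the cost that the optimum is restricted to the span of the random projection.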

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

  • Qiaosong Qi
  • Le Zhuo
  • Aixi Zhang
  • Yue Liao
  • Fei Fang
  • Si Liu
  • Shuicheng Yan

When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods.

Video Inverse Tone Mapping Network with Luma and Chroma Mapping

  • Peihuan Huang
  • Gaofeng Cao
  • Fei Zhou
  • Guoping Qiu

With the popularity of consumer high dynamic range (HDR) display devices, video inverse tone mapping (iTM) has become a research hotspot. However, existing methods are designed around perceptually non-uniform color spaces (e.g., RGB and YCbCr), resulting in limited quality of the HDR video rendered by these methods. Considering the two key factors involved in the video iTM task, luma and chroma, in this paper we design an ICtCp color space based video iTM model, which reproduces high-quality HDR video by processing luma and chroma information. Benefitting from the decorrelated perception of luma and chroma in the ICtCp color space, two global mapping networks (INet and TPNet) are developed to enhance the luma and chroma pixels, respectively. However, luma and chroma mapping in the iTM task may be affected by color appearance phenomena. Thus, a luma-chroma adaptation transform network (LCATNet) is proposed to process the luma and chroma pixels affected by color appearance phenomena, which can complement the local details of the globally enhanced luma and chroma pixels. In the LCATNet, either the luma mapping or the chroma mapping is adaptively adjusted according to both the luma and the chroma information. Besides, benefitting from the perceptually consistent property of the ICtCp color space, the same pixel errors draw equal attention from the model during training. Thus, the proposed model can correctly render luma and chroma information without highlighting special regions or designing special training losses. Extensive experimental results demonstrate the effectiveness of the proposed model.

Learning Pixel-wise Alignment for Unsupervised Image Stitching

  • Qi Jia
  • Xiaomei Feng
  • Yu Liu
  • Xin Fan
  • Longin Jan Latecki

Image stitching aims to align a pair of images in the same view. Generating precise alignment with natural structures is challenging for image stitching, as there is no wider field-of-view image as a reference, especially in non-coplanar practical scenarios. In this paper, we propose an unsupervised image stitching framework that breaks through the coplanar constraints in homography estimation and yields accurate pixel-wise alignment under limited overlapping regions. First, we generate a global transformation through iterative dense feature matching combined with an error control strategy to alleviate the difference introduced by large parallax. Second, we propose a pixel-wise warping network embedded within a large-scale feature extractor and a correlative feature enhancement module to explicitly learn correspondences between the inputs, and generate accurate pixel-level offsets upon novel constraints on both overlapping and non-overlapping regions. Notably, we leverage the pixel-level offsets in the overlapping area to guide the adjustment in the non-overlapping area upon content and structure consistency constraints, rendering a natural transition between the two regions and suppressing distortion over the entire stitched image. The proposed method achieves state-of-the-art performance that surpasses both traditional and deep learning approaches by a large margin. It also achieves the shortest execution time and has the best generalization ability on the traditional dataset.

FashionDiff: A Controllable Diffusion Model Using Pairwise Fashion Elements for Intelligent Design

  • Han Yan
  • Haijun Zhang
  • Xiangyu Mu
  • Jicong Fan
  • Zhao Zhang

The process of fashion design involves creative expression through various methods, including sketch drawing, brush painting, and choices of textures and colors, all of which are employed to characterize the originality and uniqueness of the designed fashion items. Despite recent advances in intelligence-driven fashion design, the complexity of the diverse elements of a fashion item, such as its texture, color and shape, which are associated with the semantic information conveyed, continues to present challenges in terms of generating high-quality fashion images as well as achieving a controllable editing process. To address this issue, we propose a unified framework, FashionDiff, that leverages the diverse elements in fashion items to generate new items. Initially, we collected a large number of fashion images with multiple categories and created pairwise data in terms of sketch and additional data, such as brush areas, textures, or colors. To eliminate semantic discrepancies between these pairwise datasets, we introduce a feature modulation fusion (FMFusion) process, which enables interactive communication among different images, allowing them to be fused into latent spaces characterized by different resolutions. In order to produce high-quality editable fashion images, we develop a generator based on a state-of-the-art diffusion model called FD-ControlNet, which integrates latent spaces into different layers of the generator to generate ready-to-wear fashion items. Qualitative and quantitative experimental results demonstrate the effectiveness of our proposed method, and suggest that our model can offer flexible control over the generated images in terms of sketches, brush areas, textures, and colors.

Learning Non-Uniform-Sampling for Ultra-High-Definition Image Enhancement

  • Wei Yu
  • Qi Zhu
  • Naishan Zheng
  • Jie Huang
  • Man Zhou
  • Feng Zhao

Ultra-high-definition (UHD) image enhancement is a challenging problem that aims to effectively and efficiently recover clean UHD images. To maintain efficiency, the straightforward approach is to downsample and perform most computations on low-resolution images. However, previous studies typically rely on the uniform and content-agnostic downsampling method that equally treats various regions regardless of their complexities, thus limiting the detail reconstruction in UHD image enhancement. To alleviate this issue, we propose a novel spatial-variant and invertible non-uniform downsampler that adaptively adjusts the sampling rate according to the richness of details. It magnifies important regions to preserve more information (e.g., sparse sampling points for sky, dense sampling points for buildings). Therefore, we propose a novel Non-uniform-Sampling Enhancement Network (NSEN) consisting of two core designs: 1) content-guided downsampling that extracts texture representation to guide the sampler to perform content-aware downsampling for producing detail-preserved low-resolution images; 2) invertible pixel-alignment which remaps the forward sampling process in an iterative manner to eliminate the deformations caused by the non-uniform downsampling, thus producing detail-rich clean UHD images. To demonstrate the superiority of our proposed model, we conduct extensive experiments on various UHD enhancement tasks. The results show that the proposed NSEN yields better performance against other state-of-the-art methods both visually and quantitatively.
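The contrast between uniform and content-aware downsampling can be shown with a one-dimensional sketch: sampling positions are placed at evenly spaced quantiles of a cumulative importance curve, so detailed regions receive more samples. This is only an illustration of non-uniform sampling in general; the hand-written `importance` values stand in for a learned texture representation, and the sketch is neither learned nor invertible as NSEN is described to be.

```python
def nonuniform_positions(importance, n_samples):
    """Pick n_samples indices along a 1D signal, denser where importance is high."""
    total = float(sum(importance))
    cdf = []
    acc = 0.0
    for w in importance:
        acc += w / total
        cdf.append(acc)
    positions = []
    for k in range(n_samples):
        q = (k + 0.5) / n_samples  # evenly spaced quantiles in (0, 1)
        # First index whose cumulative importance reaches the quantile;
        # fall back to the last index to guard against rounding.
        idx = next((i for i, c in enumerate(cdf) if c >= q - 1e-12), len(cdf) - 1)
        positions.append(idx)
    return positions
```

With low importance on a flat region (such as sky) and high importance on a detailed region (such as buildings), most of the returned indices fall in the detailed region.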

Hierarchical Dynamic Image Harmonization

  • Haoxing Chen
  • Zhangxuan Gu
  • Yaohui Li
  • Jun Lan
  • Changhua Meng
  • Weiqiang Wang
  • Huaxiong Li

Image harmonization is a critical task in computer vision, which aims to adjust the foreground to make it compatible with the background. Recent works mainly focus on using global transformations (i.e., normalization and color curve rendering) to achieve visual consistency. However, these models ignore local visual consistency, and their huge model sizes limit their harmonization ability on edge devices. In this paper, we propose a hierarchical dynamic network (HDNet) to adapt features from local to global view for better feature transformation in efficient image harmonization. Inspired by the success of various dynamic models, we propose a local dynamic (LD) module and a mask-aware global dynamic (MGD) module. Specifically, LD matches local representations between the foreground and background regions based on semantic similarities, then adaptively adjusts every foreground local representation according to the appearance of its K-nearest neighbor background regions. In this way, LD can produce more realistic images at a more fine-grained level while also benefiting from semantic alignment. The MGD effectively applies distinct convolution to the foreground and background, learning the representations of foreground and background regions as well as their correlations to the global harmonization, facilitating local visual consistency for the images much more efficiently. Experimental results demonstrate that the proposed HDNet significantly reduces the total model parameters by more than 80% compared to previous methods, while still attaining state-of-the-art performance on the popular iHarmony4 dataset. Additionally, we introduce a lightweight version of HDNet, i.e., HDNet-lite, which has only 0.65MB parameters yet still achieves competitive performance. Our code is available at

Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

  • Sha Guo
  • Zhuo Chen
  • Yang Zhao
  • Ning Zhang
  • Xiaotong Li
  • Lingyu Duan

Traditional image codecs prioritize signal fidelity and human perception, often neglecting machine vision tasks. Deep learning approaches have shown promising coding performance by leveraging rich semantic embeddings that can be optimized for both human and machine vision. However, these compact embeddings struggle to represent low-level details like contours and textures, leading to imperfect reconstructions. Additionally, existing learning-based coding tools lack scalability. To address these challenges, this paper presents a content-adaptive diffusion model for scalable image compression. The method encodes accurate texture through a diffusion process, enhancing human perception while preserving important features for machine vision tasks. It employs a Markov palette diffusion model with commonly-used feature extractors and image generators, enabling efficient data compression. By utilizing collaborative texture-semantic feature extraction and pseudo-label generation, the approach accurately learns texture information. A content-adaptive Markov palette diffusion model is then applied to capture both low-level texture and high-level semantic knowledge in a scalable manner. This framework enables elegant compression ratio control by flexibly selecting intermediate diffusion states, eliminating the need for deep learning model re-training at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection. It achieves superior perceptual quality scores compared to state-of-the-art methods.

Towards Decision-based Sparse Attacks on Video Recognition

  • Kaixun Jiang
  • Zhaoyu Chen
  • Xinyu Zhou
  • Jingyu Zhang
  • Lingyi Hong
  • JiaFeng Wang
  • Bo Li
  • Yan Wang
  • Wenqiang Zhang

Recent studies indicate that sparse attacks, which modify only a small set of pixels in the input under an l0 norm constraint, threaten the security of deep learning models. While existing research has primarily focused on sparse attacks against image models, there is a notable gap in evaluating the robustness of video recognition models. To bridge this gap, we are the first to study sparse video attacks and propose an attack framework named V-DSA in the most challenging decision-based setting, in which threat models only return the predicted hard label. Specifically, V-DSA comprises two modules: a Cross-Modal Generator (CMG) for query-free transfer attacks on each frame and an Optical flow Grouping Evolution algorithm (OGE) for query-efficient spatial-temporal attacks. CMG passes each frame to generate the transfer video as the starting point of the attack based on the feature similarity between image classification and video recognition models. OGE first initializes populations based on the transfer video and then leverages optical flow to establish the temporal connection of the perturbed pixels in each frame, which can reduce the parameter space and break the temporal relationship between frames specifically. Finally, OGE complements the above optical flow modeling with grouping evolution, which realizes a coarse-to-fine attack that avoids falling into local optima. In addition, OGE gives the perturbation temporal coherence while balancing the number of perturbed pixels per frame, further increasing the imperceptibility of the attack. Extensive experiments demonstrate that V-DSA achieves state-of-the-art performance in terms of both threat effectiveness and imperceptibility. We hope V-DSA can provide valuable insights into the security of video recognition systems.

RAIRNet: Region-Aware Identity Rectification for Face Forgery Detection

  • Mingqi Fang
  • Lingyun Yu
  • Hongtao Xie
  • Junqiang Wu
  • Zezheng Wang
  • Jiahong Li
  • Yongdong Zhang

The malicious usage of facial manipulation techniques has increased the demand for face forgery detection research. Recently, identity-based approaches have attracted much attention due to the effective observation of identity inconsistency. However, there are still several non-negligible problems: (1) the generic identity extractor is trained entirely on real images, leading to enormous identity representation bias when processing forged content; (2) the identity information of a forged image is hybrid and presents a regional distribution, while a single global identity feature can hardly reflect this local identity inconsistency. To solve the above problems, in this paper a novel Region-Aware Identity Rectification Network (RAIRNet) is proposed to effectively rectify the identity bias and adaptively exploit the locally inconsistent regions. Firstly, for the identity bias problem, our RAIRNet is devised in a two-branch architecture, which consists of a Generic Identity Extractor (GIE) branch and a Bias Diminishing Module (BDM) branch. The BDM branch is designed to rectify the bias introduced by the GIE branch through a prototype-based training schema. This two-branch architecture effectively helps the model adapt to forged content while maintaining the focus on identity space. Secondly, to exploit local identity inconsistency, a novel Meta Identity Filter Generator (MIFG) is devised in a meta-learning way to generate the region-aware filter based on the identity prior. This region-aware filter can adaptively exploit the local inconsistency clues and activate the discriminative local region. Moreover, to balance the local-global information and highlight the forensic clues, an Adaptive Weight Assignment Mechanism (AWAM) is proposed to assign adaptive importance weights to the two branches. Extensive experiments on various datasets show the superiority of our RAIRNet. In particular, on the challenging DFDCp dataset, our approach outperforms previous binary-based and identity-based methods by 10.3% and 5.5% respectively.

Multispectral Object Detection via Cross-Modal Conflict-Aware Learning

  • Xiao He
  • Chang Tang
  • Xin Zou
  • Wei Zhang

Multispectral object detection has gained significant attention due to its potential in all-weather applications, particularly those involving visible (RGB) and infrared (IR) images. Despite substantial advancements in this domain, current methodologies primarily rely on rudimentary accumulation operations to combine complementary information from disparate modalities, overlooking the semantic conflicts that arise from the intrinsic heterogeneity among modalities. To address this issue, we propose a novel learning network, the Cross-modal Conflict-Aware Learning Network (CALNet), that takes into account semantic conflicts and complementary information within multi-modal input. Our network comprises two pivotal modules: the Cross-Modal Conflict Rectification Module (CCR) and the Selected Cross-modal Fusion (SCF) Module. The CCR module mitigates modal heterogeneity by examining contextual information of analogous pixels, thus alleviating multi-modal information with semantic conflicts. Subsequently, semantically coherent information is supplied to the SCF module, which fuses multi-modal features by assessing intra-modal importance to select semantically rich features and mining inter-modal complementary information. To assess the effectiveness of our proposed method, we develop a two-stream one-stage detector based on CALNet for multispectral object detection. Comprehensive experimental outcomes demonstrate that our approach considerably outperforms existing methods in resolving the cross-modal semantic conflict issue and achieving state-of-the-art accuracy in detection results.

Decoupled Cross-Scale Cross-View Interaction for Stereo Image Enhancement in the Dark

  • Huan Zheng
  • Zhao Zhang
  • Jicong Fan
  • Richang Hong
  • Yi Yang
  • Shuicheng Yan

Low-light stereo image enhancement (LLSIE) aims at improving the visual quality of stereo images captured in dark conditions. However, existing methods have shown limited success in detail recovery and illumination adjustment. This can be attributed to two main factors: 1) insufficient single-scale inter-view interaction hinders the exploitation of valuable cross-view cues; 2) lacking long-range dependency leads to the inability to deal with the spatial long-range effects caused by illumination degradation. To address these limitations, we propose a novel LLSIE model named Decoupled Cross-Scale Cross-View Interaction Network (DCI-Net). Our model introduces a key component called the Decoupled Interaction Module (DIM) designed to promote sufficient dual-view information exchange. DIM decouples the dual-view information interaction by discovering multi-scale cross-view correlations and further exploring cross-scale information flow. Furthermore, we present Spatial-channel Information Mining Block (SIMB) for intra-view feature extraction, and the benefits are twofold. One is long-range dependency capture to build spatial long-range relationship, and the other is expanded channel information refinement that enhances information flow in the channel dimension. Extensive experiments conducted on Flickr1024, KITTI 2012, KITTI 2015, and Middlebury datasets show that our method obtains better illumination adjustment and detail recovery, and achieves SOTA performance compared to other related methods.

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

  • Kexin Li
  • Zongxin Yang
  • Lei Chen
  • Yi Yang
  • Jun Xiao

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block which, when stacked, captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving new SOTA performance on all three datasets using two backbones. The code is available at

S-OmniMVS: Incorporating Sphere Geometry into Omnidirectional Stereo Matching

  • Zisong Chen
  • Chunyu Lin
  • Lang Nie
  • Zhijie Shen
  • Kang Liao
  • Yuanzhouhan Cao
  • Yao Zhao

Multi-fisheye stereo matching is a promising task that employs the traditional multi-view stereo (MVS) pipeline with spherical sweeping to acquire omnidirectional depth. However, the existing omnidirectional MVS technologies neglect fisheye and omnidirectional distortions, yielding inferior performance. In this paper, we revisit omnidirectional MVS by incorporating three sphere geometry priors: spherical projection, spherical continuity, and spherical position. To deal with fisheye distortion, we propose a new distortion-adaptive fusion module to convert fisheye inputs into distortion-free spherical tangent representations by constructing a spherical projection space. Then these multi-scale features are adaptively aggregated with additional learnable offsets to enhance content perception. To handle omnidirectional distortion, we present a new spherical cost aggregation module with a comprehensive consideration of the spherical continuity and position. Concretely, we first design a rotation continuity compensation mechanism to ensure omnidirectional depth consistency of left-right boundaries without introducing extra computation. On the other hand, we encode the geometry-aware spherical position and push them into the cost aggregation to relieve panoramic distortion and perceive the 3D structure. Furthermore, to avoid the excessive concentration of depth hypothesis caused by inverse depth linear sampling, we develop a segmented sampling strategy that combines linear and exponential spaces to create S-OmniMVS, along with three sphere priors. Extensive experiments demonstrate the proposed method outperforms the state-of-the-art (SoTA) solutions by a large margin on various datasets both quantitatively and qualitatively.

Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

  • Yichen Zhang
  • Yifang Yin
  • Ying Zhang
  • Zhenguang Liu
  • Zheng Wang
  • Roger Zimmermann

Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images that were collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously on domain level with early alignment and class level with supervised contrastive learning, which endows model training and knowledge transfer with stronger robustness. The empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms the state-of-the-art cervical dysplasia visual inspection by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.
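One simple way to picture prototype-based filtering of cross-domain samples is sketched below: class prototypes are the mean feature vectors of the target dataset, and a cross-domain sample is scored by its cosine similarity to the prototype of its own class. The abstract does not give the actual transferability estimate, so this scoring rule and all function names are assumptions for illustration.

```python
import math

def mean_vector(vectors):
    # Element-wise mean of equally sized feature vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_prototypes(features, labels):
    """Average the target-domain features of each class into one prototype."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    return {y: mean_vector(fs) for y, fs in groups.items()}

def transferability(sample, label, prototypes):
    """Hypothetical score: similarity of a cross-domain sample to its class prototype."""
    return cosine(sample, prototypes[label])
```

Samples with low scores could then be down-weighted or dropped before knowledge transfer; the threshold would have to be chosen empirically.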

When Measures are Unreliable: Imperceptible Adversarial Perturbations toward Top-k Multi-Label Learning

  • Yuchen Sun
  • Qianqian Xu
  • Zitai Wang
  • Qingming Huang

With the great success of deep neural networks, adversarial learning has received widespread attention in various studies, ranging from multi-class learning to multi-label learning. However, existing adversarial attacks toward multi-label learning only pursue the traditional visual imperceptibility but ignore the new perceptible problem coming from measures such as Precision@k and mAP@k. Specifically, when a well-trained multi-label classifier performs far below the expectation on some samples, the victim can easily realize that this performance degeneration stems from attack, rather than the model itself. Therefore, an ideal multi-labeling adversarial attack should manage to not only deceive visual perception but also evade monitoring of measures. To this end, this paper first proposes the concept of measure imperceptibility. Then, a novel loss function is devised to generate such adversarial perturbations that could achieve both visual and measure imperceptibility. Furthermore, an efficient algorithm, which enjoys a convex objective, is established to optimize this objective. Finally, extensive experiments on large-scale benchmark datasets, such as PASCAL VOC 2012, MS COCO, and NUS WIDE, demonstrate the superiority of our proposed method in attacking the top-k multi-label systems.
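For reference, the Precision@k measure mentioned above can be computed as follows. The function and argument names are illustrative; `relevant` is assumed to be the set of ground-truth label indices for one sample.

```python
def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are ground-truth labels."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in top_k if i in relevant)
    return hits / k
```

A defender who monitors this value per sample would notice an attack that drives it far below the usual level, which is the motivation for keeping the measure itself close to normal.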

Karma: Adaptive Video Streaming via Causal Sequence Modeling

  • Bowei Xu
  • Hao Chen
  • Zhan Ma

Optimal adaptive bitrate (ABR) decision depends on a comprehensive characterization of state transitions that involve interrelated modalities over time including environmental observations, returns, and actions. However, state-of-the-art learning-based ABR algorithms solely rely on past observations to decide the next action. This paradigm tends to cause a chain of deviations from optimal action when encountering unfamiliar observations, which consequently undermines the model generalization.

This paper presents Karma, an ABR algorithm that utilizes causal sequence modeling to improve generalization by comprehending the interrelated causality among past observations, returns, and actions, and by promptly refining actions when deviations occur. Unlike direct observation-to-action mapping, Karma recurrently maintains a multi-dimensional time series of observations, returns, and actions as input and employs causal sequence modeling via a decision transformer to determine the next action. In the input sequence, Karma uses the maximum cumulative future quality of experience (QoE) (a.k.a. QoE-to-go) as an extended return signal, which is periodically estimated based on current network conditions and playback status. We evaluate Karma through trace-driven simulations and real-world field tests, demonstrating superior performance compared to existing state-of-the-art ABR algorithms, with an average QoE improvement ranging from 10.8% to 18.7% across diverse network conditions. Furthermore, Karma exhibits strong generalization capabilities, showing leading performance under unseen networks in both simulations and real-world tests.
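The input layout used by decision-transformer-style models can be sketched as follows: a return-to-go value is attached to every step and interleaved with the observation and action. In this sketch the QoE-to-go is simply the sum of known per-chunk QoE values from the current step onward, which is an assumption that holds only for logged traces; Karma is described as estimating the maximum future QoE online. The token tags and function names are hypothetical.

```python
def qoe_to_go(chunk_qoe):
    """Suffix sums: total QoE from each step to the end of the session."""
    out = [0.0] * len(chunk_qoe)
    running = 0.0
    for t in range(len(chunk_qoe) - 1, -1, -1):
        running += chunk_qoe[t]
        out[t] = running
    return out

def build_sequence(returns, observations, actions):
    """Interleave (return, observation, action) triples in time order."""
    seq = []
    for r, o, a in zip(returns, observations, actions):
        seq.extend([("return", r), ("obs", o), ("action", a)])
    return seq
```

A causal sequence model would read this interleaved sequence and predict the next action token, here the bitrate for the next chunk.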

Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data

  • Xinting Liao
  • Chaochao Chen
  • Weiming Liu
  • Pengyang Zhou
  • Huabin Zhu
  • Shuheng Shen
  • Weiqiang Wang
  • Mengling Hu
  • Yanchao Tan
  • Xiaolin Zheng

Federated learning (FL) is a distributed machine learning paradigm that needs collaboration between a server and a series of clients with decentralized data. To make FL effective in real-world applications, existing work is devoted to improving the modeling of decentralized non-IID data. In non-IID settings, there is intra-client inconsistency that comes from imbalanced data modeling, and inter-client inconsistency among heterogeneous client distributions, which not only hinder sufficient representation of the minority data but also bring discrepant model deviations. However, previous work fails to tackle these two coupled inconsistencies together. In this work, we propose FedRANE, which consists of two main modules, i.e., local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. Specifically, in each client, LRA mines the similarity relations among different data samples and enhances the minority sample representations with their neighbors using attentive message passing. On the server, GNE reaches an agreement among inconsistent and discrepant model deviations from clients to server, which encourages the global model to update in the direction of the global optimum without breaking down the clients' optimization toward their local optimums. We conduct extensive experiments on four benchmark datasets to show the superiority of FedRANE in enhancing the performance of FL with non-IID data.

SSPU-Net: A Structure Sensitive Point Cloud Upsampling Network with Multi-Scale Spatial Refinement

  • Jin Wang
  • Jiade Chen
  • Yunhui Shi
  • Nam Ling
  • Baocai Yin

Point cloud upsampling aims to generate a dense and uniform point set from a sparse and irregular point set. The core challenge is to accurately restore the geometric structure and local details. To overcome the challenge, this paper presents a novel frequency-aware attention based point cloud upsampling approach, which combines graph filtering and channel attention based on the detection of high spatial-frequency components like edges and contours in the human visual system. To aggregate the features more efficiently, an intra-feature and inter-feature (I2-feature) aggregation block and a structure sensitive transformer block are introduced. On one hand, the I2-feature aggregation block serves to create a complete local representation of each point by aggregating intra and inter features. On the other hand, the structure sensitive transformer block aims to enhance the quality of the expanded point features by capturing the global geometric structures and the fine local details. Furthermore, to improve the quality of the coarse output, a multi-scale spatial refinement unit is applied, which leverages attentional feature fusion and multi-scale attention. Extensive qualitative and quantitative results on both synthetic and real-scanned datasets validate that our proposed scheme outperforms state-of-the-art point cloud upsampling methods.

On Physically Occluded Fake Identity Document Detection

  • Haoyue Wang
  • Sheng Li
  • Silu Cao
  • Rui Yang
  • Jishen Zeng
  • Zhenxing Qian
  • Xinpeng Zhang

Many online applications require users to upload their identity documents for authentication. Fake identity documents are one of the main threats compromising the security and reliability of such online applications. Existing techniques focus on the detection of digitally forged identity documents and neglect the impact of physical forgeries. In this paper, we look into the problem of detecting physically occluded fake identity documents, which can be easily generated without any image processing knowledge. We observe that physical occlusions inevitably produce occluded boundaries on the document. To exploit this observation, we propose an Occluded Boundary Representation Learning (OBRL) module to progressively learn the occluded boundary features. These are then fed into an Occluded Boundary Message Passing (OBMP) module to effectively diffuse the physical occlusion traces to enhance the backbone features for robust detection. We construct a new Physically Occluded Fake ID Card image dataset (POID) for evaluation. Various experiments are conducted on POID, where our scheme achieves 99.6% accuracy in detecting physically occluded fake ID card images, with a mAP of over 85% for localizing the occlusion regions.

Dynamic View Synthesis with Spatio-Temporal Feature Warping from Sparse Views

  • Deqi Li
  • Shi-Sheng Huang
  • Tianyu Shen
  • Hua Huang

Significant progress has been made in realizing novel view synthesis of dynamic scenes from sparse input views. However, achieving spatio-temporal consistency in dynamic view synthesis remains challenging for previous approaches, since the spatio-temporal correlation for view synthesis has not been fully explored. In this paper, we propose a spatio-temporal feature warping (STFW) mechanism, which can be embedded into a deep model to produce high-quality and spatio-temporally consistent view synthesis results. The two core components of STFW are: (1) a spatial feature warping (SFW) module, which enables adaptive perception of multi-view context-consistent geometric information with a compact point cloud representation, and (2) a temporal feature warping (TFW) module that implicitly models the dynamic geometry by approximating the pixel shift in image coordinates. In the optimization process of view synthesis, the SFW and TFW are integrated to exploit the spatio-temporal correlation cues across sparse input views and novel views. Leveraging the STFW, we further build an end-to-end dynamic view synthesis model with sparse input views. Qualitative and quantitative evaluations on public multi-view datasets demonstrate that our view synthesis pipeline achieves better performance than previous methods in terms of visual quality.

SESSION: Oral Session IX: Engaging Users with Multimedia -- Social-good, Fairness and Transparency

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

  • Shengfang Zhai
  • Yinpeng Dong
  • Qingni Shen
  • Shi Pu
  • Yuejian Fang
  • Hang Su

With the help of conditioning mechanisms, state-of-the-art diffusion models have achieved tremendous success in guided image generation, particularly in text-to-image synthesis. To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attacks on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis at diverse semantic levels. Specifically, we perform backdoor attacks on three levels of vision semantics: Pixel-Backdoor, Object-Backdoor and Style-Backdoor. By utilizing a regularization loss, our methods efficiently inject backdoors into a large-scale text-to-image diffusion model while preserving its utility with benign inputs. We conduct empirical experiments on Stable Diffusion, a widely used text-to-image diffusion model, demonstrating that a large-scale diffusion model can be easily backdoored within a few fine-tuning steps. We conduct additional experiments to explore the impact of different types of textual triggers, as well as the backdoor persistence during further training, providing insights for the development of backdoor defense methods. In addition, our investigation may contribute to the copyright protection of text-to-image models in the future. Our Code:

Deep Neural Network Watermarking against Model Extraction Attack

  • Jingxuan Tan
  • Nan Zhong
  • Zhenxing Qian
  • Xinpeng Zhang
  • Sheng Li

Deep neural network (DNN) watermarking is an emerging technique to protect the intellectual property of deep learning models. At present, many DNN watermarking algorithms have been proposed to achieve provenance verification by embedding identity information into the internals or prediction behaviors of the host model. However, most methods are vulnerable to model extraction attacks, where attackers collect output labels from the model to train a surrogate or a replica. To address this issue, we present a novel DNN watermarking approach, named SSW, which constructs an adaptive trigger set progressively by optimizing over a pair of symmetric shadow models to enhance the robustness to model extraction. Precisely, we train a positive shadow model supervised by the predictions of the host model to mimic the behaviors of potential surrogate models. Additionally, a negative shadow model is trained normally to imitate irrelevant independent models. Using this pair of shadow models as a reference, we design a strategy to update the trigger samples appropriately such that they tend to persist in the host model and its stolen copies. Moreover, our method supports two specific embedding schemes: embedding the watermark via fine-tuning or from scratch. Our extensive experimental results on popular datasets demonstrate that our SSW approach outperforms state-of-the-art methods against various model extraction attacks under both trigger-set classification accuracy-based and hypothesis-test-based verification. The results also show that our method is robust to common model modification schemes including fine-tuning and model compression.

CoCa: A Connectivity-Aware Cascade Framework for Histology Gland Segmentation

  • Yu Bai
  • Bo Zhang
  • Zheng Zhang
  • Wu Liu
  • Jinwen Li
  • Xiangyang Gong
  • Wendong Wang

Gland segmentation is crucial for computer-aided diagnosis of adenocarcinoma. However, Topologically Critical Areas (TCAs), such as background tissues between two adjacent glands, can easily cause under- or over-connection of gland topological structures that may lead to the opposite diagnosis of the malignancy degree. Therefore, we provide a novel perspective for gland segmentation by incorporating gland connectivity information to locate critical errors within TCAs. We propose a Connectivity-Aware Cascade framework (CoCa) that explicitly encodes gland connectivity information into the network to locate all connectivity errors during training and then leverages attention operations to focus on these errors. Since under- or over-connected glands can change the Betti number (e.g., number of connected components) of glands, we design a Connectivity Refinement Module (CRM) to compare the Betti number of each gland to locate connectivity errors. We propose CoCa-Net to mine the topological relations among different biomedical entities to guide gland prediction. We also use contrastive learning to separate pixel embeddings of different classes within TCAs through our connectivity-aware hard example sampling strategy. Extensive experiments on the GlaS and CRAG datasets demonstrate the effectiveness of CoCa over state-of-the-art methods.
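The role of the Betti number in the abstract above can be illustrated with a small sketch. It counts connected foreground components (the zeroth Betti number) in a predicted and a reference binary mask and compares them. The function names, the 4-connectivity choice, and the whole-image comparison are assumptions made for illustration; the paper's CRM compares Betti numbers per gland and is not reproduced here.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground components (Betti number b0) in a binary grid."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New component found: flood-fill it with breadth-first search.
                components += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


def connectivity_error(pred, truth):
    """Negative: glands merged (over-connection). Positive: glands split (under-connection)."""
    return count_components(pred) - count_components(truth)


# Hypothetical toy masks: two glands separated by one background column.
truth = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# The prediction fills one background pixel and merges the two glands.
pred = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
error = connectivity_error(pred, truth)
```

A single wrong pixel in the gap changes the component count from two to one, which is why errors inside TCAs matter far more than their pixel count suggests.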

Factorized Omnidirectional Representation based Vision GNN for Anisotropic 3D Multimodal MR Image Segmentation

  • Bo Zhang
  • YunPeng Tan
  • Zheng Zhang
  • Wu Liu
  • Hui Gao
  • Zhijun Xi
  • Wendong Wang

Anisotropy arises due to the influence of scanning equipment and parameters, resulting in a distance between slices that is often much greater than the actual distance represented by a single pixel within each slice. This can lead to inefficiency or ineffectiveness in 3D convolution. To address the anisotropy issue, we propose FOrViG, an asymmetric vision graph neural network (GNN) framework that captures the correlation between different slices by constructing a graph for multi-slice images and aggregating information from adjacent nodes. This allows FOrViG to efficiently extract 3D spatial scale information and effectively identify feature nodes associated with small lesions, thereby improving the accuracy of lesion segmentation on anisotropic 3D multimodal MR images. As far as we know, this is the first study that adopts a GNN to address anisotropy issues. We also design a factorized omnidirectional representation method and a supervised multi-perspective contrastive learning strategy to enhance the capability of FOrViG in learning multi-scale omnidirectional presentation information, graphics construction, and distinguishing foreground from background. Extensive experiments on the PI-CAI dataset demonstrate that FOrViG significantly outperforms several state-of-the-art 3D segmentation algorithms.

Echoes: Unsupervised Debiasing via Pseudo-bias Labeling in an Echo Chamber

  • Rui Hu
  • Yahan Tu
  • Jitao Sang

Neural networks often learn spurious correlations when exposed to biased training data, leading to poor performance on out-of-distribution data. A biased dataset can be divided, according to biased features, into bias-aligned samples (i.e., with biased features) and bias-conflicting samples (i.e., without biased features). Recent debiasing works typically assume that no bias label is available during the training phase, as obtaining such information is challenging and labor-intensive. Following this unsupervised assumption, existing methods usually train two models: a biased model specialized to learn biased features and a target model that uses information from the biased model for debiasing. This paper first presents experimental analyses revealing that the existing biased models overfit to bias-conflicting samples in the training data, which negatively impacts the debiasing performance of the target models. To address this issue, we propose a straightforward and effective method called Echoes, which trains a biased model and a target model with a different strategy. We construct an "echo chamber" environment by reducing the weights of samples which are misclassified by the biased model, to ensure the biased model fully learns the biased features without overfitting to the bias-conflicting samples. The biased model then assigns lower weights on the bias-conflicting samples. Subsequently, we use the inverse of the sample weights of the biased model for training the target model. Experiments show that our approach achieves superior debiasing results compared to the existing baselines on both synthetic and real-world datasets. Our code is available at
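The reweighting idea described above can be sketched in a few lines. The multiplicative decay factor, the normalization of the target weights, and the function names are assumptions made for illustration; the abstract does not specify the exact update rule.

```python
def update_biased_weights(weights, misclassified, decay=0.5):
    """Shrink the weight of every sample the biased model currently gets wrong.

    `decay` is a hypothetical hyperparameter, not a value taken from the paper.
    """
    return [w * decay if miss else w for w, miss in zip(weights, misclassified)]


def target_weights(biased_weights):
    """Use the inverse of the biased-model weights, normalized to sum to 1."""
    inverse = [1.0 / w for w in biased_weights]
    total = sum(inverse)
    return [v / total for v in inverse]


# Toy run: sample 3 plays the role of a bias-conflicting sample that the
# biased model keeps misclassifying over two rounds.
weights = [1.0, 1.0, 1.0, 1.0]
for _ in range(2):
    weights = update_biased_weights(weights, [False, False, False, True])

final_target = target_weights(weights)
```

After two rounds the bias-conflicting sample has weight 0.25 for the biased model, so it receives four times the weight of each other sample when training the target model.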

FedCE: Personalized Federated Learning Method based on Clustering Ensembles

  • Luxin Cai
  • Naiyue Chen
  • Yuanzhouhan Cao
  • Jiahuan He
  • Yidong Li

Federated learning (FL) is a privacy-aware computing framework that enables multiple clients to collaborate in solving machine learning problems. In real scenarios, non-IID data held by different edge devices will degrade the performance of global FL models. To address this issue, most FL methods utilize clustering algorithms to group clients with similar distributions. However, these methods do not fully utilize the distribution features of client data, resulting in a lack of generalization in the cluster model. To make the clusters better suited to the distribution features of user data, we propose a clustering-ensemble based federated learning method (FedCE) that associates each client with multiple clusters. We extract the features of client distributions to quantify the relationship between clients and clusters, and optimize the local model of clients through the historical performance of the cluster model. Furthermore, we dynamically estimate the number of clusters each client belongs to through the diversity of client performance. We conduct experiments on scenarios with mixtures of two distributions, mixtures of three distributions, and Dirichlet distributions. The results show that the FedCE algorithm performs better than state-of-the-art clustered FL methods in both cluster and client models under different data distributions.

SESSION: Oral Session X: Multimedia systems -- Data Systems Management and Indexing

Relative NN-Descent: A Fast Index Construction for Graph-Based Approximate Nearest Neighbor Search

  • Naoki Ono
  • Yusuke Matsui

Approximate Nearest Neighbor Search (ANNS) is the task of finding the database vector that is closest to a given query vector. Graph-based ANNS is the family of methods with the best balance of accuracy and speed for million-scale datasets. However, graph-based methods have the disadvantage of long index construction time. Recently, many researchers have improved the tradeoff between accuracy and speed during a search. However, there is little research on accelerating index construction. We propose a fast graph construction algorithm, Relative NN-Descent (RNN-Descent). RNN-Descent combines NN-Descent, an algorithm for constructing approximate K-nearest neighbor graphs (K-NN graphs), and RNG Strategy, an algorithm for selecting edges effective for search. This algorithm allows the direct construction of graph-based indexes without ANNS. Experimental results demonstrate that the proposed method has the fastest index construction speed, while its search performance is comparable to existing state-of-the-art methods such as NSG. For example, in experiments on the GIST1M dataset, index construction with the proposed method is 2x faster than with NSG. Additionally, it is even faster than the construction speed of NN-Descent.
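The edge-selection step mentioned above can be sketched as follows. This is the RNG-style pruning rule as it is commonly formulated for graph indexes: candidates are visited in order of distance, and a candidate is kept only if no already kept neighbor is closer to it than the node itself. The function name and the `max_degree` parameter are illustrative assumptions, and the sketch does not show how RNN-Descent interleaves this rule with NN-Descent updates.

```python
import math


def rng_select(point, candidates, max_degree=None):
    """Select neighbors of `point` using an RNG-style pruning rule."""
    ordered = sorted(candidates, key=lambda c: math.dist(point, c))
    selected = []
    for cand in ordered:
        d_point_cand = math.dist(point, cand)
        # Keep the candidate only if every kept neighbor is at least as far
        # from it as `point` is; otherwise the edge is considered redundant.
        if all(math.dist(cand, kept) >= d_point_cand for kept in selected):
            selected.append(cand)
            if max_degree is not None and len(selected) == max_degree:
                break
    return selected


# Toy 2D example: (2, 0) lies behind (1, 0) as seen from the origin, so it is
# pruned, while (0, 1.5) points in a different direction and is kept.
neighbors = rng_select((0.0, 0.0), [(2.0, 0.0), (1.0, 0.0), (0.0, 1.5)])
```

Pruning edges this way keeps the out-degree low while preserving edges in diverse directions, which is what makes the resulting graph effective for search.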

Flexible and Secure Watermarking for Latent Diffusion Model

  • Cheng Xiong
  • Chuan Qin
  • Guorui Feng
  • Xinpeng Zhang

With the significant advancements and open-source support of latent diffusion models (LDMs) in the field of image generation, numerous researchers and enterprises have started fine-tuning pre-trained models to generate specialized images for different objectives. However, criminals may also turn to LDMs to generate images and then carry out illegal activities. Watermarking is a typical solution to this problem. However, post-hoc watermarking methods can be easily bypassed to obtain non-watermarked images, and the existing watermarking methods designed for LDMs can only embed a fixed message, i.e., the to-be-embedded message cannot be changed without retraining the model. Therefore, in this work, we propose an end-to-end watermarking method based on the encoder-decoder (ENDE) and message-matrix. The message can be embedded into generated images by fusing the message-matrix with intermediate outputs in the forward propagation of LDM-based image generation. Thus, the message can be flexibly changed by using the message-encoder to generate a new message-matrix, without training the LDM again. In addition, the security mechanism in our watermarking method can defeat attacks in which users bypass the message-matrix during image generation. A series of experiments demonstrates the effectiveness and superiority of our watermarking method compared with SOTA methods.

CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing

  • Rukai Wei
  • Yu Liu
  • Jingkuan Song
  • Heng Cui
  • Yanzhao Xie
  • Ke Zhou

Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos due to less challenging and unreliable learning tasks. To address these challenges, we begin by utilizing the contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released.

SESSION: Oral Session XI: Multimedia systems -- Systems and Middleware, Transport and Delivery

Pagoda: Privacy Protection for Volumetric Video Streaming through Poisson Diffusion Model

  • Rui Lu
  • Lai Wei
  • Shuntao Zhu
  • Chuang Hu
  • Dan Wang

With the increasing popularity of 3D volumetric video applications, e.g., metaverse, AR/VR, etc., there is a growing need to protect users' privacy while sharing their experiences during streaming. In this paper, we show that the existing privacy-preserving approaches for dense point clouds suffer from a massive computation cost and degrade the quality of the streaming experience. We design Pagoda, a new PrivAcy-preservinG VOlumetric ViDeo StreAming system incorporating the MPEG V-PCC standard, which protects privacy information in different domains of dense point clouds and maintains high throughput. The core idea is to transform the privacy attribute information to the geometry domain in a content-aware manner and to protect the geometry information in a content-agnostic manner by adding Poisson noise perturbations. These perturbations can be denoised through a Poisson diffusion probabilistic model that we design for deployment on the cloud. Users only need to encrypt a small amount of highly sensitive information to achieve secure streaming. Our designs ensure that dense point clouds can be transmitted in high quality and that attackers cannot reconstruct the original one. We evaluate Pagoda using three volumetric video datasets. The results show that Pagoda outperforms existing privacy-preserving baselines, with a 75.6% improvement in protection capability, 4.27 times the streaming quality, and a 26 times latency reduction.

ScaleFlow: Efficient Deep Vision Pipeline with Closed-Loop Scale-Adaptive Inference

  • Yuyang Leng
  • Renyuan Liu
  • Hongpeng Guo
  • Songqing Chen
  • Shuochao Yao

Deep visual data processing underpins many life-changing applications, such as autonomous driving and smart cities. Improving accuracy while minimizing inference time under constrained resources has been the primary pursuit for their practical adoption. Existing research has thus been devoted to either narrowing down the area of interest for detection or miniaturizing the deep learning model for faster inference. However, the former risks missing or delaying the detection of small but important objects, potentially leading to disastrous consequences (e.g., car accidents), while the latter often compromises accuracy without fully utilizing intrinsic semantic information. To overcome these limitations, in this work, we propose ScaleFlow, a closed-loop scale-adaptive inference pipeline that can reduce model inference time by progressively processing vision data with increasing resolution but decreasing spatial size, achieving speedup without compromising accuracy. For this purpose, ScaleFlow refactors existing neural networks to be scale-equivariant on multiresolution data with the assistance of wavelet theory, providing predictable feature patterns on different data resolutions. Comprehensive experiments have been conducted to evaluate ScaleFlow. The results show that ScaleFlow can support anytime inference, consistently provide a 1.5× to 2.2× speedup, and save around 25% to 45% of energy consumption with < 1% accuracy loss on four embedded and edge platforms.

Optimizing Adaptive Video Streaming with Human Feedback

  • Tianchi Huang
  • Rui-Xiao Zhang
  • Chenglei Wu
  • Lifeng Sun

Quality of Experience (QoE)-driven adaptive bitrate (ABR) algorithms are typically optimized using QoE models that are based on the mean opinion score (MOS). However, such principles may not account for user heterogeneity in rating scales, resulting in unexpected behaviors. In this paper, we propose Jade, which leverages reinforcement learning with human feedback (RLHF) technologies to better align with users' opinion scores. Jade's rank-based QoE model considers relative values of user ratings to interpret the subjective perception of video sessions. We implement linear-based and Deep Neural Network (DNN)-based architectures to satisfy both accuracy and generalization ability. We further propose entropy-aware reinforced mechanisms for training policies with the integration of the proposed QoE models. Experimental results demonstrate that Jade performs favorably on conventional metrics, such as quality and stall ratio, and improves QoE by 8.09%-38.13% in different network conditions, emphasizing the importance of user heterogeneity in QoE modeling and the potential of combining linear-based and DNN-based models for performance improvement.
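To illustrate why a rank-based QoE model is insensitive to each user's rating scale, the sketch below derives preference pairs from one user's ratings and scores a candidate QoE model with a pairwise logistic loss. The loss form (a Bradley-Terry style objective common in RLHF reward modeling) and all names are assumptions made for illustration; the abstract does not state Jade's exact objective.

```python
import math


def ranking_pairs(ratings):
    """Return (better, worse) index pairs among one user's rated sessions."""
    n = len(ratings)
    return [(i, j) for i in range(n) for j in range(n) if ratings[i] > ratings[j]]


def pairwise_rank_loss(score_preferred, score_other):
    """Logistic loss, -log(sigmoid(score_preferred - score_other))."""
    return math.log1p(math.exp(-(score_preferred - score_other)))


# Two hypothetical users rate the same two sessions on different personal scales.
pairs_generous_user = ranking_pairs([5, 4])
pairs_strict_user = ranking_pairs([3, 1])

# A model that scores session 0 above session 1 agrees with both users.
loss_agree = pairwise_rank_loss(2.0, 0.0)
loss_disagree = pairwise_rank_loss(0.0, 2.0)
```

Both users yield the same preference pair even though their absolute scores differ, so only the ordering of sessions drives the training signal.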

SESSION: Poster Session I: Understanding Multimedia Content -- Media Interpretation

M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

  • Hao Tang
  • Jun Liu
  • Shuanglin Yan
  • Rui Yan
  • Zechao Li
  • Jinhui Tang

Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

CUCL: Codebook for Unsupervised Continual Learning

  • Chen Cheng
  • Jingkuan Song
  • Xiaosu Zhu
  • Junchen Zhu
  • Lianli Gao
  • Hengtao Shen

The focus of this study is on Unsupervised Continual Learning (UCL), as it presents an alternative to Supervised Continual Learning, which needs high-quality manually labeled data. Experiments under the UCL paradigm indicate a phenomenon where the results on the first few tasks are suboptimal. This phenomenon can render the model inappropriate for practical applications. To address this issue, after analyzing the phenomenon and identifying the lack of diversity as a vital factor, we propose a method named Codebook for Unsupervised Continual Learning (CUCL), which encourages the model to learn discriminative features to complete the class boundary. Specifically, we first introduce a Product Quantization to inject diversity into the representation and apply a cross quantized contrastive loss between the original representation and the quantized one to capture discriminative information. Then, based on the quantizer, we propose an effective Codebook Rehearsal to address catastrophic forgetting. We conduct extensive experiments on the CIFAR100, TinyImageNet, and MiniImageNet benchmark datasets. Our method significantly boosts the performance of supervised and unsupervised methods. For instance, on TinyImageNet, our method leads to relative improvements of 12.76% and 7% when compared with Simsiam and BYOL, respectively. Codes are publicly available at
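The Product Quantization step named above can be sketched generically: the representation is split into sub-vectors, and each sub-vector is replaced by its nearest codeword from a per-subspace codebook. The function name, the squared Euclidean distance, and the toy codebooks are assumptions made for illustration; how CUCL learns its codebooks and applies the cross quantized contrastive loss is not shown.

```python
def product_quantize(vector, codebooks):
    """Quantize `vector` with one codebook per equally sized sub-vector.

    Returns the chosen codeword index per subspace and the concatenated
    quantized vector.
    """
    sub_dim = len(vector) // len(codebooks)
    codes, quantized = [], []
    for k, book in enumerate(codebooks):
        sub = vector[k * sub_dim:(k + 1) * sub_dim]
        # Pick the codeword with the smallest squared Euclidean distance.
        best = min(
            range(len(book)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(sub, book[i])),
        )
        codes.append(best)
        quantized.extend(book[best])
    return codes, quantized


# Hypothetical 4-dimensional representation with two 2-dimensional subspaces.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],   # codebook for dimensions 0-1
    [[1.0, 0.0], [-1.0, 0.0]],  # codebook for dimensions 2-3
]
codes, quantized = product_quantize([0.9, 0.1, -0.8, 0.2], codebooks)
```

Because each subspace picks its codeword independently, M codebooks of K codewords can express K^M distinct quantized vectors while storing only M*K codewords.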

Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning

  • Yang Liu
  • Chen Chen
  • Can Wang
  • Xulin King
  • Mengyuan Liu

Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for both 2D and 3D computer vision. Nevertheless, existing MAE-based methods still have certain drawbacks. Firstly, the functional decoupling between the encoder and decoder is incomplete, which limits the encoder's representation learning ability. Secondly, downstream tasks solely utilize the encoder, failing to fully leverage the knowledge acquired through the encoder-decoder architecture in the pre-text task. In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning. The proposed method decouples functions between the decoder and the encoder by introducing a mask regressor, which predicts the masked patch representation from the visible patch representation encoded by the encoder; the decoder then reconstructs the target from the predicted masked patch representation. By doing so, we minimize the impact of decoder updates on the representation space of the encoder. Moreover, we introduce an alignment constraint to ensure that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder. To make full use of the knowledge learned in the pre-training stage, we design a new finetune mode for the proposed Point-RAE. Extensive experiments demonstrate that our approach is efficient during pre-training and generalizes well on various downstream tasks. Specifically, our pre-trained models achieve a high accuracy of 90.28% on the ScanObjectNN hardest split and 94.1% accuracy on ModelNet40, surpassing all the other self-supervised learning methods. Our code and pretrained model are publicly available at:

CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

  • Bo Wang
  • Zhao Zhang
  • Suiyi Zhao
  • Haijun Zhang
  • Richang Hong
  • Meng Wang

Transformer-based approaches to image captioning have shown great success by utilizing long-term dependency for visual embedding. However, their coarse long-term dependency, which uses the multi-head self-attention mechanism to capture the contextual interactions between the visual tokens on the time step and (or) embedded dimension, fails to distinguish fine-grained features of local partitions. In this case, some similar features are captured, which leads to feature redundancy that decreases performance. To address this issue, this paper proposes a novel image captioner embedding visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence generated from the Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module to model, at a finer granularity, the interaction between partial representations on both the time step and embedded dimension. Furthermore, we formally derive the proposed cross-partition dependency and theoretically prove its correctness. Extensive comparisons on the benchmark MS-COCO dataset demonstrate its effectiveness in addressing the information redundancy issue and verify the superior performance of our method.

Generalizing Face Forgery Detection via Uncertainty Learning

  • Yanqi Wu
  • Xue Song
  • Jingjing Chen
  • Yu-Gang Jiang

Current face forgery detection methods have made significant progress in achieving high intra-dataset accuracy by building a deterministic binary detector. However, deterministic networks cannot effectively capture noise and distribution shifts in the input, which makes them less robust and prone to poor generalization in real-world scenarios. To address this problem, in this paper, we propose an Uncertainty-Aware Learning (UAL) method for face forgery detection. Specifically, we extend the Transformer model in a probabilistic manner by modeling dependencies between patches as Gaussian random variables. Additionally, we introduce a Patch Selection Module that can efficiently and accurately identify discriminative regions with high-uncertainty information, which are further utilized for final classification. Furthermore, with the quantified uncertainty of the entire image, we design a novel Uncertainty-Aware One-Center Loss that enhances intra-class compactness for genuine faces only, thereby improving the inter-class separability in the embedding space. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, and the results verify that our Uncertainty-Aware Learning method achieves better robustness and generalization ability than other state-of-the-art methods.

Object Detection Difficulty: Suppressing Over-aggregation for Faster and Better Video Object Detection

  • Bingqing Zhang
  • Sen Wang
  • Yifan Liu
  • Brano Kusy
  • Xue Li
  • Jiajun Liu

Current video object detection (VOD) models often encounter issues with over-aggregation due to redundant aggregation strategies, which perform feature aggregation on every frame. This results in suboptimal performance and increased computational complexity. In this work, we propose an image-level Object Detection Difficulty (ODD) metric to quantify the difficulty of detecting objects in a given image. The derived ODD scores can be used in the VOD process to mitigate over-aggregation. Specifically, we train an ODD predictor as an auxiliary head of a still-image object detector to compute the ODD score for each image based on the discrepancies between detection results and ground-truth bounding boxes. The ODD score enhances the VOD system in two ways: 1) it enables the VOD system to select superior global reference frames, thereby improving overall accuracy; and 2) it serves as an indicator in the newly designed ODD Scheduler to eliminate the aggregation of frames that are easy to detect, thus accelerating the VOD process. Comprehensive experiments demonstrate that, when utilized for selecting global reference frames, ODD-VOD consistently enhances the accuracy of Global-frame-based VOD models. When employed for acceleration, ODD-VOD consistently improves the frames per second (FPS) by an average of 73.3% across 8 different VOD models without sacrificing accuracy. When combined, ODD-VOD attains state-of-the-art performance when competing with many VOD methods in both accuracy and speed. Our work represents a significant advancement towards making VOD more practical for real-world applications. The code will be released at

Mutual-Guided Dynamic Network for Image Fusion

  • Yuanshen Guan
  • Ruikang Xu
  • Mingde Yao
  • Lizhi Wang
  • Zhiwei Xiong

Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem by leveraging static convolutional neural networks (CNNs), which suffer from two inherent limitations during feature extraction, i.e., being unable to handle spatial-variant contents and lacking guidance from multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from different inputs and the latter generates spatial-variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse local and global information of the extracted features. To further reduce the redundancy among the extracted features while simultaneously preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at:

Frequency Representation Integration for Camouflaged Object Detection

  • Chenxi Xie
  • Changqun Xia
  • Tianshu Yu
  • Jia Li

Recent camouflaged object detection (COD) approaches have been proposed to accurately segment objects blended into their surroundings. The most challenging and critical issue in COD is to find the lines of demarcation between objects and background in the camouflage environment. Because of the similarity between the target object and the background, these lines are difficult to find accurately. However, they are easier to observe in different frequency components of the image. To this end, in this paper we rethink COD from the perspective of frequency components and propose a Frequency Representation Integration Network to mine informative cues from them. Specifically, we obtain high-frequency components from the original image by Laplacian pyramid-like decomposition, and then send the image to a transformer-based encoder and the frequency components to a tailored CNN-based Residual Frequency Array Encoder. In addition, we utilize the multi-head self-attention in the transformer encoder to capture low-frequency signals, which can effectively parse the overall contextual information of camouflage scenes. We also design a Frequency Representation Reasoning Module, which progressively eliminates discrepancies between differentiated frequency representations and integrates them by modeling their point-wise relations. Moreover, to further bridge different frequency representations, we introduce the image reconstruction task to implicitly guide their integration. Extensive experiments on three widely used COD benchmark datasets demonstrate that our method surpasses existing state-of-the-art methods by a large margin.
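The idea behind a Laplacian pyramid-like decomposition can be shown on a one-dimensional signal: a low-pass (blurred) copy is subtracted from the original, and what remains is the high-frequency component, which is non-zero only near edges. The 3-tap kernel, the edge replication, the omission of the down- and upsampling steps, and the function names are simplifying assumptions made for illustration, not the paper's exact decomposition.

```python
def blur(signal):
    """Low-pass filter with a [1, 2, 1] / 4 kernel and replicated borders."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out


def high_frequency(signal):
    """One Laplacian-style level: the original minus its low-pass version."""
    low = blur(signal)
    return [s - l for s, l in zip(signal, low)]


# A flat region carries no high-frequency content; a step edge does.
flat_response = high_frequency([2.0, 2.0, 2.0, 2.0])
edge_response = high_frequency([0.0, 0.0, 1.0, 1.0])
```

The response vanishes on the flat signal and is concentrated at the step, which matches the intuition that object boundaries are easier to observe in high-frequency components.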

DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation

  • Tao Wang
  • Lei Jin
  • Zhang Wang
  • Xiaojin Fan
  • Yu Cheng
  • Yinglei Teng
  • Junliang Xing
  • Jian Zhao

Multi-person pose estimation in crowded scenes remains a very challenging task. This paper finds that most previous methods fail at estimating or grouping visible keypoints in crowded scenes, rather than at reasoning about invisible keypoints. We thus categorize the crowded scenes into entanglement and occlusion based on the visibility of human parts and observe that entanglement is a significant problem in crowded scenes. With this observation, we propose DecenterNet, an end-to-end deep architecture to perform robust and efficient pose estimation in crowded scenes. Within DecenterNet, we introduce a decentralized pose representation that uses all visible keypoints as the root points to represent human poses, which is more robust in the entanglement area. We also propose a decoupled pose assessment mechanism, which introduces a location map to adaptively select optimal poses in the offset map. In addition, we have constructed a new dataset named SkatingPose, containing more entangled scenes. The proposed DecenterNet surpasses the best method on SkatingPose by 1.8 AP. Furthermore, DecenterNet obtains 71.2 AP and 71.4 AP on the COCO and CrowdPose datasets, respectively, demonstrating the superiority of our method. We will release our source code, trained models, and dataset to facilitate further studies in this research direction. Our code and dataset are available in

Improving Scene Graph Generation with Superpixel-Based Interaction Learning

  • Jingyi Wang
  • Can Zhang
  • Jinfa Huang
  • Botao Ren
  • Zhidong Deng

Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities utilizing box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interactions between boxes, which inadequately capture contextual semantics for relationship modeling, practically limiting the development of the field. In this paper, we take the initiative to explore and propose a generic paradigm termed Superpixel-based Interaction Learning (SIL) to remedy coarse-grained interactions at the box level. It allows us to model fine-grained interactions at the superpixel level in SGG. Specifically, (i) we treat a scene as a set of points and cluster them into superpixels representing sub-regions of the scene. (ii) We explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage. Extensive experiments on two challenging benchmarks (Visual Genome and Open Image V6) prove that our SIL enables fine-grained interaction at the superpixel level above previous box-level methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing box-level approaches in a plug-and-play fashion. In particular, SIL brings an average improvement of 2.0% mR (even up to 3.4%) of baselines for the PredCls task on Visual Genome, which facilitates its integration into any existing box-level method.

Lifelong Scene Text Recognizer via Expert Modules

  • Shifeng Xia
  • Lin Geng
  • Ningzhong Liu
  • Han Sun
  • Jie Qin

Scene text recognition (STR) has been actively studied in recent years, with a wide range of applications in autonomous driving, image retrieval and much more. However, when a pre-trained deep STR model learns a new task, its performance on previous tasks may drop dramatically, due to catastrophic forgetting in deep neural networks. A potential solution to combat the forgetting of prior knowledge is incremental learning (IL), which has shown its effectiveness and significant progress in image classification. Yet, exploiting IL in the context of STR has been barely visited, probably because the forgetting problem is even worse in STR. To address this issue, we propose the lifelong scene text recognizer (LSTR) that learns STR tasks incrementally while alleviating forgetting. Specifically, LSTR assigns each task a set of task-specific expert modules at different stages of an STR model, while other parameters are shared among tasks. These shared parameters are only learned in the first task and remain unchanged during subsequent learning to ensure that no learned knowledge is overlooked. Moreover, in real applications, there is no prior knowledge about which task an input image belongs to, making it impossible to precisely select the corresponding expert modules. To this end, we propose the incremental task prediction network (ITPN) to identify the most related task category by pulling the features of the same task closer and pushing those of different tasks farther apart. To validate the proposed method in our newly-introduced IL setting, we collected a large-scale dataset consisting of both real and synthetic multilingual STR data. Extensive experiments on this dataset clearly show the superiority of our LSTR over state-of-the-art IL methods.

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

  • Zhen Ye
  • Wei Xue
  • Xu Tan
  • Jie Chen
  • Qifeng Liu
  • Yike Guo

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that, by generating audio recordings with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples and code are available at https://comospeech.github.

Exploring Motion Cues for Video Test-Time Adaptation

  • Runhao Zeng
  • Qi Deng
  • Huixuan Xu
  • Shuaicheng Niu
  • Jian Chen

Test-time adaptation (TTA) aims at boosting the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Though TTA on image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively applying image-based TTA methods to video tasks may yield limited performance, since these methods do not consider the special nature of video tasks, e.g., the motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features using two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme to align fast motion and slow appearance features, thereby enhancing the motion encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy to learn a joint feature space for clips sampled at the fast and slow rates, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme to provide better adaptation supervision by selecting a more reliable pseudo-negative label compared to the pseudo-positive label used in prior TTA methods. This technique reduces the adaptation difficulty often caused by poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.

Perceiving Ambiguity and Semantics without Recognition: An Efficient and Effective Ambiguous Scene Text Detector

  • Yan Shu
  • Wei Wang
  • Yu Zhou
  • Shaohui Liu
  • Aoting Zhang
  • Dongbao Yang
  • Weiping Wang

Ambiguous scene text detection is an extremely challenging task. Existing text detectors that rely solely on visual cues often suffer from confusion when characters are evenly distributed in rows/columns, or from incomplete detection owing to large character spacing. To overcome these challenges, the previous method recognizes a large number of proposals and utilizes semantic information predicted from recognition results to eliminate ambiguity. However, this method is inefficient, which limits its practical applications. In this paper, we propose a novel efficient and effective ambiguous text detector, which can Perceive Ambiguity and SEmantics without Recognition, termed PASER. On the one hand, PASER can perceive semantics without recognition with a light Perceiving Semantics (PerSem) module. In this way, proposals without reasonable semantics are filtered out, which largely speeds up the overall detection process. On the other hand, to detect both ambiguous and regular texts with a unified framework, PASER employs a Perceiving Ambiguity (PerAmb) module to distinguish ambiguous texts from regular texts, so that only the ambiguous proposals are processed by PerSem while the regular texts are not, which further ensures high efficiency. Extensive experiments show that our detector achieves state-of-the-art results on both ambiguous and regular scene text detection benchmarks. Notably, it is over 6 times faster and simultaneously more accurate on TDA-ReCTS.

Single-Stage Multi-human Parsing via Point Sets and Center-based Offsets

  • Jiaming Chu
  • Lei Jin
  • Xiaojin Fan
  • Yinglei Teng
  • Yunchao Wei
  • Yuqiang Fang
  • Junliang Xing
  • Jian Zhao

This work studies the multi-human parsing problem. Existing methods, following either top-down or bottom-up two-stage paradigms, usually involve expensive computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and parts. SMP leverages the point features in the barycenter positions to obtain their segmentation and then generates a series of offsets from the barycenter of the human body to the barycenters of parts, thus performing human body and parts matching without the grouping process. Within the SMP architecture, we propose a Refined Feature Retain module to extract the global feature of instances through generated mask attention and a Mask of Interest Reclassify module as a trainable plug-in module to refine the classification results with the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the effectiveness and efficiency of the proposed method, surpassing the state-of-the-art method by 2.1% in AP50p, 1.0% in APvolp, and 1.2% in PCP50. Moreover, SMP also achieves superior performance in DensePose-COCO, verifying generalization of the model. In particular, the proposed method requires fewer training epochs and a less complex model architecture. Our codes are released in

Partitioned Saliency Ranking with Dense Pyramid Transformers

  • Chengxiao Sun
  • Yan Xu
  • Jialun Pei
  • Haopeng Fang
  • He Tang

In recent years, saliency ranking has emerged as a challenging task focusing on assessing the degree of saliency at the instance level. Because the task is subjective, even humans struggle to identify the precise order of all salient instances. Previous approaches perform saliency ranking by directly sorting the rank scores of salient instances, which does not explicitly resolve the inherent ambiguities. To overcome this limitation, we propose the ranking by partition paradigm, which segments unordered salient instances into partitions and then ranks them based on the correlations among these partitions. The ranking by partition paradigm alleviates ranking ambiguities in a general sense, as it consistently improves the performance of other saliency ranking models. Additionally, we introduce the Dense Pyramid Transformer (DPT) to enable global cross-scale interactions, which significantly enhances feature interactions with reduced computational burden. Extensive experiments demonstrate that our approach outperforms all existing methods. The code for our method is available at

CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation

  • Jianbiao Mei
  • Yu Yang
  • Mengmeng Wang
  • Zizhang Li
  • Xiaojun Hou
  • Jongwon Ra
  • Laijian Li
  • Yong Liu

This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.

Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided Enhancement

  • Zhenhua Ning
  • Zhuotao Tian
  • Guangming Lu
  • Wenjie Pei

Although extensive research has been conducted on 3D point cloud segmentation, effectively adapting generic models to novel categories remains a formidable challenge. This paper proposes a novel approach to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods that directly utilize categorical information from support prototypes to recognize novel classes in query samples, our method identifies two critical aspects that substantially enhance model performance by reducing contextual gaps between support prototypes and query features. Specifically, we (1) adapt support background prototypes to match query context while removing extraneous cues that may obscure foreground and background in query samples, and (2) holistically rectify support prototypes under the guidance of query features to emulate the latter having no semantic gap to the query targets. Our proposed designs are agnostic to the feature extractor, rendering them readily applicable to any prototype-based methods. The experimental results on S3DIS and ScanNet demonstrate notable practical benefits, as our approach achieves significant improvements while still maintaining high efficiency. The code for our approach is available at

PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptative Semantic Segmentation

  • Mu Chen
  • Zhedong Zheng
  • Yi Yang
  • Tat-Seng Chua

Unsupervised Domain Adaptation (UDA) aims to enhance the generalization of the learned model to other domains. The domain-invariant knowledge is transferred from the model trained on labeled source domain, e.g., video game, to unlabeled target domains, e.g., real-world scenarios, saving annotation expenses. Existing UDA methods for semantic segmentation usually focus on minimizing the inter-domain discrepancy of various levels, e.g., pixels, features, and predictions, for extracting domain-invariant knowledge. However, the primary intra-domain knowledge, such as context correlation inside an image, remains under-explored. In an attempt to fill this gap, we revisit the current pixel contrast in semantic segmentation and propose a unified pixel- and patch-wise self-supervised learning framework, called PiPa, for domain adaptive semantic segmentation that facilitates intra-image pixel-wise correlations and patch-wise semantic consistency against different contexts. The proposed framework exploits the inherent structures of intra-domain images, which: (1) explicitly encourages learning the discriminative pixel-wise features with intra-class compactness and inter-class separability, and (2) motivates the robust feature learning of the identical patch against different contexts or fluctuations. Extensive experiments verify the effectiveness of the proposed method, which obtains competitive accuracy on the two widely-used UDA benchmarks, e.g., 75.6 mIoU on GTA→Cityscapes and 68.2 mIoU on Synthia→Cityscapes. Moreover, our method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.

Weakly-Supervised Text Instance Segmentation

  • Xinyan Zu
  • Haiyang Yu
  • Bin Li
  • Xiangyang Xue

Text segmentation is a challenging computer vision task with many downstream applications. Current text segmentation models need to be trained with pixel-level annotations, which incur high labor costs. In this paper, we make the first attempt to perform weakly-supervised text instance segmentation by bridging text recognition and text segmentation. We observe that text recognition models are able to produce the attention localization of each text instance. Based on this observation, we propose a two-stage Text Adaptive Refinement (TAR) module to generate pseudo labels based on the attention map of a text recognizer. Meanwhile, we develop a text segmentation module that takes the rough attention location as input to predict segmentation masks, which are supervised by the aforementioned pseudo labels. In addition, we introduce mask-augmented contrastive learning by treating the segmentation result as an augmented version of the input text image, thus improving the visual representation and further enhancing the performance of both recognition and segmentation. The experimental results demonstrate that the proposed method outperforms the state-of-the-art (SOTA) weakly-supervised generic segmentation methods by 18.95% and 17.80% in fgIoU on ICDAR13-FST and TextSeg, respectively. On MLT-S, COCO-TS and Total-Text, the proposed method achieves about 82% of the fully-supervised methods' performance. When evaluated on instance segmentation, the proposed method exceeds existing SOTA methods by 23.32% and 21.34% on ICDAR13-FST and TextSeg, respectively. Code and Supplementary Materials are available at

PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

  • Wenjie Xuan
  • Shanshan Zhao
  • Yu Yao
  • Juhua Liu
  • Tongliang Liu
  • Yixin Chen
  • Bo Du
  • Dacheng Tao

Relying on large-scale training data with pixel-level labels, previous edge detection methods have achieved high performance. However, it is hard to manually label edges accurately, especially for large datasets, and thus the datasets inevitably contain noisy labels. This label-noise issue has been studied extensively for classification, while still remaining under-explored for edge detection. To address the label-noise issue for edge detection, this paper proposes to learn Pixel-level Noise Transitions to model the label-corruption process. To achieve it, we develop a novel Pixel-wise Shift Learning (PSL) module to estimate the transition from clean to noisy labels as a displacement field. Exploiting the estimated noise transitions, our model, named PNT-Edge, is able to fit the prediction to clean labels. In addition, a local edge density regularization term is devised to exploit local structure information for better transition learning. This term encourages learning large shifts for the edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of our method in relieving the impact of label noise. Codes will be available at

Video Frame Interpolation with Flow Transformer

  • Pan Gao
  • Haoyue Tian
  • Jie Qin

Video frame interpolation has been actively studied with the development of convolutional neural networks. However, due to the intrinsic limitations of kernel weight sharing in convolution, the interpolated frame generated by it may lose details. In contrast, the attention mechanism in Transformer can better distinguish the contribution of each pixel, and it can also capture long-range pixel dependencies, which provides great potential for video interpolation. Nevertheless, the original Transformer is commonly used for 2D images; how to develop a Transformer-based framework with consideration of temporal self-attention for video frame interpolation remains an open issue. In this paper, we propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism. Specifically, we design a Flow Transformer Block that calculates the temporal self-attention in a matched local area with the guidance of flow, making our framework suitable for interpolating frames with large motion while maintaining reasonably low complexity. In addition, we construct a multi-scale architecture to account for multi-scale motion, further improving the overall performance. Extensive experiments on three benchmarks demonstrate that the proposed method can generate interpolated frames with better visual quality than state-of-the-art methods.

DUSA: Decoupled Unsupervised Sim2Real Adaptation for Vehicle-to-Everything Collaborative Perception

  • Xianghao Kong
  • Wentao Jiang
  • Jinrang Jia
  • Yifeng Shi
  • Runsheng Xu
  • Si Liu

Vehicle-to-Everything (V2X) collaborative perception is crucial for the advancement of autonomous driving. However, achieving high-precision V2X perception requires a significant amount of annotated real-world data, which is often expensive and hard to acquire. Simulated data have attracted much attention since they can be produced in large quantities at an extremely low cost. Nevertheless, the significant domain gap between simulated and real-world data, including differences in sensor type, reflectance patterns, and road surroundings, often leads to poor performance of models trained on simulated data when evaluated on real-world data. In addition, there remains a domain gap between real-world collaborative agents, e.g., different types of sensors may be installed on autonomous vehicles and roadside infrastructures with different extrinsics, further increasing the difficulty of sim2real generalization. To take full advantage of simulated data, we present a new unsupervised sim2real domain adaptation method for V2X collaborative detection named Decoupled Unsupervised Sim2Real Adaptation (DUSA). Our new method decouples the V2X collaborative sim2real domain adaptation problem into two sub-problems: sim2real adaptation and inter-agent adaptation. For sim2real adaptation, we design a Location-adaptive Sim2Real Adapter (LSA) module to adaptively aggregate features from critical locations of the feature map and align the features between simulated data and real-world data via a sim/real discriminator on the aggregated global feature. For inter-agent adaptation, we further devise a Confidence-aware Inter-agent Adapter (CIA) module to align the fine-grained features from heterogeneous agents under the guidance of agent-wise confidence maps. Experiments demonstrate the effectiveness of the proposed DUSA approach on unsupervised sim2real adaptation from the simulated V2XSet dataset to the real-world DAIR-V2X-C dataset.

Explicifying Neural Implicit Fields for Efficient Dynamic Human Avatar Modeling via a Neural Explicit Surface

  • Ruiqi Zhang
  • Jie Chen
  • Qiang Wang

This paper proposes a technique for efficiently modeling dynamic humans by explicifying the implicit neural fields via a Neural Explicit Surface (NES). Implicit neural fields have advantages over traditional explicit representations in modeling dynamic 3D content from sparse observations and effectively representing complex geometries and appearances. Implicit neural fields defined in 3D space, however, are expensive to render due to the need for dense sampling during volumetric rendering. Moreover, their memory efficiency can be further optimized when modeling sparse 3D space. To overcome these issues, the paper proposes utilizing Neural Explicit Surface (NES) to explicitly represent implicit neural fields, facilitating memory and computational efficiency. To achieve this, the paper creates a fully differentiable conversion between the implicit neural fields and the explicit rendering interface of NES, leveraging the strengths of both implicit and explicit approaches. This conversion enables effective training of the hybrid representation using implicit methods and efficient rendering by integrating the explicit rendering interface with a newly proposed rasterization-based neural renderer that only incurs a texture color query once for the initial ray interaction with the explicit surface, resulting in improved inference efficiency. NES describes dynamic human geometries with pose-dependent neural implicit surface deformation fields and their dynamic neural textures both in 2D space, which is a more memory-efficient alternative to traditional 3D methods, reducing redundancy and computational load. The comprehensive experiments show that NES performs similarly to previous 3D approaches, with greatly improved rendering speed and reduced memory cost.

MVFlow: Deep Optical Flow Estimation of Compressed Videos with Motion Vector Prior

  • Shili Zhou
  • Xuhao Jiang
  • Weimin Tan
  • Ruian He
  • Bo Yan

In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and have achieved promising results. However, they rarely consider that most videos are compressed and thus ignore the pre-computed information in compressed video streams. Motion vectors, one kind of compression information, record the motion of the video frames. They can be directly extracted from the compression code stream without computational cost and serve as a solid prior for optical flow estimation. Therefore, we propose an optical flow model, MVFlow, which uses motion vectors to improve the speed and accuracy of optical flow estimation for compressed videos. In detail, MVFlow includes a key Motion-Vector Converting Module, which ensures that the motion vectors can be transformed into the same domain as optical flow and then be fully utilized by the flow estimation module. Meanwhile, we construct four optical flow datasets for compressed videos containing frames and motion vectors in pairs. The experimental results demonstrate the superiority of our proposed MVFlow, which can reduce the AEPE by 1.09 compared to existing models or save 52% of the time needed to achieve similar accuracy to existing models.
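
For readers unfamiliar with the AEPE figure quoted above, the sketch below computes the average endpoint error as it is commonly defined: the mean Euclidean distance between predicted and ground-truth flow vectors. The flat list-of-pairs layout and the function name are assumptions made for brevity; real flow fields are 2D arrays of (u, v) vectors, and the abstract does not give the authors' evaluation code.

```python
import math


def average_endpoint_error(predicted_flow, ground_truth_flow):
    """Mean Euclidean distance between predicted and true flow vectors.

    Both arguments are equal-length sequences of (u, v) displacement pairs,
    one pair per pixel.
    """
    if len(predicted_flow) != len(ground_truth_flow) or len(predicted_flow) == 0:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(predicted_flow, ground_truth_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(predicted_flow)
```

Lower is better: an AEPE of 0 means every predicted vector matches the ground truth exactly, and the value is expressed in pixels of displacement.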

Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation

  • Ri Cheng
  • Xuhao Jiang
  • Ruian He
  • Shili Zhou
  • Weimin Tan
  • Bo Yan

Video frame interpolation (VFI) models apply the convolution operation to all locations, leading to redundant computations in regions with easy motion. A dynamic spatial pruning method can skip redundant computation, but without supervision it cannot properly identify easy regions in VFI tasks. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architecture to dynamically skip redundant computation for efficient frame interpolation. Specifically, pixels with low uncertainty indicate easy regions, where the calculation can be reduced without bringing undesirable visual results. Therefore, we utilize uncertainty-generated mask labels to guide our UGSP in properly locating the easy region. Furthermore, we propose a self-contrast training strategy that leverages an auxiliary non-pruning branch to improve the performance of our UGSP. Extensive experiments show that UGSP maintains performance but reduces FLOPs by 34%/52%/30% compared to the baseline without pruning on the Vimeo90K/UCF101/MiddleBury datasets. In addition, our method achieves state-of-the-art performance with lower FLOPs on multiple benchmarks.

Learning Generalized Representations for Open-Set Temporal Action Localization

  • Junshan Hu
  • Liansheng Zhuang
  • Weisong Dong
  • Shiming Ge
  • Shafei Wang

Open-set Temporal Action Localization (OSTAL) is a critical and challenging task that aims to recognize and temporally localize human actions in untrimmed videos in open-world scenarios. The main challenge in this task is the knowledge transfer from known actions to unknown actions. However, existing methods rely on limited training data and overparameterized deep neural networks, which generalize poorly. This paper proposes a novel Generalized OSTAL model (namely GOTAL) to learn generalized representations of actions. GOTAL utilizes a Transformer network to model actions and an open-set detection head to perform action localization and recognition. Benefiting from the Transformer's temporal modeling capabilities, GOTAL facilitates the extraction of human motion information from videos to mitigate the effects of irrelevant background data. Furthermore, a sharpness minimization algorithm is used to learn the network parameters of GOTAL, which facilitates the convergence of network parameters towards flatter minima by simultaneously minimizing the training loss value and the sharpness of the loss surface. The collaboration of the above components significantly enhances the generalization of the representation. Experimental results demonstrate that GOTAL achieves state-of-the-art performance on the THUMOS14 and ActivityNet1.3 benchmarks, confirming the effectiveness of our proposed method.
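
The abstract does not say which sharpness minimization algorithm is used. As one plausible reading, the sketch below shows a single update in the style of sharpness-aware minimization (SAM) on a toy quadratic loss: the gradient is evaluated at a nearby worst-case point and then applied to the original weights. The toy loss, the hyperparameter values, and the function names are all assumptions for illustration and are not taken from the paper.

```python
import math


def toy_loss(weights):
    """Toy quadratic loss with its minimum at the origin (illustrative only)."""
    return sum(w * w for w in weights)


def toy_gradient(weights):
    """Analytic gradient of toy_loss."""
    return [2.0 * w for w in weights]


def sam_step(weights, gradient_fn, learning_rate=0.1, rho=0.05):
    """One SAM-style update.

    1. Move a distance rho along the normalized gradient to reach an
       approximate worst-case neighbour of the current weights.
    2. Evaluate the gradient at that neighbour.
    3. Apply that gradient to the ORIGINAL weights.
    """
    gradient = gradient_fn(weights)
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0:
        return list(weights)  # already at a stationary point
    perturbed = [w + rho * g / norm for w, g in zip(weights, gradient)]
    perturbed_gradient = gradient_fn(perturbed)
    return [w - learning_rate * g for w, g in zip(weights, perturbed_gradient)]
```

Because the descent direction comes from the perturbed point, the update penalizes weights whose loss rises quickly within a radius of `rho`, which is what steers training toward flatter minima.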

Unambiguous Object Tracking by Exploiting Target Cues

  • Jie Gao
  • Bineng Zhong
  • Yan Chen

Siamese tracking exploits the template and the search region features to adaptively locate arbitrary objects in the tracking. A noteworthy issue is that both foreground and background mix in the template, and thus a tracker needs to learn what the target is and which pixels belong to it. However, existing trackers cannot effectively exploit the template information, resulting in a deficiency of target information and causing confusion for the tracker regarding which pixels belong to the target. To alleviate this issue, we propose UTrack, a simple and effective algorithm for unambiguous object tracking. UTrack utilizes long-term contextual information to propagate the appearance state of the target so as to explicitly model the apparent information of the target. Additionally, UTrack can resist the appearance change of the target by leveraging the target cues. Moreover, the proposed method uses the refined template to obtain more detailed information about the target and better understand which pixels belong to the target. Extensive experiments and comparisons with competitive trackers on challenging large-scale benchmarks show that our tracker can achieve state-of-the-art performances with real-time running. In particular, UTrack achieves 77.7% AO on GOT-10k.

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

  • Keran Wang
  • Hongtao Xie
  • Yuxin Wang
  • Dongming Zhang
  • Yadong Qu
  • Zuan Gao
  • Yongdong Zhang

Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) Limited annotated real data reduces the feature robustness. 2) Detectors perform poorly on text lacking visual information. In this paper, we explore the potential of the CLIP model, and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves the linguistic reasoning ability for text occlusion. Different from previous random pixel-level masking methods, MTM performs a targeted text-aware masking process in an unsupervised manner. Specifically, MTM consists of text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text area, and the Masked Text Modeling Module is used to reconstruct the masked texts. Through this reconstruction, MTM obtains the ability to reason about the linguistic information of masked texts. The robust feature extraction learned by MTM ensures a more discriminative representation for text lacking visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that our MTM can boost the performance of existing text detectors.

Object Part Parsing with Hierarchical Dual Transformer

  • Jiamin Chen
  • Jianlou Si
  • Naihao Liu
  • Yao Wu
  • Li Niu
  • Chen Qian

Object part parsing involves segmenting objects into semantic parts, which has drawn great attention recently. The current methods ignore the specific hierarchical structure of the object, which can be used as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to explore the contribution of the typical structural priors of the object parts. HDTR first generates the pyramid multi-granularity pixel representations under the supervision of the object part parsing maps at different semantic levels and then assigns each region an initial part embedding. Moreover, HDTR generates an edge pixel representation to extend the capability of the network to capture detailed information. Afterward, we design a Hierarchical Part Transformer to upgrade the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer to infer the hierarchical information from the part embeddings to enrich the pixel representations. Note that both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. The experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP and Pascal Animal, demonstrate that our method sets a new state-of-the-art performance for object part parsing.

Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

  • Xugong Qin
  • Pengyuan Lyu
  • Chengquan Zhang
  • Yu Zhou
  • Kun Yao
  • Peng Zhang
  • Hailun Lin
  • Weiping Wang

Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-up segmentation-based methods have become mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning in which auxiliary tasks are utilized to enable the encoder to jointly learn robust features with the main task of per-pixel classification during optimization. For semantic representation learning, we propose global-dense semantic contrast (GDSC), in which a vector is extracted for global semantic representation, then used to perform element-wise contrast with the dense grid features. To learn instance-aware representation, we propose to combine top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. With the proposed GDSC and TDM, the encoder network learns a stronger representation without introducing any additional parameters or computation during inference. Equipped with a very light decoder, the detector can achieve more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method can outperform or be comparable to the state-of-the-art on both accuracy and speed. Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.

Towards Flexible and Universal: A Novel Endpoint-based Framework for Vessel Structural Information Extraction

  • Xiyao Ma
  • Shiqi Liu
  • Xiaoliang Xie
  • Xiaohu Zhou
  • Zengguang Hou
  • Xinkai Qu
  • Wenzheng Han
  • Ming Wang
  • Meng Song
  • Linsen Zhang

In computer-assisted intravascular interventional surgery, extracting detailed information of target vessels from X-ray angiographic images can be meaningful in improving safety and effectiveness. However, large amounts of effort have been dedicated to segmenting the whole blood vessels from the background while ignoring the internal structure, which limits clinical application. In this paper, we propose a flexible and universal endpoint-based framework for vessel structural information extraction. The framework first localizes all the endpoints of target vessel segments through a Coarse-to-Fine Keypoint Detection Network (CFKD-Net), in which the designed Multi-branch Feature Aggregation (MFA) module captures both in-patch and cross-patch information to help recognize the points of interest based on global structure. A novel MaskMSELoss is also proposed to disambiguate irrelevant responses. Then a designed VEssel Segmentation and Analysis (VESA) algorithm generates the segmentation mask and morphological analysis for each vessel segment based solely on the endpoints. It can also be flexibly applied to analyze variant blood vessels that are not pre-defined. Extensive experiments on two different coronary artery datasets consistently demonstrate that this framework can achieve state-of-the-art detection performance and successfully extract and analyze target vessel segments. Since the framework shows excellent performance on the coronary arteries with severe deformation and strong noise, it is highly promising for analyzing other vascular images.

FDCNet: Feature Drift Compensation Network for Class-Incremental Weakly Supervised Object Localization

  • Sejin Park
  • Taehyung Lee
  • Yeejin Lee
  • Byeongkeun Kang

This work addresses the task of class-incremental weakly supervised object localization (CI-WSOL). The goal is to incrementally learn object localization for novel classes using only image-level annotations while retaining the ability to localize previously learned classes. This task is important because annotating bounding boxes for all newly incoming data is expensive, although object localization is crucial in various applications. To the best of our knowledge, we are the first to address this task. Thus, we first present a strong baseline method for CI-WSOL by adapting the strategies of class-incremental classifiers to mitigate catastrophic forgetting. These strategies include applying knowledge distillation, maintaining a small data set from previous tasks, and using cosine normalization. We then propose the feature drift compensation network to compensate for the effects of feature drifts on class scores and localization maps. Since updating network parameters to learn new tasks causes feature drifts, compensating for the final outputs is necessary. Finally, we evaluate our proposed method by conducting experiments on two publicly available datasets (ImageNet-100 and CUB-200). The experimental results demonstrate that the proposed method outperforms other baseline methods.
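
Of the baseline strategies listed above, cosine normalization is the simplest to show in isolation. The sketch below assumes the usual formulation from class-incremental classification: each class score is a scaled cosine similarity between the feature and the class weight vector, so classes whose weights have larger norms gain no advantage. The function name and the `scale` value are hypothetical, and the abstract does not confirm that the paper uses this exact form.

```python
import math


def cosine_logits(feature, class_weights, scale=10.0):
    """Class scores as scaled cosine similarity (assumed formulation)."""

    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    f_norm = norm(feature)
    logits = []
    for w in class_weights:
        dot = sum(a * b for a, b in zip(feature, w))
        logits.append(scale * dot / (f_norm * norm(w)))
    return logits


# An "old" class with a small weight norm and a "new" class with a large one.
old_w, new_w = [1.0, 0.0], [6.0, 6.0]
feature = [2.0, 0.5]
print(cosine_logits(feature, [old_w, new_w]))
```

A plain dot product would score the new class higher here (15.0 against 2.0) purely because of its larger weight norm, while the cosine scores rank the old class first because the feature points in its direction.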

Collaborative Learning of Diverse Experts for Source-free Universal Domain Adaptation

  • Meng Shen
  • Yanzuo Lu
  • Yanxu Hu
  • Andy J. Ma

Source-free universal domain adaptation (SFUniDA) is a challenging yet practical problem that adapts the source model to the target domain in the presence of distribution and category shifts without accessing source domain data. Most existing methods are developed based on a single-expert target model for both known- and unknown-class data training, such that the known- and unknown-class data in the target domain may not be separated well from each other. To address this issue, we propose a novel Collaborative Learning of Diverse Experts (CoDE) method for SFUniDA. In our method, unknown-class compatible source model training is designed to reserve space for the potential target unknown-class data. Two diverse experts are learned to better recognize the target known- and unknown-class data respectively by the specialized entropy discrimination. We improve the transferability of both experts by collaboratively correcting the possible misclassification errors with consistency and diversity learning. The final prediction with high confidence is obtained by gating the diverse experts based on soft neighbor density. Extensive experiments on four publicly available benchmarks demonstrate the superiority of our method compared to the state of the art.

Read Ten Lines at One Glance: Line-Aware Semi-Autoregressive Transformer for Multi-Line Handwritten Mathematical Expression Recognition

  • Wentao Yang
  • Zhe Li
  • Dezhi Peng
  • Lianwen Jin
  • Mengchao He
  • Cong Yao

Handwritten Mathematical Expression Recognition (HMER) plays a critical role in various applications, such as digitized education and scientific research. Although existing methods have achieved promising performance on publicly available datasets, they still struggle to recognize multi-line mathematical expressions (MEs), suffering from complex structures and slow inference speed. To address these issues, we propose a Line-Aware Semi-autoregressive Transformer (LAST) that treats multi-line mathematical expression sequences as two-dimensional dual-end structures. The proposed LAST utilizes a line-wise dual-end decoding strategy to decode multi-line mathematical expressions in parallel and perform dual-end decoding within each line. Specifically, we introduce a line-aware positional encoding module and a line-partitioned dual-end mask to endow LAST with line order awareness and directionality. Additionally, we adopt a shared-task optimization strategy to train LAST in both autoregressive and semi-autoregressive tasks. To evaluate the effectiveness of our approach in real-world scenarios, we have built a new Multi-line Mathematical Expression dataset (M2E), which, to the best of our knowledge, is the first of its kind and has the largest number of character categories, the largest number of character samples, and the longest average sequence length among existing ME datasets. Experimental results on both the M2E dataset and publicly available datasets demonstrate the effectiveness of our proposed method. Notably, our semi-autoregressive decoding approach achieves significantly faster decoding speeds while still achieving state-of-the-art performance compared to the existing methods.

Beyond Domain Gap: Exploiting Subjectivity in Sketch-Based Person Retrieval

  • Kejun Lin
  • Zhixiang Wang
  • Zheng Wang
  • Yinqiang Zheng
  • Shin'ichi Satoh

Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by cameras and therefore needs to be retrieved using subjective information (e.g., sketches from witnesses). Previous research defines this sketch-based case as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. However, subjectivity is another significant challenge. We model and investigate it by introducing a new dataset with multi-witness descriptions. It has two key features. 1) Large-scale. It contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style. Our dataset offers multiple sketches for each identity. Witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch styles. We further propose two novel designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity. We propose a non-local (NL) fusion module that gathers sketches from different witnesses for the same identity. 2) Introducing objectivity. An AttrAlign module utilizes attributes as an implicit mask to align cross-domain features. To advance Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate our leading performance in these benchmarks. The dataset and code are publicly available at:

Rethinking Pseudo-Label-Based Unsupervised Person Re-ID with Hierarchical Prototype-based Graph

  • Ben Sha
  • Baopu Li
  • Tao Chen
  • Jiayuan Fan
  • Tao Sheng

Unsupervised person re-identification (Re-ID) aims to match individuals without manual annotations. However, existing methods often struggle with intra-class variations due to differences in person poses and camera styles such as resolution and environment information. Additionally, clustering may produce incorrect pseudo-labels, compounding the issue. To address these challenges, we propose a novel hierarchical prototype-based graph network (HPG-Net) for unsupervised person Re-ID. Our approach uses a hierarchical prototype-based graph structure to describe person images by attributes of poses and camera styles, with each graph node representing the average of image features as a prototype. We then apply a hierarchical contrastive learning module to enhance the feature learning at each level, reducing the impact of intra-class differences caused by extraneous attributes. We also calculate the similarity between samples and each level of prototypes, maintaining prototype-based graph consistency with the mean-teacher network to mitigate the accumulation errors caused by pseudo-labels. Experimental results on three benchmarks show that our method outperforms state-of-the-art (SOTA) works. Moreover, we achieve promising performance on an occluded dataset.
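
The abstract describes each graph node as the average of image features and compares samples against those prototypes. A minimal sketch of just that building block follows, assuming features are plain vectors grouped by a clustering pseudo-label and that similarity is cosine similarity. Both assumptions and all names are illustrative; the hierarchy over pose and camera-style attributes is not modelled here.

```python
import math
from collections import defaultdict


def class_prototypes(features, pseudo_labels):
    """Average the feature vectors that share a pseudo-label."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(features, pseudo_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # Define similarity with a zero vector as 0.0 instead of dividing by zero.
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
pseudo_labels = [0, 0, 1]
prototypes = class_prototypes(features, pseudo_labels)
sample = [1.0, 0.1]
similarities = {k: cosine_similarity(sample, p) for k, p in prototypes.items()}
print(prototypes, similarities)
```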

Single Domain Generalization via Unsupervised Diversity Probe

  • Kehua Guo
  • Rui Ding
  • Tian Qiu
  • Xiangyuan Zhu
  • Zheng Wu
  • Liwei Wang
  • Hui Fang

Single domain generalization (SDG) is a realistic yet challenging domain generalization scenario that aims to generalize a model trained on a single domain to multiple unseen domains. Typical SDG methods are essentially supervised data augmentation strategies, which tend to enhance the novelty rather than the diversity of augmented samples. Insufficient diversity may jeopardize the model generalization ability. In this paper, we propose a novel adversarial method, termed Unsupervised Diversity Probe (UDP), to synthesize novel and diverse samples in fully unsupervised settings. More specifically, to ensure that samples are novel, we study SDG from an information-theoretic perspective that minimizes the uncertainty coefficients between synthesized and source samples. Considering that the variation in a single source domain is limited, we introduce a regularization imposed on the auxiliary module that synthesizes variable samples, incorporated with uncertainty coefficients in an adversarial manner to complement the diversity. Subsequently, an available region is utilized to guarantee the samples' safety. For the network architecture, we design a simple probe module that can synthesize samples in several different aspects. UDP is an unsupervised and easy-to-implement method that solves SDG using only synthetic (source) samples, thus reducing the dependence on task models. Extensive experiments on three benchmark datasets show that UDP achieves remarkable results and outperforms existing supervised and unsupervised methods by a large margin in single domain generalization.

PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer

  • Ruijin Liu
  • Ning Lu
  • Dapeng Chen
  • Cheng LI
  • Zejian Yuan
  • Wei Peng

We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, the Polynomial Band (PB). The representation has four polynomial curves to fit a text's top, bottom, left, and right sides, which can capture a text with a complex shape by varying polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-points-based methods need to utilize a different number of points. 2) It can distinguish adjacent or overlapping texts as they have apparently different curve coefficients, while segmentation-based or points-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. The simple operation can help detect small-scale texts and is compatible with the one-stage DETR framework, where no NMS postprocessing is needed. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces the piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on the arbitrary-shaped text datasets. Code will be made public.
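
To make the fixed-parameter property concrete, the sketch below assumes that each side of the band is a polynomial in one image coordinate (top and bottom as y = f(x), left and right as x = g(y)) and that contour points are obtained by evaluating the curve at evenly spaced positions. The abstract does not give the exact parameterization, so this layout, the coefficient values, and the function names are hypothetical.

```python
def eval_poly(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def sample_side(coeffs, start, end, n_points, horizontal=True):
    """Sample n_points (n_points >= 2) along one side of the band.

    horizontal=True treats the curve as y = f(x); otherwise as x = g(y).
    """
    points = []
    for i in range(n_points):
        t = start + (end - start) * i / (n_points - 1)
        v = eval_poly(coeffs, t)
        points.append((t, v) if horizontal else (v, t))
    return points


# Hypothetical coefficients for a gently arched text line, 100 pixels wide.
band = {
    "top": [10.0, 0.2, -0.002],
    "bottom": [30.0, 0.2, -0.002],
}
top_points = sample_side(band["top"], 0.0, 100.0, n_points=5)
dense_top = sample_side(band["top"], 0.0, 100.0, n_points=50)
print(top_points)
```

The same three coefficients yield either 5 or 50 contour points, which is the contrast with polygon representations that the abstract draws: sampling density is chosen at read-out time and does not change the number of predicted parameters.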

DANet: Multi-scale UAV Target Detection with Dynamic Feature Perception and Scale-aware Knowledge Distillation

  • Houzhang Fang
  • Zikai Liao
  • Lu Wang
  • Qingshan Li
  • Yi Chang
  • Luxin Yan
  • Xuhua Wang

Detection of multi-scale infrared unmanned aerial vehicle (UAV) targets (IRUTs) under dynamic scenarios remains a challenging task due to weak target features, varying shapes and poses, and complex background interference. Current detection methods find it difficult to address the above issues accurately and efficiently. In this paper, we design a dynamic attentive network (DANet) incorporating a scale-adaptive feature enhancement mechanism (SaFEM) and an attention-guided cross-weighting feature aggregator (ACFA). The SaFEM adaptively adjusts the network's receptive fields at hierarchical network levels leveraging separable deformable convolution (SDC), which enhances the network's multi-scale IRUT awareness. The ACFA, modulated by two crossing attention mechanisms, strengthens structural and semantic properties on neighboring levels for the accurate representation of multi-scale IRUT features from different levels. A plug-and-play anti-distractor contrastive regularization (ADCR) is also imposed on our DANet, which enforces similarity on features of targets and distractors from a new uncompressed feature projector (UFP) to increase the network's anti-distractor ability in complex backgrounds. To further increase the multi-scale UAV detection performance of DANet while maintaining its efficiency superiority, we propose a novel scale-specific knowledge distiller (SSKD) based on a divide-and-conquer strategy. For the "divide" stage, we intentionally construct three task-oriented teachers to learn tailored knowledge for small-, medium-, and large-scale IRUTs. For the "conquer" stage, we propose a novel element-wise attentive distillation module (EADM), where we employ a pixel-wise attention mechanism to highlight teacher and student IRUT features, and incorporate IRUT-associated prior knowledge for the collaborative transfer of refined multi-scale IRUT features to our DANet. Extensive experiments on real infrared UAV datasets demonstrate that our DANet is able to detect multi-scale UAVs with a satisfactory balance between accuracy and efficiency.

A Unified Query-based Paradigm for Camouflaged Instance Segmentation

  • Bo Dong
  • Jialun Pei
  • Rongrong Gao
  • Tian-Zhu Xiang
  • Shuo Wang
  • Huan Xiong

Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximum suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at:

Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation

  • Jialun Pei
  • Zhangjun Zhou
  • Yueming Jin
  • He Tang
  • Pheng-Ann Heng

High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN offers several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using 1024×1024 input, our model enables real-time inference at 65.3 fps with ResNet-18. The source code is available at

Exploring High-Correlation Source Domain Information for Multi-Source Domain Adaptation in Semantic Segmentation

  • Yuxiang Cai
  • Meng Xi
  • Yongheng Shang
  • Jianwei Yin

Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to one target domain. Although multiple source domains contain more complementary information than a single source domain, MSDA involves some disturbed source samples, which will degrade the adaptation performance. To solve this problem, we propose a novel MSDA method for semantic segmentation. Specifically, to fully explore the optimal source samples for the target domain, we propose a novel correlation measurement mechanism, weighing domain-level source-target correlation (DSC) and pixel-level source-target correlation (PSC). For each pair of source and target domains, DSC and PSC estimate the source-target correlations via the distances between target class prototypes and source class prototypes, and between target class prototypes and every pixel of source features, respectively. Built upon PSC, we propose a novel mix-up strategy, which pastes high-correlation source pixels onto target images, to construct augmented mixed images for adaptation. Then we train the segmentor on the mixed images with pseudo labels and labeled source images, with DSC and PSC to suppress the negative effects of the low-correlation source domains and pixels. Furthermore, an attentive prototype alignment loss, based on DSC, is proposed to align target and multi-source domains, which attaches more importance to high-correlation source domains. The experimental results on the representative benchmark datasets (i.e., GTA5 and SYNTHIA → Cityscapes) highlight that our method substantially outperforms the state-of-the-art single-source domain adaptation and MSDA methods.

Deep Image Harmonization in Dual Color Spaces

  • Linfeng Tan
  • Jiangtong Li
  • Li Niu
  • Liqing Zhang

Image harmonization is an essential step in image composition that adjusts the appearance of the composite foreground to address the inconsistency between foreground and background. Existing methods primarily operate in the correlated RGB color space, leading to entangled features and limited representation ability. In contrast, a decorrelated color space (e.g., Lab) has decorrelated channels that provide disentangled color and illumination statistics. In this paper, we explore image harmonization in dual color spaces, which supplements entangled RGB features with disentangled L, a, b features to alleviate the workload in the harmonization process. The network comprises an RGB harmonization backbone, an Lab encoding module, and an Lab control module. The backbone is a U-Net network translating the composite image to the harmonized image. Three encoders in the Lab encoding module extract three control codes independently from the L, a, b channels, which are used to manipulate the decoder features in the harmonization backbone via the Lab control module. Our code and model are available at
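
For readers unfamiliar with the Lab space, the sketch below converts a single sRGB pixel to CIE L*a*b* through linear RGB and XYZ, using the standard sRGB transfer function and a D65 white point. The abstract does not specify which conversion the authors use, so this is the textbook formula and not necessarily their preprocessing; the function name is hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert sRGB components in [0, 1] to CIE L*a*b* (D65 white point)."""

    def linearize(c):
        # Undo the sRGB gamma encoding.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear RGB to CIE XYZ (sRGB primaries, D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    # Normalize by the D65 reference white before applying f.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    lightness = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b_chan = 200.0 * (fy - fz)
    return lightness, a, b_chan


print(srgb_to_lab(1.0, 0.0, 0.0))
```

The L channel carries lightness while a and b carry the green-red and blue-yellow opponent axes, which is the separation of illumination and color that the abstract refers to as disentangled statistics.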

Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

  • Wenyu Zhang
  • Xin Deng
  • Baojun Jia
  • Xingtong Yu
  • Yifan Chen
  • Jin Ma
  • Qing Ding
  • Xinming Zhang

Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss (ℒ_lca) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing the performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at

Where and How: Mitigating Confusion in Neural Radiance Fields from Sparse Inputs

  • Yanqi Bao
  • Yuxin Li
  • Jing Huo
  • Tianyu Ding
  • Xinyue Liang
  • Wenbin Li
  • Yang Gao

Neural Radiance Fields from Sparse inputs (NeRF-S) have shown great potential in synthesizing novel views with a limited number of observed viewpoints. However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as "CONFUSION" during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: "WHERE" and "HOW". To this end, we present a novel learning framework, WaH-NeRF, which effectively mitigates confusion by tackling the following challenges: (i) "WHERE" to Sample? in NeRF-S: we introduce a Deformable Sampling strategy and a Weight-based Mutual Information Loss to address sample-position confusion arising from the limited number of viewpoints; and (ii) "HOW" to Predict? in NeRF-S: we propose a Semi-Supervised NeRF learning Paradigm based on pose perturbation and a Pixel-Patch Correspondence Loss to alleviate prediction confusion caused by the disparity between training and testing viewpoints. By integrating our proposed modules and loss functions, WaH-NeRF outperforms previous methods under the NeRF-S setting. Code is available

One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer

  • Hang Guo
  • Tao Dai
  • Mingyan Zhu
  • Guanghao Meng
  • Bin Chen
  • Zhi Wang
  • Shu-Tao Xia

Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-stage recognition. Although this pipeline is straightforward and intuitive, it has to use an additional super-resolution network, which causes inefficiencies during training and testing. Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring the knowledge from the high-resolution. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness. Code is available at
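
The abstract names a soft logits loss learned from the soft teacher label but does not define it. As a generic illustration, the sketch below assumes the classic knowledge distillation form: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names and the temperature value are hypothetical, and the paper's word-level and sequence-level variants are not reproduced.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logits_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (assumed form).

    Assumes moderate logit values so no probability underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]   # e.g. logits from the high-resolution teacher
student = [2.0, 1.5, 0.5]   # e.g. logits from the low-resolution student
print(soft_logits_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.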

Calibration-based Dual Prototypical Contrastive Learning Approach for Domain Generalization Semantic Segmentation

  • Muxin Liao
  • Shishun Tian
  • Yuhang Zhang
  • Guoguang Hua
  • Wenbin Zou
  • Xia Li

Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learned from the source domain by PCL need to be aligned with the prototypes of other domains simultaneously. However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. It contains an uncertainty-guided PCL (UPCL) and a hard-weighted PCL (HPCL). Since the domain discrepancies of the prototypes of different classes may be different, we propose an uncertainty probability matrix to represent the domain discrepancies of the prototypes of all the classes. The UPCL estimates the uncertainty probability matrix to calibrate the weights of the prototypes during the PCL. Moreover, considering that the prototypes of different classes may be similar in some circumstances, which means these prototypes are hard-aligned, the HPCL is proposed to generate a hard-weighted matrix to calibrate the weights of the hard-aligned prototypes during the PCL. Extensive experiments demonstrate that our approach achieves superior performance over current approaches on domain generalization segmentation tasks. The source code will be released at
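The notion of a prototype as the class-wise central value, and of calibrating prototype weights inside a contrastive loss, can be sketched as follows. This is a minimal illustration under stated assumptions: the weighting scheme (a per-class multiplier on each prototype's term in an InfoNCE-style loss), the function names, and the toy features are hypothetical and do not reproduce the UPCL or HPCL formulations.

```python
import math

def class_prototypes(features, labels):
    """Return {class: mean feature vector} for the samples of one domain."""
    sums, counts = {}, {}
    for feat, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(feat))
        for i, v in enumerate(feat):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def weighted_pcl_loss(feat, label, prototypes, weights, temperature=0.1):
    """-log( w_y * exp(s_y / T) / sum_c w_c * exp(s_c / T) ), where s_c is the
    cosine similarity to prototype c. The positive weights w_c are an assumed
    stand-in for a calibration matrix."""
    logits = {c: cosine(feat, p) / temperature + math.log(weights[c])
              for c, p in prototypes.items()}
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits.values()))
    return log_z - logits[label]

# Toy 2-D features for two classes (hypothetical values).
protos = class_prototypes([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
                          ["road", "road", "sky"])
uniform = {"road": 1.0, "sky": 1.0}
loss_correct = weighted_pcl_loss([1.0, 0.0], "road", protos, uniform)
loss_wrong = weighted_pcl_loss([1.0, 0.0], "sky", protos, uniform)
```

With uniform weights the loss reduces to an ordinary prototype-based InfoNCE; lowering the weight of an unreliable prototype shrinks its influence on the denominator.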

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

  • Wentian Xin
  • Qiguang Miao
  • Yi Liu
  • Ruyi Liu
  • Chi-Man Pun
  • Cheng Shi

Vision Transformer, which performs well in various vision tasks, encounters a bottleneck in skeleton-based action recognition and falls short of advanced GCN-based methods. The root cause is that the current skeleton transformer depends on the self-attention mechanism of the complete channel of the global joint, ignoring the highly discriminative differential correlation within the channel, so it is challenging to learn the expression of the multivariate topology dynamically. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture to effectively represent the physical correlations and temporal interactivity of the compact skeleton data. Two essential components make up the proposed framework: 1) Spatial MixFormer. The channel-grouping and mix-attention are utilized to calculate the dynamic multivariate topological relationships. Compared with the full-channel self-attention method, Spatial MixFormer better highlights the channel groups' discriminative differences and the joint adjacency's interpretable learning. 2) Temporal MixFormer, which consists of Multiscale Convolution, Temporal Transformer and Sequential Holding Module. The multivariate temporal models ensure the richness of global difference expression and realize the discrimination of crucial intervals in the sequence, thereby enabling more effective learning of long and short-term dependencies in actions. Our Skeleton MixFormer demonstrates state-of-the-art (SOTA) performance across seven different settings on four standard datasets, namely NTU-60, NTU-120, NW-UCLA, and UAV-Human. Related code will be available on

Mask Again: Masked Knowledge Distillation for Masked Video Modeling

  • Xiaojie Li
  • Shaowei He
  • Jianlong Wu
  • Yue Yu
  • Liqiang Nie
  • Min Zhang

Masked video modeling has shown remarkable performance in downstream tasks by predicting masked video tokens from visible ones. However, training models from scratch on large-scale unlabeled data remains computationally challenging and time-consuming. Moreover, the commonly used random-based sampling techniques may lead to the selection of redundant or low-information regions, hindering the model from learning discriminative representations within the limited training epochs. To achieve efficient pre-training, we propose MaskAgain, an efficient feature-based knowledge distillation framework for masked video pre-training that facilitates knowledge transfer from a pre-trained teacher model to a student model. In contrast to previous approaches that align all visible token features with the teacher model at output layers, MaskAgain adopts a selective approach by masking visible tokens again at both the hidden and output layers of the transformer block. Attention mechanisms are utilized for informative feature selection. At the hidden level, attention maps generated by the transformer's multi-head attention structure are utilized to select crucial token information at both temporally-global and temporally-local levels. Additionally, at the output level, an activation-based attention map is generated using token features, enabling us to focus on important tokens while preserving feature similarity and the relationship matrix similarity between patches. Extensive experimental results show that MaskAgain achieves comparable or even better performance than existing methods on benchmark datasets with much fewer training epochs and much less memory, which demonstrates that MaskAgain allows for efficient pre-training of accurate video models, reducing computational resources and training time significantly. Code is released at

Human-Object-Object Interaction: Towards Human-Centric Complex Interaction Detection

  • Mingxuan Zhang
  • Xiao Wu
  • Zhaoquan Yuan
  • Qi He
  • Xiang Huang

Localizing and recognizing interactive actions in videos is a pivotal yet intricate task that paves the way towards profound video comprehension. Recent advancements in Human-Object Interaction (HOI) detection, which involve detecting and localizing the interactions between human and object pairs, have undeniably marked significant progress. However, the realm of human-object-object interaction, an essential aspect of real-world industrial applications, remains largely uncharted. In this paper, we introduce a novel task referred to as Human-Object-Object Interaction (HOOI) detection and present a cutting-edge method named the Human-Object-Object Interaction Network (H2O-Net). The proposed H2O-Net is comprised of two principal modules: sequential motion feature extraction and HOOI modeling. The former module delves into the gradually evolving visual characteristics of entities throughout the HOOI process, harnessing spatial-temporal features across multiple fine-grained partitions. Conversely, the latter module aspires to encapsulate HOOI actions through intricate interactions between entities. It commences by capturing and amalgamating two sub-interaction features to extract comprehensive HOOI features, subsequently refining them using the interaction cues embedded within the long-term global context. Furthermore, we contribute to the research community by constructing a new video dataset, dubbed the HOOI dataset. The actions encompassed within this dataset pertain to pivotal operational behaviors in industrial manufacturing, imbuing it with substantial application potential and serving as a valuable addition to the existing repertoire of interaction action detection datasets. Experimental evaluations conducted on the proposed HOOI and widely-used AVA datasets demonstrate that our method outperforms existing state-of-the-art techniques by margins of 6.16 mAP and 1.9 mAP, respectively, thus substantiating its effectiveness.

On the Importance of Spatial Relations for Few-shot Action Recognition

  • Yilun Zhang
  • Yuqian Fu
  • Xingjun Ma
  • Lizhe Qi
  • Jingjing Chen
  • Zuxuan Wu
  • Yu-Gang Jiang

Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer knowledge from a source dataset to a novel target dataset with only one or a few labeled videos. However, existing methods mainly focus on modeling the temporal relations between the query and support videos while ignoring the spatial relations. In this paper, we find that the spatial misalignment between objects also occurs in videos, notably more common than the temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. Particularly, a novel Spatial Alignment Cross Transformer (SA-CT) which learns to re-adjust the spatial relations and incorporates the temporal information is contributed. Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal based methods on 3/4 benchmarks. To further incorporate the temporal information, we propose a simple yet effective Temporal Mixer module. The Temporal Mixer enhances the video representation and improves the performance of the full SA-CT model, achieving very competitive results. In this work, we also exploit large-scale pretrained models for few-shot action recognition, providing useful insights for this research direction.

CgT-GAN: CLIP-guided Text GAN for Image Captioning

  • Jiarui Yu
  • Haoran Li
  • Yanbin Hao
  • Bin Zhu
  • Tong Xu
  • Xiangnan He

The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" real visual modality. Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded based on the caption naturalness to human language calculated from the GAN's discriminator and the semantic guidance reward computed by the CLIP-based reward module. In addition to the cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at
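The two semantic rewards can be pictured with plain vectors standing in for CLIP embeddings. In the sketch below, the cosine reward compares the image and caption embeddings directly, while the aggregated reward compares the caption with a softmax-weighted sum of corpus text embeddings. Using the image embedding as the attention query, the temperature value, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def aggregate_corpus(image_emb, corpus_embs, temperature=0.1):
    """Softmax-attention-weighted sum of corpus text embeddings,
    with the image embedding as the query (assumed design)."""
    scores = [cosine(image_emb, t) / temperature for t in corpus_embs]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(image_emb)
    return [sum(w * t[i] for w, t in zip(weights, corpus_embs)) for i in range(dim)]

def cos_reward(image_emb, caption_emb):
    """Reward in the style of CLIP-cos: image-caption cosine similarity."""
    return cosine(image_emb, caption_emb)

def agg_reward(image_emb, caption_emb, corpus_embs):
    """Reward in the style of CLIP-agg: caption vs. aggregated corpus embedding."""
    return cosine(caption_emb, aggregate_corpus(image_emb, corpus_embs))

# Toy 2-D embeddings (hypothetical values, not real CLIP outputs).
image = [1.0, 0.0]
corpus = [[1.0, 0.0], [0.0, 1.0]]
good_caption, bad_caption = [1.0, 0.0], [0.0, 1.0]
r_good = agg_reward(image, good_caption, corpus)
r_bad = agg_reward(image, bad_caption, corpus)
```

In this toy setting a caption aligned with the image-relevant part of the corpus receives a reward near 1, and an unrelated caption a reward near 0.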

Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning

  • Xiaojie Li
  • Jianlong Wu
  • Shaowei He
  • Shuo Kang
  • Yue Yu
  • Liqiang Nie
  • Min Zhang

Self-supervised learning methods have shown significant promise in acquiring robust spatiotemporal representations from unlabeled videos. In this work, we address three critical limitations in existing self-supervised video representation learning: 1) insufficient utilization of contextual information and lifelong memory, 2) lack of fine-grained visual concept alignment, and 3) neglect of the feature distribution gap between encoders. To overcome these limitations, we propose a novel memory-enhanced predictor that leverages key-value memory networks with separate memories for the online and target encoders. This design enables the effective storage and retrieval of contextual knowledge, facilitating informed predictions and enhancing overall performance. Additionally, we introduce a visual concept alignment module that ensures fine-grained alignment of shared semantic information across segments of the same video. By employing coupled dictionary learning, we effectively decouple visual concepts, enriching the semantic representation stored in the memory networks. Our proposed approach is extensively evaluated on widely recognized benchmarks for action recognition and retrieval tasks, demonstrating its superiority in learning generalized video representations with significantly improved performance compared to existing state-of-the-art self-supervised learning methods. Code is released at

Train One, Generalize to All: Generalizable Semantic Segmentation from Single-Scene to All Adverse Scenes

  • Ziyang Gong
  • Fuhao Li
  • Yupeng Deng
  • Wenjun Shen
  • Xianzheng Ma
  • Zhenming Ji
  • Nan Xia

Unsupervised Domain Adaptation (UDA) for semantic segmentation has received widespread attention for its ability to transfer knowledge from the source to target domains without a high demand for annotations. However, semantic segmentation under adverse conditions still poses significant challenges for autonomous driving, as bad weather observation data may introduce unforeseeable problems. Although previous UDA works are devoted to adverse scene tasks, their adaptation process is redundant. For instance, unlabeled snow scene training data is a must for the model to achieve fair segmentation performance in snowy scenarios. We propose calling this type of adaptation process the Single to Single (STS) strategy. Clearly, STS is time-consuming and may show weaknesses in some comprehensive scenes, such as a night scene of sleet. Motivated by the concept of Domain Generalization (DG), we propose the Single to All (STA) model. Unlike DG, which trains models on one or multiple source domains without target domains, the STA model is based on UDA and employs one source domain, one target domain, and one introduced domain to achieve generalization to all adverse conditions by training on a single-scene dataset. Specifically, the STA model is advantageous as it learns from the source domain, reserves the style factors via a Reservation domain, and adapts the unified factors by the Randomization module. An Output Space Refusion module is also further incorporated to strengthen STA. Our STA achieves state-of-the-art performance in the Foggy Driving benchmark and demonstrates great domain generalizability in all conditions of the ACDC and Foggy Zurich benchmarks.

All-in-one Multi-degradation Image Restoration Network via Hierarchical Degradation Representation

  • Cheng Zhang
  • Yu Zhu
  • Qingsen Yan
  • Jinqiu Sun
  • Yanning Zhang

The aim of image restoration is to recover high-quality images from distorted ones. However, current methods usually focus on a single task (e.g., denoising, deblurring or super-resolution) which cannot address the needs of real-world multi-task processing, especially on mobile devices. Thus, developing an all-in-one method that can restore images from various unknown distortions is a significant challenge. Previous works have employed contrastive learning to learn the degradation representation from observed images, but this often leads to representation drift caused by deficient positive and negative pairs. To address this issue, we propose a novel All-in-one Multi-degradation Image Restoration Network (AMIRNet) that can effectively capture and utilize accurate degradation representation for image restoration. AMIRNet learns a degradation representation for unknown degraded images by progressively constructing a tree structure through clustering, without any prior knowledge of degradation information. This tree-structured representation explicitly reflects the consistency and discrepancy of various distortions, providing a specific clue for image restoration. To further enhance the performance of the image restoration network and overcome domain gaps caused by unknown distortions, we design a feature transform block (FTB) that aligns domains and refines features with the guidance of the degradation representation. We conduct extensive experiments on multiple distorted datasets, demonstrating the effectiveness of our method and its advantages over state-of-the-art restoration methods both qualitatively and quantitatively.

NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos

  • Ziyu Yang
  • Sucheng Ren
  • Zongwei Wu
  • Nanxuan Zhao
  • Junle Wang
  • Jing Qin
  • Shengfeng He

Non-photorealistic videos are in demand with the wave of the metaverse, but they lack sufficient research. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (i.e., saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill the gap left by the absence of a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and high-quality videos; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating state-of-the-art performance on our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future work. Our dataset and code can be found at

LandmarkGait: Intrinsic Human Parsing for Gait Recognition

  • Zengbin Wang
  • Saihui Hou
  • Man Zhang
  • Xu Liu
  • Chunshui Cao
  • Yongzhen Huang
  • Shibiao Xu

Gait recognition is an emerging biometric technology for identifying pedestrians based on their unique walking patterns. In past gait recognition, global-based methods are inadequate to meet the growing demand for accuracy, while commonly used part-based methods provide coarse and inaccurate feature representation for specific body parts. Human parsing appears to be a better option for accurately representing specific and complete body parts in gait recognition. However, its practical application in gait recognition is often hindered by missing RGB modality, lack of annotated body parts, and difficulty in balancing parsing quantity and quality. To address these issues, we propose LandmarkGait, an accessible and alternative parsing-based solution for gait recognition. LandmarkGait introduces an unsupervised landmark discovery network to transform the dense silhouette into a finite set of landmarks with remarkable consistency across various conditions. By grouping landmark subsets corresponding to distinct body part regions, following a reconstruction task and further refinement from high-quality input silhouettes, we can directly obtain fine-grained parsing results from original binary silhouettes in an unsupervised manner. Moreover, we also develop a multi-scale feature extractor that simultaneously captures global and parsing feature representations based on the integrity and flexibility of specific body parts. Extensive experiments demonstrate that our LandmarkGait can extract more stable features and exhibit significant performance improvement under all conditions, especially in various dressing conditions. Code is available at

Patchmatch Stereo++: Patchmatch Binocular Stereo with Continuous Disparity Optimization

  • Wenjia Ren
  • Qingmin Liao
  • Zhijing Shao
  • Xiangru Lin
  • Xin Yue
  • Yu Zhang
  • Zongqing Lu

Current deep-learning-based stereo matching algorithms achieve remarkably low error rates but they suffer from the edge ambiguity effect. The primary reason is that they treat disparity estimation as a labeling problem, constructing a cost volume based on uniform discrete pixel-wise labels. It is insufficient to model the continuous disparity probability distribution (DPD), which harms the accuracy of complex regions. Moreover, current cost aggregation strategies cannot process unstructured disparity candidates very well, which is one of the bottlenecks limiting continuous modeling. We propose Patchmatch Stereo++, inspired by the traditional Patchmatch Stereo to achieve better continuous disparity optimization in deep-learning-based methods. Firstly, to model accurate continuous DPD, we introduce an adaptive dense sub-pixel sampling strategy to binocular stereo and approximate a continuous unstructured DPD for every pixel. Secondly, we design a convolution-based optimizer that can accept unstructured disparity candidates to parse the above continuous DPD in an adaptive manner and perform updates accordingly. Extensive experiments demonstrate our method has the best performance among existing stereo matching networks at the edges, both quantitatively and qualitatively. At the time of submission, compared with published works pre-trained on SceneFlow, we rank 1st in the foreground of KITTI and 2nd on SceneFlow, ETH3D under various metrics. The source code will be released.

Consistency-aware Feature Learning for Hierarchical Fine-grained Visual Classification

  • Rui Wang
  • Cong Zou
  • Weizhong Zhang
  • Zixuan Zhu
  • Lihua Jing

Hierarchical Fine-Grained Visual Classification (HFGVC) assigns a label sequence (e.g., ["Albatross", "Laysan Albatross"]) with a coarse to fine hierarchy to each object. It remains challenging to achieve high accuracy and consistency due to the small inter-class difference, large intra-class variance, and difficulty in modeling relationships among classification tasks at different granularities. In this paper, we propose an effective Consistency-Aware Feature Learning (CAFL) method for HFGVC to improve prediction consistency and classification accuracy simultaneously. Our key idea is to encode the prediction consistency constraint into a weak supervision mechanism via forward deduction and backward induction over the label hierarchy. Furthermore, we develop a disentanglement and bidirectional reinforcement classification head to extract the features for the classifiers at different granularities. Together with the stop-gradient policy and attention mechanism, they enable each classifier to exploit the features from the ones at other granularities without suffering from their conflicting gradients in training. We evaluate our method on several commonly-used fine-grained public datasets, including CUB-200-2011, FGVC-Aircraft, and Stanford Cars. The results show that our method not only achieves state-of-the-art classification accuracy but also effectively reduces inconsistency errors by 50% under the hierarchical fine-grained classification setting.
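What counts as an inconsistency error can be made concrete with a small check: a coarse/fine prediction pair is consistent only when the predicted fine label's parent in the hierarchy equals the predicted coarse label. The sketch below only measures such errors; it does not implement the CAFL training mechanism. The toy hierarchy and all names are hypothetical.

```python
# Hypothetical two-level hierarchy: fine label -> coarse label.
PARENT = {
    "Laysan Albatross": "Albatross",
    "Sooty Albatross": "Albatross",
    "Indigo Bunting": "Bunting",
}

def is_consistent(coarse, fine):
    """True when the fine prediction is a child of the coarse prediction."""
    return PARENT.get(fine) == coarse

def inconsistency_rate(predictions):
    """Fraction of (coarse, fine) prediction pairs that violate the hierarchy."""
    bad = sum(1 for coarse, fine in predictions if not is_consistent(coarse, fine))
    return bad / len(predictions)

preds = [
    ("Albatross", "Laysan Albatross"),  # consistent
    ("Bunting", "Sooty Albatross"),     # inconsistent: parent is "Albatross"
]
rate = inconsistency_rate(preds)
```

Note that a pair can be consistent and still wrong: the check says nothing about whether either label matches the ground truth.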

FSR-Net: Deep Fourier Network for Shadow Removal

  • Jun Yu
  • Peng He
  • Ziqi Peng

The presence of shadows degrades the performance of various multimedia tasks. Image shadow removal aims at restoring the background of shadow regions, which is generally an open challenge. Unlike most existing deep learning-based methods that focus on restoring such degradations in the spatial domain, we introduce a novel shadow removal method that also exploits frequency domain information. Specifically, we first revisit the frequency characteristics of shadow images via Fourier transform, where amplitude components contain most lightness information and phase components are related to structure information. To this end, we propose a two-stage deep Fourier shadow removal network (FSR-Net) to enhance the brightness of shadow regions, and correspondingly improve the shadow removal performance of whole images. Each stage consists of an amplitude recovery network and a phase recovery network to progressively reconstruct the lightness and structure components. To facilitate the learning of these two representations, we introduce the frequency and spatial interaction blocks to process the local spatial features and the global frequency information separately. Extensive experiments demonstrate that FSR-Net achieves better results than other approaches with fewer parameters. For example, our method obtains a 1.05 dB improvement on the ISTD [34] dataset over the previous state-of-the-art method [43] with 0.30M parameters.
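The amplitude/phase split referred to above can be tried on a toy one-dimensional signal using only the standard library. The sketch below takes a discrete Fourier transform, separates amplitude and phase, scales the amplitude, and transforms back; the structure of the signal is preserved while its overall intensity changes. It is a hand-written illustration of the decomposition, not the FSR-Net architecture, and the signal values and helper names are assumptions.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def split(spectrum):
    """Separate a spectrum into amplitude and phase components."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def merge(amplitude, phase):
    """Rebuild a complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]

shadowed = [0.2, 0.4, 0.2, 0.1]           # toy "pixel row" (hypothetical values)
amp, pha = split(dft(shadowed))
restored = idft(merge(amp, pha))          # round trip recovers the input
brighter = idft(merge([2 * a for a in amp], pha))  # same phase, doubled amplitude
```

Uniformly doubling the amplitude doubles every sample while the peak stays at the same position, which is the simplest case of changing lightness without touching structure.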

Multi-Speed Global Contextual Subspace Matching for Few-Shot Action Recognition

  • Tianwei Yu
  • Peng Chen
  • Yuanjie Dang
  • Ruohong Huan
  • Ronghua Liang

Few-shot action recognition (FSAR) aims to classify unseen query actions into categories represented by a few labeled support videos. Most current FSAR methods adopt the frame-level matching mechanism that requires continuous actions to be represented by a fixed number of frame features. However, this could compromise the completeness of the contextual video information and make it difficult to handle video features of varying frame sampling speeds. In this paper, we propose a multi-speed global contextual subspace matching (MGCSM) method that generates global contextual action subspace representations from videos containing different numbers of frames to preserve contextual semantic information. Specifically, we propose to obtain the scale-agnostic information of embedding video features using a global contextual aggregation (GCA) module and then generate the discriminative action subspace representation with an action subspace generation (ASG) module. Furthermore, we introduce a multi-speed subspace matching (MSM) mechanism that generates a multi-speed classification score by integrating the similarities between query videos and support subspaces of varying sampling speeds. The proposed method is embedding-agnostic and can be combined with most mainstream embedding networks without model re-designs. Comprehensive and reproducible experiments on standard datasets demonstrate our method's superior performance compared to existing state-of-the-art methods.

Lightweight Super-Resolution Head for Human Pose Estimation

  • Haonan Wang
  • Jie Liu
  • Jie Tang
  • Gangshan Wu

Heatmap-based methods have become the mainstream method for pose estimation due to their superior performance. However, heatmap-based approaches suffer from significant quantization errors with downscale heatmaps, which result in limited performance and the detrimental effects of intermediate supervision. Previous heatmap-based methods relied heavily on additional post-processing to mitigate quantization errors. Some heatmap-based approaches improve the resolution of feature maps by using multiple costly upsampling layers to improve localization precision. To solve the above issues, we creatively view the backbone network as a degradation process and thus reformulate the heatmap prediction as a Super-Resolution (SR) task. We first propose the SR head, which predicts heatmaps with a spatial resolution higher than the input feature maps (or even consistent with the input image) by super-resolution, to effectively reduce the quantization error and the dependence on further post-processing. Besides, we propose SRPose to gradually recover the HR heatmaps from LR heatmaps and degraded features in a coarse-to-fine manner. To reduce the training difficulty of HR heatmaps, SRPose applies SR heads to supervise the intermediate features in each stage. In addition, the SR head is a lightweight and generic head that applies to top-down and bottom-up methods. Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based approaches.
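The quantization error from downscaled heatmaps can be seen with simple arithmetic. Assuming an ideal heatmap whose argmax falls in the cell nearest to the true keypoint, and decoding by multiplying the cell index by the stride, the localization error is bounded by half the stride. The sketch below is a simplified one-dimensional model with hypothetical names; it ignores pixel-center offsets and any sub-pixel post-processing.

```python
def decode_error(x_true, stride):
    """Absolute error (in input pixels) after snapping a keypoint coordinate
    to the nearest heatmap cell and mapping it back to input resolution."""
    cell = round(x_true / stride)      # argmax index on an idealized heatmap
    return abs(cell * stride - x_true)

def worst_case_error(stride):
    """Upper bound on the snapping error for a given stride."""
    return stride / 2

x = 101.3                                  # hypothetical keypoint x-coordinate
err_quarter_res = decode_error(x, stride=4)   # heatmap at 1/4 input resolution
err_full_res = decode_error(x, stride=1)      # heatmap at input resolution
```

Here the quarter-resolution heatmap yields an error of about 1.3 pixels against about 0.3 pixels at full resolution, which is the gap that predicting higher-resolution heatmaps is meant to close.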

Exploiting Time-Frequency Conformers for Music Audio Enhancement

  • Yunkee Chae
  • Junghyun Koo
  • Sungho Lee
  • Kyogu Lee

With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work. Audio samples enhanced with our system are available at:

Exploring Dual Representations in Large-Scale Point Clouds: A Simple Weakly Supervised Semantic Segmentation Framework

  • Jiaming Liu
  • Yue Wu
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma
  • Cai Xu

Existing work shows that 3D point clouds produce only about a 4% drop in semantic segmentation even at 1% random point annotation, which inspires us to further explore how to achieve better results at lower cost. Scene point clouds provide position and color information, which are often used in tandem as the only input, yet little work has gone into segmentation by fusing information from the two spaces. To optimize point cloud representations, we propose a novel framework for the dual representation query network (DRQNet). The proposed framework partitions the input point cloud into position and color spaces, using the separately extracted geometric structure and semantic context to create an internal supervisory mechanism that bridges the dual spaces and fuses the information. Adopting sparsely annotated points as the query set, DRQNet provides guidance and perceptual information for multi-stage point clouds through random sampling. Moreover, to differentiate and enhance the features generated by local neighbourhoods within multiple perceptual fields, we design a representation selection module to identify the contributions made by the position and color of each query point, and weight them adaptively according to reliability. The proposed DRQNet is robust to point cloud analysis and eliminates the effects of irregularities and disorder. Our method achieves significant performance gains on three mainstream benchmarks.

Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection

  • Keke Chen
  • Xiangbo Shu
  • Guo-Sen Xie
  • Rui Yan
  • Jinhui Tang

Spatio-temporal Action Detection (SAD) aims to recognize the multi-class actions, and meanwhile locate their spatio-temporal occurrence in untrimmed videos. Besides relying on the inherent inter-actor interactions, most previous SAD approaches model actor interactions between multi-actors and the whole frames or special parts (e.g., objects/hands). However, such approaches are relatively graceless by 1) roughly treating all various actors to equivalently interact with frames/parts or by 2) sumptuously borrowing multiple costly detectors to acquire the special parts. To solve the above dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed as FBI Learning) framework to learn the multi-actor features by attentively interacting with the hands-down foreground and background frames. Specifically, we first design a new Mask-guided Cross Attention (MCA) mechanism that calculates the masked cross-attentions to capture the compact relations between the actors and foreground/background regions. Next, we present a new Actor-guided Feature Aggregation (AFA) scheme that integrates foreground- and background-interacted actor features with the learnable actor-based weights. Finally, we construct a long-term feature bank that associates temporal context information to facilitate action classification. Extensive experiments are conducted on commonly available UCF101-24, MultiSports, and AVA v2.1/v2.2 datasets, which illustrate the competitive performance of FBI Learning against the state-of-the-art methods.

TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio

  • Xin Wang
  • Benyuan Meng
  • Hong Chen
  • Yuan Meng
  • Ke Lv
  • Wenwu Zhu

Knowledge graphs serve as a powerful tool to boost model performance for various applications covering computer vision, natural language processing, multimedia data mining, etc. The process of knowledge acquisition for humans is multimodal in essence, covering text, image, video and audio modalities. However, existing multimodal knowledge graphs fail to cover all these four elements simultaneously, severely limiting their expressive power in performance improvement for downstream tasks. In this paper, we propose TIVA-KG, a multimodal Knowledge Graph covering Text, Image, Video and Audio, which can benefit various downstream tasks. Our proposed TIVA-KG has two significant advantages over existing knowledge graphs in i) coverage of up to four modalities including text, image, video, audio, and ii) capability of triplet grounding which grounds multimodal relations to triples instead of entities. We further design a Quadruple Embedding Baseline (QEB) model to validate the necessity and efficacy of considering four modalities in KG. We conduct extensive experiments to test the proposed TIVA-KG with various knowledge graph representation approaches on the link prediction task, demonstrating the benefits and necessity of introducing multiple modalities and triplet grounding. TIVA-KG is expected to promote further research on mining multimodal knowledge graphs as well as the relevant downstream tasks in the community. TIVA-KG is now available at our website:

Enhancing Fake News Detection in Social Media via Label Propagation on Cross-modal Tweet Graph

  • Wanqing Zhao
  • Yuta Nakashima
  • Haiyuan Chen
  • Noboru Babaguchi

Fake news detection in social media has become increasingly important due to the rapid proliferation of personal media channels and the consequential dissemination of misleading information. Existing methods, which primarily rely on multimodal features and graph-based techniques, have shown promising performance in detecting fake news. However, they still face a limitation, i.e., sparsity in graph connections, which hinders capturing possible interactions among tweets. This challenge has motivated us to explore a novel method that densifies the graph's connectivity to better capture interactions among tweets. Our method constructs a cross-modal tweet graph using CLIP, which encodes images and text into a unified space, allowing us to extract potential connections based on similarities in text and images. We then design a Feature Contextualization Network with Label Propagation (FCN-LP) to model the interaction among tweets as well as positive or negative correlations between predicted labels of connected tweets. The propagated labels from the graph are weighted and aggregated for the final detection. To enhance the model's generalization ability to unseen events, we introduce a domain generalization loss that ensures consistent features between tweets on seen and unseen events. We use three publicly available fake news datasets, Twitter, PHEME, and Weibo, for evaluation. Our method consistently improves the performance over the state-of-the-art methods on all benchmark datasets and effectively demonstrates its aptitude for generalizing fake news detection in social media.
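
The general idea of linking tweets by embedding similarity and then propagating labels over the resulting graph can be sketched as follows. This is only an illustration under stated assumptions, not the authors' FCN-LP: the embeddings are plain lists standing in for CLIP features, the similarity threshold is a hypothetical parameter, and the propagation is a simple weighted average that models only positive correlations between connected tweets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(embeddings, threshold=0.8):
    """Connect every pair of nodes whose cosine similarity exceeds `threshold`.

    `threshold` is a hypothetical knob for this sketch; lowering it gives a
    denser graph.
    """
    n = len(embeddings)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(embeddings[i], embeddings[j])
            if w > threshold:
                neighbors[i].append((j, w))
                neighbors[j].append((i, w))
    return neighbors


def propagate_labels(neighbors, seed_labels, n_iter=20):
    """Similarity-weighted label propagation with clamped seed nodes.

    Scores lie in [0, 1]: 1.0 means fake, 0.0 means real, 0.5 means unknown.
    """
    scores = {i: seed_labels.get(i, 0.5) for i in neighbors}
    for _ in range(n_iter):
        updated = {}
        for i, nbrs in neighbors.items():
            if i in seed_labels or not nbrs:
                # Labeled nodes stay fixed; isolated nodes keep their score.
                updated[i] = scores[i]
                continue
            total = sum(w for _, w in nbrs)
            updated[i] = sum(w * scores[j] for j, w in nbrs) / total
        scores = updated
    return scores
```

With four toy embeddings `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]` and seeds `{0: 1.0, 2: 0.0}`, node 1 is linked only to node 0 and node 3 only to node 2, so each unlabeled node inherits the label of its labeled neighbor.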

Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation

  • Xingxing Yang
  • Jie Chen
  • Zaifeng Yang

Near-infrared (NIR) image spectrum translation is a challenging problem with many promising applications. Existing methods struggle with the mapping ambiguity between the NIR and the RGB domains, and generalize poorly due to the limitations of models' learning capabilities and the unavailability of sufficient NIR-RGB image pairs for training. To address these challenges, we propose a cooperative learning paradigm that colorizes NIR images in parallel with another proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors), dubbed CoColor. The complementary statistical and semantic spectrum information from these two task domains -- in the forms of pre-trained colorization networks -- is brought in as task domain priors. A bilateral domain translation module is subsequently designed, in which intermittent NIR images are generated from grayscale and colorized in parallel with authentic NIR images; and vice versa for the grayscale images. These intermittent transformations act as latent spectrum context priors for efficient domain knowledge exchange. We progressively fine-tune and fuse these modules with a series of pixel-level and feature-level consistency constraints. Experiments show that our proposed cooperative learning framework produces satisfactory spectrum translation outputs with diverse colors and rich textures, and outperforms state-of-the-art counterparts by 3.95dB and 4.66dB in terms of PSNR for the NIR and grayscale colorization tasks, respectively.
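
For readers unfamiliar with the metric quoted above, PSNR (peak signal-to-noise ratio) is derived from the mean squared error between a reference image and an estimate. The sketch below uses the standard definition; representing images as flat lists of pixel values and assuming an 8-bit peak of 255 are choices made for this illustration, not details from the paper.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equal-size images given as flat pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of `gain_db` on one image."""
    return 10.0 ** (gain_db / 10.0)
```

Because PSNR is logarithmic, a gain of a few dB on a single image corresponds to a multiplicative reduction in MSE; `mse_ratio_for_gain` makes that conversion explicit. Gains averaged over a whole test set do not translate into a single MSE ratio this directly.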

ALA: Naturalness-aware Adversarial Lightness Attack

  • Yihao Huang
  • Liangru Sun
  • Qing Guo
  • Felix Juefei-Xu
  • Jiayi Zhu
  • Jincao Feng
  • Yang Liu
  • Geguang Pu

Most researchers have tried to enhance the robustness of deep neural networks (DNNs) by revealing and repairing the vulnerability of DNNs with specialized adversarial examples. Some attack examples have imperceptible perturbations restricted by the Lp norm. However, due to their high-frequency property, such adversarial examples can be defended against by denoising methods and are hard to realize in the physical world. To avoid these defects, some works have proposed unrestricted attacks to gain better robustness and practicality. Unfortunately, these examples usually look unnatural and can alert the guards. In this paper, we propose Adversarial Lightness Attack (ALA), a white-box unrestricted adversarial attack that focuses on modifying the lightness of the images. The shape and color of the samples, which are crucial to human perception, are barely influenced. To obtain adversarial examples with a high attack success rate, we propose unconstrained enhancement in terms of the light and shade relationship in images. To enhance the naturalness of images, we craft the naturalness-aware regularization according to the range and distribution of light. The effectiveness of ALA is verified on two popular datasets for different tasks (i.e., ImageNet for image classification and Places-365 for scene recognition).

Neural Image Popularity Assessment with Retrieval-augmented Transformer

  • Liya Ji
  • Chan Ho Park
  • Zhefan Rao
  • Qifeng Chen

Since the advent of social media platforms, image selection based on social preference has been a challenging task that all users inherently undertake before sharing images with the public. In our user study for this problem, human choices of images based on perceived social preference are largely inaccurate (58.7% accuracy). The challenge of this task, also known as image popularity assessment, lies in its subjective nature caused by visual and non-visual factors. Especially in the social media setting, social feedback on a particular image largely differs depending on who uploads it. Therefore, a social preference model should be able to account for this user-specific aspect of the task. To address this issue, we present a retrieval-augmented approach that leverages both image features and user-specific statistics for neural image popularity assessment. User-specific statistics are derived by retrieving past images with their statistics from a memory bank. By combining these statistics with image features, our approach achieves 79.5% accuracy, which significantly outperforms humans and baseline models on the pairwise ranking of images from the Instagram Influencer Dataset. Our source code will be publicly available.

A Figure Skating Jumping Dataset for Replay-Guided Action Quality Assessment

  • Yanchao Liu
  • Xina Cheng
  • Takeshi Ikenaga

In competitive sports, judges often scrutinize replay videos from multiple views to adjudicate uncertain or contentious actions, and ultimately ascertain the definitive score. Most existing action quality assessment methods regress from a single video or from a pair of exemplar and input videos, and are therefore limited by the viewpoint and zoom scale of the videos. To this end, we construct a Replay Figure Skating Jumping dataset (RFSJ), containing additional view information provided by the post-match replay video and fine-grained annotations. We also propose a Replay-Guided approach for action quality assessment, learned by a Triple-Stream Contrastive Transformer and a Temporal Concentration Module. Specifically, besides the pairwise input and exemplar, we contrast the input and its replay by an extra contrastive module. Then the consistency of scores guides the model to learn features of the same action under different views and zoom scales. In addition, errors and highlight moments of athletes are crucial factors affecting scoring, and these moments are concentrated in parts of the video rather than uniformly distributed. The proposed temporal concentration module encourages the model to concentrate on these features, then cooperates with the contrastive regression module to obtain an effective scoring mechanism. Extensive experiments demonstrate that our method achieves a Spearman's Rank Correlation of 0.9346 on the proposed RFSJ dataset, improving over the existing state-of-the-art methods.
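
Spearman's rank correlation, the metric reported above, measures how well predicted scores preserve the ordering of the judges' scores. It is the Pearson correlation computed on ranks. The sketch below follows that standard definition with average ranks for ties; the function names are chosen for this illustration only.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(predicted, ground_truth):
    """Spearman's rank correlation between predicted and true scores."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))
```

Any monotonically increasing relation between predictions and true scores gives a correlation of 1, so the metric rewards correct ordering rather than exact score values.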

Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution

  • Yeying Jin
  • Beibei Lin
  • Wending Yan
  • Yuan Yuan
  • Wei Ye
  • Robby T. Tan

Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from the rendered glow pairs. Specifically, a light source aware network is proposed to detect light sources of night images, followed by the APSF (Angular Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution to capture edges and textures in hazy scenes. By leveraging extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, which is then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images demonstrates the effectiveness of our method. Our experiments demonstrate that our method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods by 13% on the GTA5 nighttime haze dataset. Our data and code are available at:

Rethinking Voice-Face Correlation: A Geometry View

  • Xiang Li
  • Yandong Wen
  • Muqiao Yang
  • Jinglu Wang
  • Rita Singh
  • Bhiksha Raj

Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.

Dynamic Grouped Interaction Network for Low-Light Stereo Image Enhancement

  • Baiang Li
  • Huan Zheng
  • Zhao Zhang
  • Yang Zhao
  • Zhongqiu Zhao
  • Haijun Zhang

Low-Light Stereo Image Enhancement (LLSIE) tackles the challenge of improving the illumination and restoring the details in stereo images. However, existing deep learning-based LLSIE methods trained on high-resolution low-light images often exhibit sub-optimal performance when interacting with information from the left and right views. We find that this is because of: (1) the high computational cost arising from quadratic complexity, which hinders the enhancement model's ability to process high-resolution images; and (2) the limitations of conventional fusion strategies in previous work, which inadequately capture cross-view cues, resulting in weak feature representation and compromised detail recovery. To address these limitations, we propose a novel Dynamic Grouped Interaction Network (DGI-Net) to enhance illumination and recover more details while reducing the computational cost. Specifically, DGI-Net employs the U-Net structure, which effectively mitigates noise during the low-light enhancement. Furthermore, we design a Grouped Stereo Interaction Module (GSIM) with a grouping strategy to efficiently discover cross-view cues while minimizing computations. To dynamically fuse stereo information and fully exploit cross-view correlations, we also introduce a Dynamic Embedding Module (DEM) to establish dynamic connections between inter-view cues and intra-view features, which performs dynamic weight processing on cross-view cues to eliminate noise during fusion. For intra-view processing, we present a Diversity Enhanced Block (DEB) to extract multi-scale features, thereby improving diversity and feature representation. This multi-scale feature extraction also addresses low image contrast in dark lighting conditions. Experimental results demonstrate that DGI-Net outperforms current state-of-the-art methods in low-light stereo image enhancement.

PVG: Progressive Vision Graph for Vision Recognition

  • JiaFu Wu
  • Jian Li
  • Jiangning Zhang
  • Boshen Zhang
  • Mingmin Chi
  • Yabiao Wang
  • Chengjie Wang

Convolution-based and Transformer-based vision backbone networks process images into grid or sequence structures, respectively, which are inflexible for capturing irregular objects. Though Vision GNN (ViG) adopts graph-level features for complex images, it has some issues, such as inaccurate neighbor node selection, expensive node information aggregation calculation, and over-smoothing in the deep layers. To address the above problems, we propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. Compared with previous works, PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC) to introduce second-order similarity by gradually increasing the channel of the global graph branch and decreasing the channel of the local branch as the layer deepens; 2) a neighbor node information aggregation and update module that uses Max pooling and mathematical Expectation (MaxE) to aggregate rich neighbor information; 3) Graph error Linear Unit (GraphLU) to enhance low-value information in a relaxed form, reducing the compression of image detail information and alleviating over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the superiority of PVG over state-of-the-art methods, e.g., our PVG-S obtains 83.0% Top-1 accuracy on ImageNet-1K, surpassing GNN-based ViG-S by +0.9↑ with the parameters reduced by 18.5%, while the largest PVG-B obtains 84.2%, a +0.5↑ improvement over ViG-B. Furthermore, our PVG-S obtains +1.3↑ box AP and +0.4↑ mask AP gains over ViG-S on the COCO dataset.
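
As a rough illustration of aggregating neighbor information with both max pooling and an expectation, the sketch below computes, for each graph node, the element-wise maximum and the element-wise mean of its neighbors' features. How the two statistics are combined is an assumption of this sketch (they are simply concatenated), as are the function name and the list-based data layout; the abstract does not specify the actual MaxE formulation.

```python
def maxe_aggregate(features, neighbors):
    """Aggregate neighbor features with element-wise max and mean.

    features:  list of per-node feature vectors (lists of numbers).
    neighbors: list where neighbors[i] holds the indices of node i's neighbors.
    Returns one vector per node: [max part] + [mean part] (concatenation is
    an assumption made for this sketch).
    """
    aggregated = []
    for i, nbrs in enumerate(neighbors):
        # Isolated nodes fall back to their own feature vector.
        rows = [features[j] for j in nbrs] or [features[i]]
        dim = len(rows[0])
        max_part = [max(r[k] for r in rows) for k in range(dim)]
        mean_part = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]
        aggregated.append(max_part + mean_part)
    return aggregated
```

The maximum keeps the strongest response among neighbors, while the mean summarizes the neighborhood as a whole; using both retains more information than either statistic alone.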

StylePrompter: All Styles Need Is Attention

  • Chenyi Zhuang
  • Pan Gao
  • Aljosa Smolic

GAN inversion aims at inverting given images into corresponding latent codes for Generative Adversarial Networks (GANs), especially StyleGAN, where there exists a disentangled latent space that allows attribute-based image manipulation. As most inversion methods build upon Convolutional Neural Networks (CNNs), we innovatively transfer a hierarchical vision Transformer backbone to predict W+ latent codes at token level. We further apply a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) in ℱ space to refine the intermediate style features of the generator. By treating style features as queries to retrieve lost identity information from the encoder's feature maps, SMART can not only produce high-quality inverted images but also surprisingly adapt to editing tasks. We then prove that StylePrompter lies in a more disentangled W+ and show the controllability of SMART. Finally, quantitative and qualitative experiments demonstrate that StylePrompter can achieve desirable performance in balancing reconstruction quality and editability, and is "smart" enough to fit into most edits, outperforming other ℱ-involved inversion methods. Our code is available at:

Improving Federated Person Re-Identification through Feature-Aware Proximity and Aggregation

  • Pengling Zhang
  • Huibin Yan
  • Wenhui Wu
  • Shuoyao Wang

Person re-identification (ReID) is a challenging task that aims to identify individuals across multiple non-overlapping camera views. To enhance the performance and robustness of ReID models, it is crucial to train them over multiple data sources. However, the traditional centralized approach poses a significant challenge to privacy as it requires collecting data from distributed data owners. To overcome this challenge, we employ the federated learning approach, which enables distributed model training without compromising data privacy. In this paper, we propose a novel feature-aware local proximity and global aggregation method for federated ReID to extract robust feature representations. Specifically, we introduce a proximal term and a feature regularization term for local model training to improve local training accuracy while ensuring global aggregation convergence. Furthermore, we use the cosine distance of backbone features to determine the global aggregation weight of each local model. Our proposed method significantly improves the performance and generalization of the global model. Extensive experiments demonstrate the effectiveness of our proposal. Specifically, our method achieves an additional 27.3% Rank-1 average accuracy in federated full supervision and an extra 20.3% mean Average Precision (mAP) on DukeMTMC in federated domain generalization.
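
The two ingredients named above, a proximal term for local training and aggregation weights derived from the cosine distance of backbone features, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' method: the proximal penalty uses the common squared-distance form with a hypothetical coefficient `mu`, model parameters are flat lists, and the weights are taken to be proportional to each client's distance score. The abstract does not say how distances map to weights or which feature pairs are compared, so both are left to the caller here.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (nu * nv)


def proximal_penalty(local_params, global_params, mu=0.01):
    """(mu / 2) * ||w_local - w_global||^2, added to the local training loss.

    `mu` is a hypothetical coefficient; the penalty discourages a local model
    from drifting far from the current global model.
    """
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(local_params, global_params))


def aggregation_weights(distances):
    """Normalize per-client distance scores so that they sum to one.

    Assumption for this sketch: weight is proportional to distance. Uniform
    weights are used when all distances are zero.
    """
    total = sum(distances)
    if total == 0:
        return [1.0 / len(distances)] * len(distances)
    return [d / total for d in distances]


def aggregate(client_params, weights):
    """Weighted average of client parameter vectors."""
    dim = len(client_params[0])
    return [sum(w * p[k] for w, p in zip(weights, client_params)) for k in range(dim)]
```

In a training round under these assumptions, each client would minimize its task loss plus `proximal_penalty`, the server would compute one distance score per client with `cosine_distance`, and the new global model would be `aggregate(client_params, aggregation_weights(distances))`.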

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization

  • Xizhe Xue
  • Dongdong Yu
  • Lingqiao Liu
  • Yu Liu
  • Satoshi Tsutsui
  • Ying Li
  • Zehuan Yuan
  • Ping Song
  • Mike Zheng Shou

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. The mainstream approaches use a two-stage segmentation framework, which first locates the candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that the end-to-end training process in the single-stage framework can be more convenient for directly regularizing the localization of class-agnostic object pixels. Based on the transformer-based instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss could alleviate the problem of incomplete instance annotation - a common problem in the existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution to semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the AP_100 score by 4.75% in the UVO dataset → UVO dataset setting and 4.05% in the COCO dataset → UVO dataset setting.

Cross-Illumination Video Anomaly Detection Benchmark

  • Dongliang Zhu
  • Ruimin Hu
  • Shengli Song
  • Xiang Guo
  • Xixi Li
  • Zheng Wang

Video anomaly detection is a critical problem with widespread applications in domains such as security surveillance. Most existing methods focus on video anomaly detection tasks under uniform illumination conditions. However, in the real world, the situation is much more complicated. Video anomalies are widespread across periods and under different illumination conditions, which can lead to the detector model incorrectly reporting high anomaly scores. To address this challenge, we design a benchmark framework for the cross-illumination video anomaly detection task. The framework restores videos under different illumination scales to the same illumination scale. This reduces domain differences between uniformly illuminated training videos and differently illuminated test videos. Additionally, to demonstrate the illumination change problem and evaluate our model, we construct three large-scale datasets with a wide range of illumination variations. We experimentally validate our approach on three cross-illuminance video anomaly detection datasets. Experimental results show that our method outperforms existing methods regarding detection accuracy and is more robust.

Practical Edge Detection via Robust Collaborative Learning

  • Yuanbin Fu
  • Xiaojie Guo

Edge detection, as a core component in a wide range of vision-oriented tasks, aims to identify object boundaries and prominent edges in natural images. An edge detector should be both efficient and accurate for practical use. To achieve this goal, two key issues must be addressed: 1) how to liberate deep edge models from the inefficient pre-trained backbones leveraged by most existing deep learning methods, so as to save computational cost and cut model size; and 2) how to mitigate the negative influence of noisy or even wrong labels in training data, which widely exist in edge detection due to the subjectivity and ambiguity of annotators, so as to improve robustness and accuracy. In this paper, we attempt to simultaneously address the above problems by developing a collaborative learning based model, termed PEdger. The principle behind PEdger is that the information learned from different training moments and heterogeneous (recurrent and non-recurrent in this work) architectures can be assembled to explore robust knowledge against noisy annotations, even without the help of pre-training on extra data. Extensive ablation studies together with quantitative and qualitative experimental comparisons on the BSDS500 and NYUD datasets are conducted to verify the effectiveness of our design, and demonstrate its superiority over other competitors in terms of accuracy, speed, and model size.

MSECNet: Accurate and Robust Normal Estimation for 3D Point Clouds by Multi-Scale Edge Conditioning

  • Haoyi Xiu
  • Xin Liu
  • Weimin Wang
  • Kyoung-Sook Kim
  • Masashi Matsuoka

Estimating surface normals from 3D point clouds is critical for various applications, including surface reconstruction and rendering. While existing methods for normal estimation perform well in regions where normals change slowly, they tend to fail where normals vary rapidly. To address this issue, we propose a novel approach called MSECNet, which improves estimation in regions where normals vary by treating normal variation modeling as an edge detection problem. MSECNet consists of a backbone network and a multi-scale edge conditioning (MSEC) stream. The MSEC stream achieves robust edge detection through multi-scale feature fusion and adaptive edge detection. The detected edges are then combined with the output of the backbone network using the edge conditioning module to produce edge-aware representations. Extensive experiments show that MSECNet outperforms existing methods on both synthetic (PCPNet) and real-world (SceneNN) datasets while running significantly faster. We also conduct various analyses to investigate the contribution of each component in the MSEC stream. Finally, we demonstrate the effectiveness of our approach in surface reconstruction.

Efficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic Segmentation

  • Xiao Liu
  • Xiuya Shi
  • Lufei Chen
  • Linbo Qing
  • Chao Ren

In this work, we propose PMSDSEN, a parallel multi-scale encoder-decoder network architecture for semantic segmentation, inspired by the human visual perception system's ability to aggregate contextual information in various contexts and scales. Our approach introduces the efficient Parallel Multi-Scale Detail and Semantic Encoding (PMSDSE) unit to extract detailed local information and coarse large-range relationships in parallel, enabling the recognition of object boundaries and object-level areas. By stacking multiple PMSDSEs, our network learns fine-grained details and textures along with abstract category and semantic information, effectively utilizing a larger range of surrounding context information for robust segmentation. To further enhance the network's receptive field without increasing computational complexity, the Multi-Scale Semantic Extractor (MSSE) at the end of the encoder is utilized for multi-scale semantic context extraction and detailed information encoding. Additionally, the Dynamic Weighted Feature Fusion (DWFF) strategy is employed to integrate shallow layer detail information and deep layer semantic information during the decoder stage. Our method can obtain multi-scale context from local to global, efficiently spanning low-level feature extraction to high-level semantic interpretation at different scales and in different contexts. Without bells and whistles, PMSDSEN obtains a better trade-off between accuracy and complexity on popular benchmarks, including Cityscapes and CamVid. Specifically, PMSDSEN attains 73.2% mIoU with only 0.9M parameters on the Cityscapes test set. Codes and supplementary materials link:
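
The mIoU figure quoted above is the mean, over classes, of the intersection-over-union between predicted and ground-truth labels. The sketch below computes it for flat lists of per-pixel class indices. Skipping classes that appear in neither list is a convention chosen for this illustration; benchmark toolkits typically accumulate counts over an entire test set before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            # Classes absent from both prediction and target are skipped.
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.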

Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes

  • Jiquan Zhong
  • Xiaolin Huang
  • Xiao Yu

Monocular depth estimation is a fundamental task in computer vision and multimedia. The self-supervised learning pipeline makes it possible to train the monocular depth network without depth labels. In this paper, a multi-frame depth model with multi-scale feature fusion is proposed for strengthening texture features and spatial-temporal features, which improves the robustness of depth estimation between frames with large camera ego-motion. A novel dynamic object detecting method with geometry explainability is proposed. The detected dynamic objects are excluded during training, which guarantees the static environment assumption and relieves the accuracy degradation problem of the multi-frame depth estimation. Robust knowledge distillation with a consistent teacher network and reliability guarantee is proposed, which improves the multi-frame depth estimation without an increase in computational complexity at test time. The experiments show that our proposed methods achieve great performance improvement on the multi-frame depth estimation.

Peering into The Sketch: Ultra-Low Bitrate Face Compression for Joint Human and Machine Perception

  • Yudong Mao
  • Peilin Chen
  • Shurun Wang
  • Shiqi Wang
  • Dapeng Wu

We propose a novel face compression framework that leverages external priors for joint human and machine perception under ultra-low bitrate scenarios. The proposed framework leverages the semantic richness of face images by representing the faces as sketches and thumbnails, resulting in improved bitrate utility for both human and machine vision. At the decoder side, the framework introduces a two-stage generative reconstruction, which faithfully enhances the reconstructed image via semi-parametric modeling and retrieved guidance from the external database. In particular, this coarse-to-fine strategy also results in improved identity consistency and analysis performance of the reconstructed image. Extensive evaluations of the proposed method have been conducted on the public face dataset by comparing it with end-to-end image compression techniques as well as traditional image compression standards. The experimental results demonstrate the effectiveness of the proposed method via superior perceptual and analytical performance under ultra-low bitrate conditions.

MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization

  • Xiaodong Jin
  • Taiping Zhang

Temporal Action Localization (TAL) aims to predict the categories and temporal segments of all action instances in untrimmed videos, which is a critical and challenging task in the video understanding field. The performances of existing TAL methods remain unsatisfactory, due to the lack of highly effective temporal modeling and refined action proposal decoding. In this paper, we propose Multiscale Temporal Similarity Network (MTSN), a novel one-stage method for TAL, which mainly benefits from dynamic complementary modeling and temporal similarity decoding. Specifically, we first design Dynamic Complementary Context Aggregation (DCCA), a Transformer-based encoder. DCCA performs both long-range and short-range temporal modeling through different interaction range types of attention heads at each feature pyramid level, while higher-level semantic representations are effectively complemented with more short-range detail information in a dynamic fashion. Moreover, Temporal Similarity Mask (TSM) is designed to generate masks through an optimized globally-aware decoding process, including similarity cross-modeling, region-aware optimization and multiscale aggregated residual, which leads to high-quality action proposals. We conduct extensive experiments on two major TAL benchmarks: THUMOS14 and ActivityNet-1.3, where our method establishes a new state-of-the-art and significantly outperforms the previous best methods. Without bells and whistles, on THUMOS14, MTSN achieves an average mAP of 72.1% (+5.3%). On ActivityNet-1.3, MTSN reaches an average mAP of 40.7% (+3.1%), which crosses the 40% average mAP for the first time.

Disentangling Multi-view Representations Beyond Inductive Bias

  • Guanzhou Ke
  • Yang Yu
  • Guoqing Chao
  • Xiaoli Wang
  • Chenyang Xu
  • Shengfeng He

Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both interpretability and generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also found that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations drive us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically-consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing the upper bound of mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at

Implicit Decouple Network for Efficient Pose Estimation

  • Lei Zhao
  • Le Han
  • Min Yao
  • Nenggan Zheng

In the field of pose estimation, keypoint representations can take the form of Gaussian heatmaps, classification vectors, or direct coordinates. However, current networks are not consistent with these keypoint representations: they accommodate them only in the final layer, which results in suboptimal efficiency and requires a large number of parameters or computational resources. In this paper, we propose a simple yet efficient plug-and-play module, named the Implicit Decouple Module (IDM), which decouples features into two parts along the x-y axes and aggregates features in a direction-aware manner. This approach implicitly fuses direction-specific coordinate information, improving the consistency with the keypoint representations, especially in vector form. Furthermore, we introduce a fully convolutional backbone network, named the Implicit Decouple Network (IDN), which incorporates IDM without the need to maintain high-resolution features, dense multi-level feature fusion, or many repeated stages, while still achieving high performance. In experiments on the COCO dataset, our basic IDN without pre-training outperforms HRNet (28.5M) by 2.4 AP with 18.2M parameters, and even surpasses some transformer-based methods. In the lightweight model scenario, our model outperforms Lite-HRNet by 3.9 AP with only 2.5M parameters. We also evaluate our model on the person instance segmentation task and other datasets, demonstrating its generality and effectiveness.
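To make the three keypoint representations named in this abstract concrete, the following is a minimal illustrative sketch, not the authors' IDM or IDN. The function names, the one-hot form of the per-axis classification vectors, and the normalization of direct coordinates to [0, 1] are all assumptions chosen for illustration.

```python
import math

def gaussian_heatmap(x, y, width, height, sigma=1.0):
    """2D heatmap (list of rows) with a Gaussian bump centered on the keypoint (x, y)."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2)) for c in range(width)]
        for r in range(height)
    ]

def axis_vectors(x, y, width, height):
    """One classification vector per axis (assumed one-hot here): one over columns, one over rows."""
    vx = [1.0 if c == x else 0.0 for c in range(width)]
    vy = [1.0 if r == y else 0.0 for r in range(height)]
    return vx, vy

def direct_coordinates(x, y, width, height):
    """Direct regression target: coordinates normalized to [0, 1] (an assumed convention)."""
    return (x / (width - 1), y / (height - 1))

# The same keypoint at column 3, row 1 of a 6x4 grid, in all three forms.
heatmap = gaussian_heatmap(3, 1, 6, 4)
vx, vy = axis_vectors(3, 1, 6, 4)
coords = direct_coordinates(3, 1, 6, 4)
```

The per-axis vectors store width + height values instead of width x height, which is one reason a module that separates features along the x and y axes fits the vector form naturally.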

Occluded Skeleton-Based Human Action Recognition with Dual Inhibition Training

  • Zhenjie Chen
  • Hongsong Wang
  • Jie Gui

Recently, skeleton-based human action recognition has received widespread attention in the computer vision community. However, most existing research focuses on improving the recognition accuracy on complete skeleton data, while ignoring the performance on incomplete skeleton data with occlusion or noise. This paper addresses occlusion-robust and noise-robust skeleton-based action recognition and presents a novel Dual Inhibition Training strategy. Specifically, we propose the Part-aware and Dual-inhibition Graph Convolutional Network (PDGCN), which comprises three parts: Input Skeleton Inhibition (ISI), Part-Aware Representation Learning (PARL) and Predicted Score Inhibition (PSI). ISI and PSI are plug-and-play modules that encourage the model to learn discriminative features from diversified body joints by effectively simulating key body part occlusions and random occlusions. The PARL module learns both the global and local representations from the whole body and body parts, respectively, and progressively fuses them during representation learning to enhance the model robustness under occlusions. Finally, we design different settings for occluded skeleton-based human action recognition to study this problem in depth and better evaluate different approaches. Our approach achieves state-of-the-art results on different benchmarks and dramatically outperforms recent skeleton-based action recognition approaches, especially under large-scale temporal occlusion.

P2I-NET: Mapping Camera Pose to Image via Adversarial Learning for New View Synthesis in Real Indoor Environments

  • Xujie Kang
  • Kanglin Liu
  • Jiang Duan
  • Yuanhao Gong
  • Guoping Qiu

Given a new 6DoF camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive, while those based on learning have predominantly focused on isolated views of object categories with regular geometric structure. Differing from the traditional render-inpaint approach to new view synthesis in real indoor environments, we propose a conditional generative adversarial neural network (P2I-NET) to directly predict the new view from the given pose. P2I-NET learns the conditional distribution of the images of the environment to establish the correspondence between the camera pose and its view of the environment, and achieves this through a number of innovative designs in its architecture and training loss function. Two auxiliary discriminator constraints are introduced to enforce the consistency between the pose of the generated image and that of the corresponding real-world image in both the latent feature space and the real-world pose space. Additionally, a deep convolutional neural network (CNN) is introduced to further reinforce this consistency in the pixel space. We have performed extensive new view synthesis experiments on real indoor datasets. Results show that P2I-NET has superior performance against a number of strong NeRF-based baseline models. In particular, we show that P2I-NET is 40 to 100 times faster than these competitor techniques while synthesising images of similar quality. Furthermore, we contribute a new publicly available indoor environment dataset containing 22 high-resolution RGBD videos where each frame also has accurate camera pose parameters.

IRCasTRF: Inverse Rendering by Optimizing Cascaded Tensorial Radiance Fields, Lighting, and Materials From Multi-view Images

  • Wenpeng Xing
  • Jie Chen
  • Ka Chun Cheung
  • Simon See

We propose an inverse rendering pipeline that simultaneously reconstructs scene geometry, lighting, and spatially-varying material from a set of multi-view images. Specifically, the proposed pipeline involves volume and physics-based rendering, which are performed separately in two steps: exploration and exploitation. During the exploration step, our method utilizes the compactness of neural radiance fields and a flexible differentiable volume rendering technique to learn an initial volumetric field. Here, we introduce a novel cascaded tensorial radiance field method on top of the Canonical Polyadic (CP) decomposition to boost model compactness beyond conventional methods. In the exploitation step, a shading pass that incorporates a differentiable physics-based shading method is applied to jointly optimize the scene's geometry, spatially-varying materials, and lighting, using image reconstruction loss. Experimental results demonstrate that our proposed inverse rendering pipeline, IRCasTRF, outperforms prior works in inverse rendering quality. The final output is highly compatible with downstream applications like scene editing and advanced simulations. Further details are available on the project page:

Noise-Robust Continual Test-Time Domain Adaptation

  • Zhiqi Yu
  • Jingjing Li
  • Zhekai Du
  • Fengling Li
  • Lei Zhu
  • Yang Yang

Continual test-time domain adaptation (TTA) is a challenging topic in the field of source-free domain adaptation, which focuses on addressing cross-domain multimedia information during inference with a continuously changing data distribution. Previous methods have been found to lack noise robustness, leading to a significant increase in errors under strong noise. In this paper, we address the noise-robustness problem in continual TTA by offering three effective recipes to mitigate it. At the category level, we employ the Taylor cross-entropy loss to alleviate the low confidence category bias commonly associated with cross-entropy. At the sample level, we reweight the target samples based on uncertainty to prevent the model from overfitting on noisy samples. Finally, to reduce pseudo-label noise, we propose a soft ensemble negative learning mechanism to guide the model optimization using ensemble complementary pseudo labels. Our method achieves state-of-the-art performance on three widely used continual TTA datasets, particularly in the strong noise setting that we introduced.
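The abstract does not spell out the Taylor cross-entropy loss it uses. As a hedged illustration, the sketch below assumes the commonly used definition: the series of -log(p) around p = 1, truncated after a fixed number of terms. The function name and the `order` parameter are hypothetical, and the paper may use a different variant.

```python
import math

def taylor_cross_entropy(p_true, order=2):
    """Truncated Taylor series of -log(p) around p = 1: sum_{k=1..order} (1 - p)^k / k.

    p_true is the predicted probability of the (pseudo-)label class.
    Assumed standard form; not taken from the paper itself.
    """
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must be a probability in [0, 1]")
    return sum((1.0 - p_true) ** k / k for k in range(1, order + 1))

# With many terms the series approaches the ordinary cross-entropy -log(p).
approx = taylor_cross_entropy(0.5, order=50)
exact = -math.log(0.5)

# With few terms the loss stays bounded even when p_true is 0,
# whereas -log(p) grows without bound as p approaches 0.
bounded = taylor_cross_entropy(0.0, order=2)
```

Under this assumed form, a low truncation order caps the penalty for low-confidence classes, which is consistent with the abstract's stated goal of alleviating low-confidence category bias.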

TIRDet: Mono-Modality Thermal InfraRed Object Detection Based on Prior Thermal-To-Visible Translation

  • Zeyu Wang
  • Fabien Colonnier
  • Jinghong Zheng
  • Jyotibdha Acharya
  • Wenyu Jiang
  • Kejie Huang

Cross-modality images that combine visible-infrared spectra can provide complementary information for object detection. In particular, they are well-suited for autonomous vehicle applications in dark environments with limited illumination. However, it is time-consuming to acquire a large number of pixel-aligned visible-thermal image pairs, and real-time alignment is challenging in practical driving systems. Furthermore, the quality of visible-spectrum images can be adversely affected by complex environmental conditions. In this paper, we propose a novel neural network called TIRDet, which only utilizes Thermal InfraRed (TIR) images for mono-modality object detection. To compensate for the missing visible-band information, we adopt a prior Thermal-To-Visible (T2V) translation model to obtain the translated visible images and the latent T2V codes. In addition, we introduce a novel attention-based Cross-Modality Aggregation (CMA) module, which can augment the modality-translation awareness of TIRDet by preserving the T2V semantic information. Extensive experiments on the FLIR and LLVIP datasets demonstrate that TIRDet significantly outperforms all mono-modality detection methods based on thermal images, and it even surpasses most State-Of-The-Art (SOTA) multispectral methods using visible-thermal image pairs. Code is available at

HARP: Let Object Detector Undergo Hyperplasia to Counter Adversarial Patches

  • Junzhe Cai
  • Shuiyan Chen
  • Heng Li
  • Beihao Xia
  • Zimin Mao
  • Wei Yuan

Adversarial patches can mislead object detectors to produce erroneous predictions. To defend against adversarial patches, one can take two types of protections on the model side, including modifying the detector itself (e.g., adversarial training) or attaching a new model in front of the detector. However, the former often deteriorates clean performance of detectors, and the latter may have high deployment costs caused by too many training parameters. Inspired by the phenomenon of "bone hyperplasia" in human bodies, we present a novel model-side adversarial patch defense, called HARP (Hyperplasia based Adversarial Patch defense). Just as bone hyperplasia can enhance bone strength and skeletal stability, the hyperostosia of detectors can also help to resist adversarial patches. Following this idea, HARP chooses to improve adversarial robustness by "growing" lightweight CNN modules (i.e., hyperplasia modules) on the pre-trained object detectors. We conduct extensive experiments on the PASCAL VOC and COCO datasets to compare HARP with the data-side defense JPEG and the model-side defenses adversarial training, SAC and FNC. Experimental results show that HARP provides excellent defense against adversarial patches while maintaining clean performance, outperforming the compared defense methods. Under PGD-based adaptive attacks, HARP surpasses the recently proposed defense method SAC by 12.5% in mean average precision (mAP) on PASCAL VOC, and 13.2% on COCO dataset. In addition, experiments confirm that the increase in model inference time caused by HARP is almost negligible.

Scale-space Tokenization for Improving the Robustness of Vision Transformers

  • Lei Xu
  • Rei Kawakami
  • Nakamasa Inoue

The performance of the Vision Transformer (ViT) model and its variants in most vision tasks has surpassed traditional Convolutional Neural Networks (CNNs) in terms of in-distribution accuracy. However, ViTs still have significant room for improvement in their robustness to input perturbations. Furthermore, robustness is a critical aspect to consider when deploying ViTs in real-world scenarios. Despite this, some variants of ViT improve the in-distribution accuracy and computation performance at the cost of the model's robustness and generalization. In this study, inspired by prior findings on the potential effectiveness of shape bias for robustness improvement and the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, to improve the robustness of ViT while maintaining in-distribution accuracy. Based on this method, we build the Scale-space-based Robust Vision Transformer (SRVT) model. Our method consists of scale-space patch embedding and scale-space positional encoding. The scale-space patch embedding makes a sequence of variable-scale images and increases the model's shape bias to enhance its robustness. The scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR10/100 and ImageNet-1k) from the perspectives of in-distribution accuracy, adversarial robustness, and out-of-distribution robustness. The experimental results demonstrate our method's effectiveness in improving robustness without compromising in-distribution accuracy. In particular, our approach achieves advanced adversarial robustness on the ImageNet-1k benchmark compared with state-of-the-art robust ViTs.

Margin MCC: Chance-Robust Metric for Video Boundary Detection with Allowed Margin

  • Kosuke Mizufune
  • Shunsuke Tanaka
  • Toshihide Yukitake
  • Tatsushi Matsubayashi

Video boundary detection is the task of dividing a video into several segments based on event changes such as scenes or actions. The most common evaluation is to judge whether the distance between predicted boundaries and the ground truth boundaries is lower than the allowed margin and then compute the F1 score. However, we found that evaluation by the F1 measure alone can lead to wrong conclusions, since even a completely random model can achieve an inflated F1 when the number of predictions is large. To design a metric that is robust against chance, we propose the Margin Matthews Correlation Coefficient (MMCC) as an extension of the Matthews Correlation Coefficient (MCC) to video boundary detection with an allowed margin. Although MCC is a robust metric against chance, it is not obvious that the same is true in video boundary detection because of the allowed margin. Specifically, some definitions of MCC do not remain constant with respect to the number of predicted boundaries. Therefore, we design MMCC so that the expected MMCC for random guessing is zero, based on mathematical analysis. We empirically examine whether MMCC is robust against completely random guessing and oversegmentation/undersegmentation, whereas F1 is not.
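As a hedged illustration of the two ingredients this abstract builds on, the sketch below implements a margin-based F1 and the standard Matthews Correlation Coefficient from confusion counts. It is not the proposed MMCC, whose definition the abstract does not give. The greedy one-to-one matching and the inclusive `<= margin` test are assumptions, and the function names are hypothetical.

```python
import math

def margin_f1(predicted, ground_truth, margin):
    """F1 score where a prediction is a hit if it lies within `margin` of an
    unmatched ground-truth boundary (greedy one-to-one matching, an assumption)."""
    unmatched = sorted(ground_truth)
    true_positives = 0
    for p in sorted(predicted):
        for i, g in enumerate(unmatched):
            if abs(p - g) <= margin:
                true_positives += 1
                del unmatched[i]  # each ground-truth boundary can be matched once
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fp, fn, tn):
    """Standard Matthews Correlation Coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Two of three ground-truth boundaries are found within a margin of 3 frames.
score = margin_f1([10, 50], [12, 48, 90], margin=3)
```

Note that `margin_f1` never uses true negatives, while `mcc` does. With a margin, the count of true negatives is not obvious, which is one way to see why extending MCC to this setting requires a dedicated definition.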

Exploring the Knowledge Transferred by Response-Based Teacher-Student Distillation

  • Liangchen Song
  • Xuan Gong
  • Helong Zhou
  • Jiajie Chen
  • Qian Zhang
  • David Doermann
  • Junsong Yuan

Response-based Knowledge Distillation refers to the technique of supervising the student network with the teacher network's predictions. The method is motivated by the observation that the predicted probabilities reflect the relation among labels, which is the knowledge to be transferred. This paper explores the transferred knowledge from a novel perspective: comparing the knowledge transferred through different teachers. Two intriguing properties are observed. First, higher confidence scores of teachers' predictions lead to better distillation results, and second, teachers' incorrectly predicted training samples should be kept for distillation. We then analyze the phenomenon by studying teachers' decision boundaries, some of which can help the student generalize while others may not. Based on the observations, we further propose an embarrassingly simple distillation framework named Efficient Distillation, which is effective on ImageNet with different teacher-student pairs: when using ResNet34 as the teacher, the student ResNet18 trained from scratch reaches 74.07% Top-1 accuracy within 98 GPU hours (RTX 3090), outperforming the current state-of-the-art result (73.19%) by a large margin. Our code is available at
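For readers unfamiliar with response-based distillation, the sketch below shows the widely used temperature-softened KL formulation of supervising a student with a teacher's predictions. It is an assumed generic form, not the Efficient Distillation framework proposed in the paper. The function names and the default temperature are illustrative choices.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened probabilities, scaled by temperature^2.

    The temperature^2 factor is the usual convention for keeping gradient
    magnitudes comparable across temperatures (an assumption here).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl

# The loss is zero when the student reproduces the teacher's response
# and positive when the two disagree.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Because the target is the teacher's full probability vector rather than a one-hot label, the relative scores of non-target classes also reach the student, which is the "relation among labels" the abstract refers to.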

Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection

  • Feng Gao
  • Jiaxu Leng
  • Ji Gan
  • Xinbo Gao

DEtection TRansformer (DETR) and its variants (DETRs) have achieved impressive performance in general object detection. However, in crowded pedestrian detection, the performance of DETRs is still unsatisfactory because of an inappropriate sample selection method, which results in more false positives. To address this issue, we propose a simple but effective sample selection method for DETRs, Sample Selection for Crowded Pedestrians (SSCP), which consists of the constraint-guided label assignment scheme (CGLA) and the utilizability-aware focal loss (UAFL). Our core idea is to select learnable samples for DETRs and adaptively regulate the loss weights of samples based on their utilizability. Specifically, in CGLA, we propose a new cost function to ensure that only learnable positive training samples are retained and the rest are treated as negative training samples. Further, considering the utilizability of samples, we design UAFL to adaptively assign different loss weights to learnable positive samples depending on their gradient ratio and IoU. Experimental results show that the proposed SSCP effectively improves the baselines without introducing any overhead in inference. In particular, Iter Deformable DETR is improved to 39.7(-2.0)% MR on Crowdhuman and 31.8(-0.4)% MR on Citypersons.

Data-Efficient Masked Video Modeling for Self-supervised Action Recognition

  • Qiankun Li
  • Xiaolong Huang
  • Zhifan Wan
  • Lanqing Hu
  • Shuzhe Wu
  • Jie Zhang
  • Shiguang Shan
  • Zengfu Wang

Recently, self-supervised video representation learning based on Masked Video Modeling (MVM) has demonstrated promising results for action recognition. However, existing methods face two significant challenges: (1) video actions involve a crucial temporal dimension, yet current masking strategies adopt inefficient random approaches that undermine low-density dynamic motion clues in videos; (2) pre-training requires large-scale datasets and significant computing resources (including large batch sizes and enormous iterations). To address these issues, we propose a novel method named Data-Efficient Masked Video Modeling (DEMVM) for self-supervised action recognition. Specifically, a novel masking strategy named Flow-Guided Dense Masking (FGDM) is proposed to facilitate efficient learning by focusing more on the action-related temporal clues, which applies dense masking to dynamic regions based on optical flow priors, while sparse masking to background regions. Furthermore, DEMVM introduces a 3D video tokenizer to enhance the modeling of temporal clues. Finally, Progressive Masking Ratio (PMR) and 2D initialization strategies are presented to enable the model to adapt to the characteristics of the MVM paradigm during different training stages. Extensive experiments on multiple benchmarks, UCF101, HMDB51, and Mimetics, demonstrate that our method achieves state-of-the-art performance in the downstream action recognition task with both efficient data and low computational cost. More interestingly, the few-shot experiment on the Mimetics dataset shows that DEMVM can accurately recognize actions even in the presence of context bias.

DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

  • Teng Fu
  • Xiaocong Wang
  • Haiyang Yu
  • Ke Niu
  • Bin Li
  • Xiangyang Xue

Multiple object tracking (MOT) tends to become more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios when occlusions occur. Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture, so that our model can exhibit strong robustness and perform well under crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder to prevent the mutual suppression between neighboring trajectories under crowded scenes. Notably, the proposed method requires no additional modules like matching strategy and motion state estimation in inference. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.

Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion

  • Peiran Xu
  • Yadong Mu

Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. There are two factors closely related to the success of this task, namely consensus extraction and the dispersion of consensus to each image. Most previous works represent the group consensus using local features, while we instead utilize a hierarchical Transformer module for extracting semantic-level consensus. As a result, it can obtain a more comprehensive representation of the common object category and exclude interference from other objects that share local similarities with the target object. In addition, we propose a Transformer-based dispersion module that takes into account the variation of the co-salient object in different scenes. It distributes the consensus to the image feature maps in an image-specific way while making full use of interactions within the group. These two modules are integrated with a ViT encoder and an FPN-like decoder to form an end-to-end trainable network, without additional branches or auxiliary losses. The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance.

BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data

  • Xuenan Xu
  • Zhiling Zhang
  • Zelin Zhou
  • Pingyue Zhang
  • Zeyu Xie
  • Mengyue Wu
  • Kenny Q. Zhu

Compared with ample visual-text pre-training research, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods incorporate the visual modality as a pivot for audio-text pre-training, which inevitably induces data noise. In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality, so that potential noise from modality mismatch is eliminated. Furthermore, we propose caption generation under the guidance of AudioSet tags, leading to more accurate captions. With the above two improvements, we curate high-quality, large-scale parallel audio-text data, based on which we perform audio-text pre-training. We comprehensively demonstrate the performance of the pre-trained model on a series of downstream audio-related tasks, including single-modality tasks like audio classification and tagging, as well as cross-modal tasks consisting of audio-text retrieval and audio-based text generation. Experimental results indicate that our approach achieves state-of-the-art zero-shot classification performance on most datasets, suggesting the effectiveness of our synthetic data. The audio encoder also serves as an efficient pattern recognition model when fine-tuned on audio-related tasks. The code, checkpoints, synthetic data and pre-trained models are available online.

A Simple Baseline for Open-World Tracking via Self-training

  • Bingyang Wang
  • Tanlin Li
  • Jiannan Wu
  • Yi Jiang
  • Huchuan Lu
  • You He

Open-World Tracking (OWT) presents a challenging yet emerging problem, aiming to track every object of any category. Different from traditional Multi-Object Tracking (MOT), OWT needs to additionally track targets beyond predefined categories in the training set. To address the problem, we propose a simple baseline, SimOWT. We simplify the recently proposed OWT algorithm by streamlining the association module and accelerating the inference speed. By leveraging the self-training paradigm, SimOWT can distinguish unknown-class targets from the background, fully unleashing the potential of TAO-OW dataset. Furthermore, we enhance SimOWT from the perspectives of Pseudo Boxes Merging and Re-Weighting, thereby discovering more targets belonging to unknown classes and reducing the sensitivity of the model to low-quality pseudo-labels. Benefiting from the proposed approaches, SimOWT demonstrates a significant improvement in tracking performance on unknown classes. Moreover, the comprehensive experiments on the TAO-OW benchmark demonstrate that our model outperforms the state-of-the-art OWT method, OWTB, with an absolute gain of 11.2% OWTA and 16.4% detection recall respectively on unknown classes. The code is released at

VTLayout: A Multi-Modal Approach for Video Text Layout

  • Yuxuan Zhao
  • Jin Ma
  • Zhongang Qi
  • Zehua Xie
  • Yu Luo
  • Qiusheng Kang
  • Ying Shan

The rapid explosion of video distribution is accompanied by a massive amount of video text, which encompasses rich information about the video content. While previous research has primarily focused on text extraction from videos like text detection, tracking, recognition and end to end spotting, the layout of video text has received limited attention. As different text categories convey distinct meanings, video text layout is critical for video understanding tasks such as video summarization and shooting environment comprehension. To bridge the gap between video OCR and understanding, we explore the study of video text layout in this work. We first optimize the layout annotation of the BOVText, a bilingual, open-world video text dataset, by expanding text categories and defining five clear categories: scene, subtitle, title, logo, and other. Additionally, we rectify the original unreasonable layout annotation based on these definitions. We also propose a Video-level Text Layout model (VTLayout) to address the layout problem, which fuses textual, visual, and spatial-temporal embedding of video text trajectories. To the best of our knowledge, this is the first method to tackle text layout on video level. Our method outperforms image-level layout methods across all text categories and exhibits faster inference speed. This study underscores the significance of video text layout in video understanding and offers an effective solution to this challenge. Our annotation is available at

SEAR: Semantically-grounded Audio Representations

  • Rajat Hebbar
  • Digbalay Bose
  • Shrikanth Narayanan

Audio supports visual story-telling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, and background context. Visual captions provide a condensed view of an image, providing a natural language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, on the order of 9.6M captions, to learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally, we show that the learned model can be transferred in a zero-shot manner through application in both movie understanding tasks and general action recognition.

DocDiff: Document Enhancement via Residual Diffusion Models

  • Zongyuan Yang
  • Baolin Liu
  • Yongping Xiong
  • Lan Yi
  • Guibin Wu
  • Xiaojun Tang
  • Ziqi Liu
  • Junjie Zhou
  • Xing Zhang

Removing degradation from document images not only improves their visual quality and readability, but also enhances the performance of numerous automated document analysis and recognition tasks. However, existing regression-based methods optimized for pixel-level distortion reduction tend to suffer from significant loss of high-frequency information, leading to distorted and blurred text edges. To compensate for this major deficiency, we propose DocDiff, the first diffusion-based framework specifically designed for diverse challenging document enhancement problems, including document deblurring, denoising, and removal of watermarks and seals. DocDiff consists of two modules: the Coarse Predictor (CP), which is responsible for recovering the primary low-frequency content, and the High-Frequency Residual Refinement (HRR) module, which adopts the diffusion models to predict the residual (high-frequency information, including text edges), between the ground-truth and the CP-predicted image. DocDiff is a compact and computationally efficient model that benefits from a well-designed network architecture, an optimized training loss objective, and a deterministic sampling process with short time steps. Extensive experiments demonstrate that DocDiff achieves state-of-the-art (SOTA) performance on multiple benchmark datasets, and can significantly enhance the readability and recognizability of degraded document images. Furthermore, our proposed HRR module in pre-trained DocDiff is plug-and-play and ready-to-use, with only 4.17M parameters. It greatly sharpens the text edges generated by SOTA deblurring methods without additional joint training. Available codes:

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World

  • Boshen Xu
  • Sipeng Zheng
  • Qin Jin

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representation from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representation. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization.

GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction

  • Sihan Ma
  • Qiong Cao
  • Hongwei Yi
  • Jing Zhang
  • Dacheng Tao

Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and distance change towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual-prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates good generalization and discriminating abilities in challenging cases including complex and ambiguous human-ground interactions. The code will be available at

SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody

  • Hui Lu
  • Xixin Wu
  • Zhiyong Wu
  • Helen Meng

Disentangled speech representation learning aims to separate different factors of variation from speech into disjoint representations. This paper focuses on disentangling speech into representations for three factors: spoken content, speaker timbre, and speech prosody. Many previous methods for speech disentanglement have focused on separating spoken content and speaker timbre. However, the lack of explicit modeling of prosodic information leads to degraded speech generation performance and uncontrollable prosody leakage into content and/or speaker representations. While some recent methods have utilized explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, there are no end-to-end methods to simultaneously disentangle three factors using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method to disentangle speech into representations for content, timbre, and prosody. Based on VAE, SpeechTripleNet restricts the structures of the latent variables and the amount of information captured in them to induce disentanglement. It is a pure unsupervised/self-supervised learning method that only requires speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet is effective in achieving triple-factor speech disentanglement, as well as controllable speech editing concerning different factors.

Generating Explanations for Embodied Action Decision from Visual Observation

  • Xiaohan Wang
  • Yuehu Liu
  • Xinhang Song
  • Beibei Wang
  • Shuqiang Jiang

Gaining trust is crucial for embodied agents (such as robots and autonomous vehicles) to collaborate with human beings, especially non-experts. The most direct way to reach mutual understanding is through natural language explanation. Existing research considers generating visual explanations for object recognition, while explaining embodied decisions remains unexplored. In this paper, we study generating action decisions and explanations based on visual observation. Distinct from explanations for recognition, justifying an action needs to show why it is better than other actions. Besides, an understanding of scene structure is required since the agent needs to interact with the environment (e.g., navigation, moving objects). We introduce a new dataset, THOR-EAE (Embodied Action Explanation), collected with the AI2-THOR simulator. The dataset consists of over 840,000 egocentric images of indoor embodied observation, which are annotated with the optimal action labels and explanation sentences. An explainable decision-making criterion is developed considering scene layout and action attributes for efficient annotation. We propose a graph action justification model, exploiting graph neural networks for obstacle-surroundings relations representation and justifying the actions under the guidance of decision results. Experimental results on the THOR-EAE dataset showcase its challenge and the effectiveness of the proposed method.

Scene-aware Human Pose Generation using Transformer

  • Jieteng Yao
  • Junjie Chen
  • Li Niu
  • Bin Sheng

Affordance learning considers the interaction opportunities for an actor in the scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods can be divided into two categories depending on whether they use pose templates. Our proposed method belongs to the template-based category, which benefits from the representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and the scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on the Sitcom dataset demonstrate the effectiveness of our method.

Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

  • Wanying Zhang
  • Shen Zhao
  • Fanyang Meng
  • Songtao Wu
  • Mengyuan Liu

With potential applications in fields including intelligent surveillance and human-robot interaction, human motion prediction has become a hot research topic and has achieved considerable success, especially with the recent Graph Convolutional Network (GCN). Current human motion prediction usually focuses on predicting human motions for atomic actions. Observing that atomic actions can happen at the same time and thus form composite actions, we propose the composite human motion prediction task. To handle this task, we first present a Composite Action Generation (CAG) module to generate synthetic composite actions for training, thus avoiding the laborious work of collecting composite action samples. Moreover, we alleviate the need for a more complicated model that composite actions would otherwise create by presenting a Dynamic Compositional Graph Convolutional Network (DC-GCN). Extensive experiments on the Human3.6M dataset and our newly collected CHAMP dataset consistently verify the efficiency of our DC-GCN method, which achieves state-of-the-art motion prediction accuracy while incurring little extra computational cost compared with traditional GCN-based human motion methods.

Diffusion-Augmented Depth Prediction with Sparse Annotations

  • Jiaqi Li
  • Yiran Wang
  • Zihao Huang
  • Jinghong Zheng
  • Ke Xian
  • Zhiguo Cao
  • Jianming Zhang

Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods have been proposed for the problem, but their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion models to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation

  • Chunwei Wu
  • Guitao Cao
  • Yan Li
  • Xidong Xi
  • Wenming Cao
  • Hong Wang

Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on the adaptive threshold of the learning state, customizing the learning strategy to fit the data properties best. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties. The low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground truth labels. By employing different learning strategies to propagate the labels from the inner local to outlier instances, CtO clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain the local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates from inner to outlier but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks: Office-31, Office-Home, and VisDA.

Beware of Overcorrection: Scene-induced Commonsense Graph for Scene Graph Generation

  • Lianggangxu Chen
  • Jiale Lu
  • Youqi Song
  • Changbo Wang
  • Gaoqi He

A scene graph generation task is largely restricted under a class imbalance. Previous methods have alleviated the class imbalance problem by incorporating commonsense information into the classification, enabling the prediction model to rectify the incorrect head class into the correct tail class. However, the results of commonsense-based models are typically overcorrected, e.g., the visually correct head class is forcibly modified into the wrong tail class. We argue that there are two principal reasons for this phenomenon. First, existing models ignore the semantic gap between commonsense knowledge and real scenes. Second, current commonsense fusion strategies propagate the neighbors in the visual-linguistic contexts without long-range correlation. To alleviate overcorrection, we formulate the commonsense-based scene graph generation task as two sub-problems: scene-induced commonsense graph generation (SI-CGG) and commonsense-inspired scene graph generation (CI-SGG). In the SI-CGG module, unlike conventional methods using a fixed commonsense graph, we adaptively adjust the node embeddings in a commonsense graph according to their visual appearance and configure the new reasoning edge under a specific visual context. The CI-SGG module is proposed to propagate the information from the scene-induced commonsense graph back to the scene graph. It updates the representations of each node in the scene graph by the aggregation of neighbourhood information at different scales. Through maximum likelihood optimisation of the logarithmic Gaussian process, the scene graph automatically adapts to the different neighbors in the visual-linguistic contexts. Systematic experiments on the Visual Genome dataset show that our full method achieves state-of-the-art performance.

Scene Text Segmentation with Text-Focused Transformers

  • Haiyang Yu
  • Xiaocong Wang
  • Ke Niu
  • Bin Li
  • Xiangyang Xue

Text segmentation is a crucial aspect of various text-related tasks, including text erasing, text editing, and font style transfer. In recent years, multiple text segmentation datasets, such as TextSeg focusing on Latin text segmentation and BTS on bilingual text segmentation, have been proposed. However, existing methods either disregard the annotations of text location or directly use pre-trained text detectors. In general, these methods cannot fully utilize the annotations of text location in the datasets. To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework, where text detection and segmentation are jointly optimized. In the proposed framework, we first extract multi-level global visual features through residual convolution blocks and then predict the mask of text areas using a text detection head. Subsequently, we develop a text-focused module that compels the model to pay more attention to text areas. Specifically, we introduce two types of attention masks to extract corresponding features: text-aware and instance-aware features. Finally, we employ hierarchical Transformer encoders to fuse multi-level features and predict the text mask with a text segmentation head. To evaluate the effectiveness of our method, we conduct experiments on six text segmentation benchmarks. The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art (SOTA) methods by a clear margin in most cases. The code and supplementary materials are available at

MIEP: Channel Pruning with Multi-granular Importance Estimation for Object Detection

  • Liangwei Jiang
  • Jiaxin Chen
  • Di Huang
  • Yunhong Wang

This paper investigates compressing a pre-trained deep object detector to a lightweight one by channel pruning, which has proved effective and flexible in promoting efficiency. However, the majority of existing works trim channels based on a monotonous criterion for general purposes, i.e., the importance to the task-specific loss. They are prone to overly prune intermediate layers and simultaneously leave large intra-layer redundancy, severely deteriorating the detection accuracy. To address the issues above, we propose a novel channel pruning approach with multi-granular importance estimation (MIEP), consisting of the Feature-level Object-sensitive Importance (FOI) and the Intra-layer Redundancy-aware Importance (IRI). The former puts large weights on channels that are critical for object representation through the guidance of object features from the pre-trained model, and mitigates over-pruning when combined with the task-specific loss. The latter groups highly correlated channels based on clustering, which are subsequently pruned with priority to decrease redundancy. Extensive experiments on the COCO and VOC benchmarks demonstrate that MIEP remarkably outperforms the state-of-the-art channel pruning approaches, achieves a better balance between accuracy and efficiency compared to lightweight object detectors, and generalizes well to various detection frameworks (e.g., Faster-RCNN and FSAF) and tasks (e.g., classification).

SESSION: Poster Session II: Understanding Multimedia Content -- Multimodal Fusion and Embedding

Disentangled Representation Learning with Causality for Unsupervised Domain Adaptation

  • Shanshan Wang
  • Yiyang Chen
  • Zhenwei He
  • Xun Yang
  • Mengzhu Wang
  • Quanzeng You
  • Xingyi Zhang

Most efforts in unsupervised domain adaptation (UDA) focus on learning the domain-invariant representations between the two domains. However, such representations may still confuse two patterns due to the domain gap. Considering that semantic information is useful for the final task and domain information always indicates the discrepancy between two domains, to address this issue, we propose to decouple the representations of semantic features from domain features to reduce domain bias. Different from traditional methods, we adopt a simple but effective module with only one domain discriminator to decouple the representations, offering two benefits. Firstly, it eliminates the need for labeled sample pairs, making it more suitable for UDA. Secondly, without adversarial learning, our model can achieve a more stable training phase. Moreover, to further enhance the task-specific features, we employ a causal mechanism to separate semantic features related to causal factors from the overall feature representations. Specifically, we utilize a dual-classifier strategy, where each classifier is fed with the entire features and the semantic features, respectively. By minimizing the discrepancy between the outputs of the two classifiers, the causal influence of the semantic features is accentuated. Experiments on several public datasets demonstrate the proposed model can outperform the state-of-the-art methods. Our code is available at:

Localized and Balanced Efficient Incomplete Multi-view Clustering

  • Jie Wen
  • Gehui Xu
  • Chengliang Liu
  • Lunke Fei
  • Chao Huang
  • Wei Wang
  • Yong Xu

In recent years, many incomplete multi-view clustering methods have been proposed to address the challenging unsupervised clustering issue on multi-view data with missing views. However, most of the existing works are inapplicable to large-scale clustering tasks and their clustering results are unstable, since these methods have high computational complexities and their results are produced by k-means rather than by their designed learning models. In this paper, we propose a new one-step incomplete multi-view clustering model, called Localized and Balanced Incomplete Multi-view Clustering (LBIMVC), to address these issues. Specifically, LBIMVC develops a new graph regularized incomplete multi-matrix-factorization model to obtain the unique clustering result by learning a consensus probability representation, where each element of the consensus representation can directly reflect the probability that the corresponding sample belongs to the class. In addition, the proposed graph regularized model integrates geometric preserving and consensus representation learning into one term without introducing any extra constraint terms and parameters to explore the structure of data. Moreover, to prevent samples from being over-divided into a few clusters, a balanced constraint is introduced to the model. Experimental results on four databases demonstrate that our method not only obtains competitive clustering performance, but also performs faster than some state-of-the-art methods.
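
As a rough illustration of the idea that each element of a consensus probability representation reflects the probability that a sample belongs to a class, the sketch below reads cluster labels directly off such a matrix with a row-wise argmax, so no separate k-means step is needed. The matrix values and the helper names `assign_clusters` and `cluster_sizes` are hypothetical and are not taken from the LBIMVC paper.

```python
from collections import Counter

def assign_clusters(consensus):
    """Pick, for each sample (row), the class with the highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in consensus]

def cluster_sizes(labels, n_clusters):
    """Count samples per cluster; a very uneven count signals an unbalanced partition."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(n_clusters)]

# Hypothetical 4-sample, 2-class consensus representation (each row sums to 1).
consensus = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.4, 0.6],
]
labels = assign_clusters(consensus)
sizes = cluster_sizes(labels, n_clusters=2)
```

Here `sizes` gives a quick check of how evenly samples are spread over clusters, which is the property a balanced constraint is meant to encourage.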

Interpolation Normalization for Contrast Domain Generalization

  • Mengzhu Wang
  • Junyang Chen
  • Huan Wang
  • Huisi Wu
  • Zhidan Liu
  • Qin Zhang

Domain generalization refers to the challenge of training a model from various source domains that can generalize well to unseen target domains. Contrastive learning is a promising solution that aims to learn domain-invariant representations by utilizing rich semantic relations among sample pairs from different domains. One simple approach is to bring positive sample pairs from different domains closer, while pushing negative pairs further apart. However, in this paper, we find that directly applying contrastive-based methods is not effective in domain generalization. To overcome this limitation, we propose to leverage a novel contrastive learning approach that promotes class-discriminative and class-balanced features from source domains. Essentially, sample representations from the same category are encouraged to cluster, while those from different categories are spread out, thus enhancing the model's generalization capability. Furthermore, most existing contrastive learning methods use batch normalization, which may prevent the model from learning domain-invariant features. Inspired by recent research on universal representations for neural networks, we propose a simple emulation of this mechanism by utilizing batch normalization layers to distinguish visual classes and formulating a way to combine them for domain generalization tasks. Our experiments demonstrate a significant improvement in classification accuracy over state-of-the-art techniques on popular domain generalization benchmarks, including Digits-DG, PACS, Office-Home and DomainNet.
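
The "simple approach" mentioned above, pulling positive pairs together and pushing negative pairs apart, is commonly written as an InfoNCE-style loss. The sketch below is a generic version of that loss for a single anchor, using cosine similarity and a temperature; it is an assumed baseline for illustration, not the loss proposed in the paper, and the function name `info_nce` and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor: low when the anchor is close to the
    positive and far from every negative."""
    a = normalize(anchor)
    pos = math.exp(dot(a, normalize(positive)) / temperature)
    neg = sum(math.exp(dot(a, normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D embeddings: the anchor points along the x-axis.
aligned = info_nce([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0]])     # positive is nearby
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.1]])  # negative is nearby
```

With these toy vectors, `aligned` is smaller than `misaligned`, which is the behavior the loss is designed to reward.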

Multi-teacher Self-training for Semi-supervised Node Classification with Noisy Labels

  • Yujing Liu
  • Zongqian Wu
  • Zhengyu Lu
  • Guoqiu Wen
  • Junbo Ma
  • Guangquan Lu
  • Xiaofeng Zhu

Graph neural networks (GNNs) have achieved promising results for semi-supervised learning tasks on graph-structured data. However, most existing methods assume that the training data carry correct labels, whereas in the real world, graph-structured data often carry noisy labels that reduce the effectiveness of GNNs. To address this issue, this paper proposes a new label correction method, called multi-teacher self-training (MTS-GNN for short), to conduct semi-supervised node classification with noisy labels. Specifically, we first save the parameters of the model trained in the earlier iterations as teacher models, and then use them to guide the processes, including model training, noisy label removal, and pseudo-label selection, in the later iterations of the training process of semi-supervised node classification. As a result, based on the guidance of the teacher models, the proposed method improves model effectiveness by solving the over-fitting issue, and improves the accuracy of noisy label removal and the quality of pseudo-label selection. Extensive experimental results on real datasets show that our method achieves the best effectiveness, compared to state-of-the-art methods.
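
One way to picture how saved teacher snapshots could guide noisy label removal and pseudo-label selection is to average their class-probability predictions for a node and act only when the averaged confidence is high. The sketch below follows that assumption; the averaging rule, the 0.8 threshold, and the helper names are hypothetical and are not claimed to match MTS-GNN.

```python
def average_teachers(teacher_probs):
    """Average the class-probability vectors predicted by several teacher snapshots."""
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n_teachers for c in range(n_classes)]

def consensus_label(teacher_probs, threshold=0.8):
    """Return (label, confidence) if the averaged confidence reaches the threshold, else None."""
    avg = average_teachers(teacher_probs)
    label = max(range(len(avg)), key=avg.__getitem__)
    return (label, avg[label]) if avg[label] >= threshold else None

def is_suspect(given_label, teacher_probs, threshold=0.8):
    """Flag a given label as possibly noisy when confident teachers disagree with it."""
    consensus = consensus_label(teacher_probs, threshold)
    return consensus is not None and consensus[0] != given_label

# Hypothetical predictions from three teacher snapshots for one node (two classes).
confident = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
uncertain = [[0.6, 0.4], [0.4, 0.6]]
```

Under these assumptions, an unlabeled node with `confident` predictions would receive pseudo-label 0, a node labeled 1 with the same predictions would be flagged by `is_suspect`, and a node with `uncertain` predictions would be left alone.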

Long Short-Term Graph Memory Against Class-imbalanced Over-smoothing

  • Liang Yang
  • Jiayi Wang
  • Tingting Zhang
  • Dongxiao He
  • Chuan Wang
  • Yuanfang Guo
  • Xiaochun Cao
  • Bingxin Niu
  • Zhen Wang

Most Graph Neural Networks (GNNs) follow the message-passing scheme. Residual connection is an effective strategy to tackle GNNs' over-smoothing issue and performance reduction issue on non-homophilic networks. Unfortunately, the coarse-grained residual connection still suffers from the class-imbalanced over-smoothing issue, due to the fixed and linear combination of topology and attribute in node representation learning. To make the combination flexible enough to capture complicated relationships, this paper reveals that the residual connection needs to be node-dependent, layer-dependent, and related to both topology and attribute. To alleviate the difficulty in specifying complicated relationships, this paper presents a novel perspective on GNNs, i.e., the representations of one node in different layers can be seen as a sequence of states. From this perspective, existing residual connections are not flexible enough for sequence modeling. Therefore, a novel node-dependent residual connection, i.e., Long Short-Term Graph Memory Network (LSTGM), is proposed, which employs Long Short-Term Memory (LSTM) to model the sequence of node representations. To make full use of the graph topology, LSTGM innovatively enhances the updated memory and three gates with graph topology. A speedup version is also proposed for effective training. Experimental evaluations on real-world datasets demonstrate their effectiveness in preventing the over-smoothing issue and handling networks with heterophily.
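
The "fixed and linear combination of topology and attribute" can be pictured with an initial-residual propagation step in the style of APPNP, where every node and every layer shares the same mixing weight `alpha`. This form is chosen here only as a familiar example of a coarse-grained residual connection; it is an assumption for illustration and is not LSTGM itself. The function name `propagate` and the two-node graph are hypothetical.

```python
def propagate(adj_norm, h, h0, alpha):
    """One propagation step: h_next = (1 - alpha) * A_hat @ h + alpha * h0.

    adj_norm: normalized adjacency matrix A_hat (list of rows)
    h:        current node representations
    h0:       initial node attributes (the residual branch)
    alpha:    a single scalar shared by all nodes and layers
    """
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        row = []
        for k in range(d):
            agg = sum(adj_norm[i][j] * h[j][k] for j in range(n))
            row.append((1 - alpha) * agg + alpha * h0[i][k])
        out.append(row)
    return out

# Hypothetical two-node graph where each node averages itself and its neighbor.
adj_norm = [[0.5, 0.5], [0.5, 0.5]]
h0 = [[1.0], [0.0]]

smoothed = propagate(adj_norm, h0, h0, alpha=0.0)   # no residual branch
residual = propagate(adj_norm, h0, h0, alpha=0.5)   # fixed residual weight
```

Without the residual branch the two nodes collapse to the same value after one step, while the fixed `alpha` keeps them apart; because `alpha` is a single constant, it cannot adapt per node or per layer, which is the limitation the abstract points to.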

Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning

  • Zitan Chen
  • Zhuang Qi
  • Xiao Cao
  • Xiangxian Li
  • Xiangxu Meng
  • Lei Meng

Representation learning for images has been advanced by recent progress in more complex neural models such as the Vision Transformers and new learning theories such as the structural causal models. However, these models mainly rely on the classification loss to implicitly regularize the class-level data distributions, and they may face difficulties when handling classes with diverse visual patterns. We argue that the incorporation of the structural information between data samples may improve this situation. To achieve this goal, this paper presents a framework termed Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), which includes the Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning modules to model a relational graph of the entire dataset and perform class-aware smoothing and regularization operations to alleviate the issue of intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in the feature space and identify three types of class-level sample relations for the training set; Class-aware Graph Sampling module extends typical training batch construction process with three strategies to sample dataset-level sub-graphs; and Relational Graph-Guided Representation Learning module employs a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments demonstrate the effectiveness of structured knowledge modelling for enhanced representation learning and show that CSRMS can be incorporated with any state-of-the-art visual representation learning models for performance gains. The source codes and demos have been released at

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

  • Shengkai Sun
  • Daizong Liu
  • Jianfeng Dong
  • Xiaoye Qu
  • Junyu Gao
  • Xun Yang
  • Xun Wang
  • Meng Wang

Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion), then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from the complex yet redundant multi-stream model designs, each of which is also limited to the fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features for reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with complexity comparable to that of the uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at

Little Strokes Fell Great Oaks: Boosting the Hierarchical Features for Multi-exposure Image Fusion

  • Pan Mu
  • Zhiying Du
  • Jinyuan Liu
  • Cong Bai

In recent years, deep learning networks have made remarkable strides in the domain of multi-exposure image fusion. Nonetheless, prevailing approaches often involve directly feeding over-exposed and under-exposed images into the network, which leads to the under-utilization of inherent information present in the source images. Additionally, unsupervised techniques predominantly employ rudimentary weighted summation for color channel processing, culminating in an overall desaturated final image tone. To partially mitigate these issues, this study proposes a gamma correction module specifically designed to fully leverage latent information embedded within source images. Furthermore, a modified transformer block, embracing self-attention mechanisms, is introduced to optimize the fusion process. Ultimately, a novel color enhancement algorithm is presented to augment image saturation while preserving intricate details. The source code is available at
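
For readers unfamiliar with gamma correction, the sketch below applies the classical power-law curve `out = in ** gamma` to intensities normalized to [0, 1]. It only illustrates how a gamma below 1 lifts dark regions and a gamma above 1 compresses bright ones; whether the proposed module uses this exact form is an assumption, and the function name `gamma_correct` and the sample values are hypothetical.

```python
def gamma_correct(pixels, gamma):
    """Apply out = in ** gamma to intensities, clamping inputs to [0, 1] first."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [min(1.0, max(0.0, p)) ** gamma for p in pixels]

# Hypothetical normalized intensities from an under-exposed and an over-exposed image.
under_exposed = [0.0, 0.04, 0.25, 1.0]
over_exposed = [0.0, 0.5, 0.9, 1.0]

brightened = gamma_correct(under_exposed, 0.5)  # gamma < 1 lifts dark values
darkened = gamma_correct(over_exposed, 2.0)     # gamma > 1 compresses bright values
```

Black (0.0) and white (1.0) are left unchanged by any positive gamma, so the curve only redistributes the mid-tones.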

Triple-Granularity Contrastive Learning for Deep Multi-View Subspace Clustering

  • Jing Wang
  • Songhe Feng
  • Gengyu Lyu
  • Zhibin Gu

Multi-view subspace clustering (MVSC), which leverages comprehensive information from multiple views to effectively reveal the intrinsic relationships among instances, has garnered significant research interest. However, previous MVSC research focuses on exploring the cross-view consistent information only in the instance representation hierarchy or affinity relationship hierarchy, which prevents a joint investigation of the multi-view consistency in multiple hierarchies. To this end, we propose a Triple-gRanularity contrastive learning framework for deep mUlti-view Subspace clusTering (TRUST), which benefits from the comprehensive discovery of valuable information from three hierarchies, including the instance, specific-affinity relationship, and consensus-affinity relationship. Specifically, we first use multiple view-specific autoencoders to extract noise-robust instance representations, which are then respectively input into the MLP model and self-representation model to obtain high-level instance representations and view-specific affinity matrices. Then, the instance and specific-affinity relationship contrastive regularization terms are separately imposed on the high-level instance representations and view-specific affinity matrices, ensuring the cross-view consistency can be found from the instance representations to the view-specific affinity matrices. Furthermore, multiple view-specific affinity matrices are fused into a consensus one associated with the consensus-affinity relationship contrastive constraint, which embeds the local structural relationship of high-level instance representations into the consensus affinity matrix. Extensive experiments on various datasets demonstrate that our method is more effective when compared with other state-of-the-art methods.

CTCP: Cross Transformer and CNN for Pansharpening

  • Zhao Su
  • Yong Yang
  • Shuying Huang
  • Weiguo Wan
  • Wei Tu
  • Hangyuan Lu
  • Changjie Chen

Pansharpening fuses a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to obtain an enhanced LRMS image with high spectral and spatial resolution. The current Transformer-based pansharpening methods neglect the interaction between the extracted long- and short-range features, resulting in spectral and spatial distortion in the fusion results. To address this issue, a novel cross Transformer and convolutional neural network (CNN) for pansharpening (CTCP) is proposed to achieve better fusion results by designing a cross mechanism, which can enhance the interaction between long- and short-range features. First, a dual branch feature extraction module (DBFEM) is constructed to extract the features from the LRMS and PAN images, respectively, reducing the aliasing of the two image features. In the DBFEM, to improve the feature representation ability of the network, a cross long-short-range feature module (CLSFM) is designed by combining the feature learning capabilities of Transformer and CNN via the cross mechanism, which achieves the integration of long-short-range features. Then, to improve the ability of spectral feature representation, a spectral feature enhancement fusion module (SFEFM) based on a frequency channel attention is constructed to realize feature fusion. Finally, the shallow features from the PAN image are reused to provide detail features, which are integrated with the fused features to obtain the final pansharpened results. To the best of our knowledge, this is the first attempt to introduce the cross mechanism between Transformer and CNN in the pansharpening field. Numerous experiments show that our CTCP outperforms some state-of-the-art (SOTA) approaches both subjectively and objectively. The source code will be released at

Chain of Propagation Prompting for Node Classification

  • Yonghua Zhu
  • Zhenyun Deng
  • Yang Chen
  • Robert Amor
  • Michael Witbrock

Graph Neural Networks (GNN) are an effective technique for node classification, but their performance is easily affected by the quality of the primitive graph and the limited receptive field of message-passing. In this paper, we propose a new self-attention method, namely Chain of Propagation Prompting (CPP), to address the above issues as well as reduce dependence on label information when employing self-attention for node classification. To do this, we apply the self-attention framework to reduce the impact of a low-quality graph and to obtain a maximal receptive field for the message-passing. We also design a simple pattern of message-passing as the prompt to make self-attention capture complex patterns and reduce the dependence on label information. Comprehensive experimental results on real graph datasets demonstrate that CPP outperforms all relevant comparison methods.

Efficient Multi-View Graph Clustering with Local and Global Structure Preservation

  • Yi Wen
  • Suyuan Liu
  • Xinhang Wan
  • Siwei Wang
  • Ke Liang
  • Xinwang Liu
  • Xihong Yang
  • Pei Zhang

Anchor-based multi-view graph clustering (AMVGC) has received abundant attention owing to its high efficiency and the capability to capture complementary structural information across multiple views. Intuitively, a high-quality anchor graph plays an essential role in the success of AMVGC. However, the existing AMVGC methods only consider single-structure information, i.e., local or global structure, which provides insufficient information for the learning task. To be specific, the over-scattered global structure leads to learned anchors failing to depict the cluster partition well. In contrast, the local structure with an improper similarity measure results in potentially inaccurate anchor assignment, ultimately leading to sub-optimal clustering performance. To tackle the issue, we propose a novel anchor-based multi-view graph clustering framework termed Efficient Multi-View Graph Clustering with Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified framework with a theoretical guarantee is designed to capture local and global information. Besides, EMVGC-LG jointly optimizes anchor construction and graph learning to enhance the clustering quality. In addition, EMVGC-LG inherits the linear complexity of existing AMVGC methods respecting the sample number, which is time-economical and scales well with the data size. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method.
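
The linear complexity mentioned above comes from relating each of the n samples to a small set of m anchors instead of to all other samples. The following is a minimal sketch of that idea for a single view, not the EMVGC-LG method itself: the function name `anchor_graph`, the fixed anchors, the k-nearest-anchor rule, and the Gaussian weighting are all assumptions made for illustration, whereas the paper learns anchors and graph jointly.

```python
import math


def anchor_graph(samples, anchors, k=2):
    """Build an n x m sample-to-anchor graph (hypothetical helper).

    Each sample is linked to its k nearest anchors with Gaussian weights
    that are normalised to sum to 1. Cost and storage grow as O(n * m),
    which is linear in the number of samples n when m is fixed.
    """
    graph = []
    for x in samples:
        dists = [math.dist(x, a) for a in anchors]
        # Indices of the k closest anchors for this sample.
        nearest = sorted(range(len(anchors)), key=lambda j: dists[j])[:k]
        weights = [math.exp(-dists[j] ** 2) for j in nearest]
        total = sum(weights)
        row = [0.0] * len(anchors)
        for j, w in zip(nearest, weights):
            row[j] = w / total
        graph.append(row)
    return graph


samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
anchors = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
graph = anchor_graph(samples, anchors, k=1)
```

With `k=1` every sample is assigned entirely to its closest anchor, so the first two rows point at the first anchor and the last two at the second.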

Scalable Incomplete Multi-View Clustering with Structure Alignment

  • Yi Wen
  • Siwei Wang
  • Ke Liang
  • Weixuan Liang
  • Xinhang Wan
  • Xinwang Liu
  • Suyuan Liu
  • Jiyuan Liu
  • En Zhu

The success of existing multi-view clustering (MVC) relies on the assumption that all views are complete. However, samples are usually partially available due to data corruption or sensor malfunction, which motivates research on incomplete multi-view clustering (IMVC). Although several anchor-based IMVC methods have been proposed to process large-scale incomplete data, they still suffer from the following drawbacks: i) Most existing approaches neglect the inter-view discrepancy and enforce cross-view representation to be consistent, which would corrupt the representation capability of the model; ii) Due to the sample disparity between different views, the learned anchors might be misaligned, which we refer to as the Anchor-Unaligned Problem for Incomplete data (AUP-ID). The AUP-ID causes inaccurate graph fusion and degrades clustering performance. To tackle these issues, we propose a novel incomplete anchor graph learning framework termed Scalable Incomplete Multi-View Clustering with Structure Alignment (SIMVC-SA). Specifically, we construct the view-specific anchor graph to capture the complementary information from different views. In order to solve the AUP-ID, we propose a novel structure alignment module to refine the cross-view anchor correspondence. Meanwhile, the anchor graph construction and alignment are jointly optimized in our unified framework to enhance clustering quality. Through anchor graph construction instead of full graphs, the time and space complexity of the proposed SIMVC-SA is proven to be linearly correlated with the number of samples. Extensive experiments on seven incomplete benchmark datasets demonstrate the effectiveness and efficiency of our proposed method. Our code is publicly available at

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

  • Yi Bin
  • Haoxuan Li
  • Yahui Xu
  • Xing Xu
  • Yang Yang
  • Heng Tao Shen

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30k respectively. The code is available at
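
For readers unfamiliar with the metric, the sketch below shows how Recall@K and a relative improvement of the kind reported above can be computed. It is an illustration only: the helper names and the toy similarity matrix are hypothetical, and it assumes one ground-truth match per query located at the same index, which is simpler than the MSCOCO and Flickr30K protocols where each image has several captions.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth item ranks within the top k.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.2 means 20%."""
    return (new - baseline) / baseline


# Toy scores for three queries against three candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

In the toy matrix the second query ranks a wrong candidate first, so it counts as a miss at K=1 but as a hit at K=2.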

Unbalanced Multi-view Deep Learning

  • Cai Xu
  • Zehui Li
  • Ziyu Guan
  • Wei Zhao
  • Xiangyu Song
  • Yue Wu
  • Jianxin Li

Most existing multi-view learning methods assume that the dimensions of different views are similar. In real-world applications, it is often the case that the dimension of a view may be extremely small compared with those of other views, resulting in an unbalanced multi-view learning problem. Previous methods for this problem have at least one of the following drawbacks: (1) disregarding the information of low dimensional views; (2) constructing balanced view-specific inter-instance similarity graphs or employing decision-level fusion, which cannot well learn multi-level inter-view correlations and is limited to category-related tasks such as clustering. To eliminate all these drawbacks, we present an Unbalanced Multi-view Deep Learning (UMDL) method. Considering a low dimensional view usually contains multiple patterns, we construct an overcomplete dictionary with its atoms exceeding the dimension of the original data. We transfer the original data into a combination of atoms and obtain a higher dimensional representation. We propose a sparse multi-view fusion paradigm to explicitly capture the complementarity of multi-view data in a flexible manner. Moreover, we construct positive and negative examples via balanced similarity graphs and employ contrastive learning to train UMDL in a self-supervised manner. Experiments conducted on a toy example and 7 balanced/unbalanced datasets show that UMDL outperforms baseline methods and can be well applied to downstream classification and segmentation tasks. The code is released at

Incomplete Multi-View Clustering with Regularized Hierarchical Graph

  • Shuping Zhao
  • Lunke Fei
  • Jie Wen
  • Bob Zhang
  • Pengyang Zhao

In this article, we propose a novel and effective incomplete multi-view clustering (IMVC) framework, referred to as incomplete multi-view clustering with regularized hierarchical graph (IMVC_RHG). Different from the existing graph learning-based IMVC methods, IMVC_RHG introduces a novel heterogeneous-graph learning and embedding strategy, which adopts the high-order structures between four tuples for each view, rather than a simple paired-sample intrinsic structure. Besides this, with the aid of the learned heterogeneous graphs, a between-view preserving strategy is designed to recover the incomplete graph for each view. Finally, a consensus representation for each sample is gained with a co-regularization term for final clustering. As a result of integrating these three learning strategies, IMVC_RHG can be flexibly applied to different types of IMVC tasks. Compared with the other state-of-the-art methods, the proposed IMVC_RHG achieves the best performance on real-world incomplete multi-view databases.

On Regularizing Multiple Clusterings for Ensemble Clustering by Graph Tensor Learning

  • Man-Sheng Chen
  • Jia-Qi Lin
  • Chang-Dong Wang
  • Wu-Dong Xi
  • Dong Huang

Ensemble clustering has shown its promising ability in fusing multiple base clusterings into a probably better and more robust clustering result. Typically, the co-association matrix based ensemble clustering methods attempt to integrate multiple connective matrices from base clusterings by weighted fusion to acquire a common graph representation. However, few of them are aware of the potential noise or corruption from the common representation by direct integration of different connective matrices with distinct cluster structures, and further consider the mutual information propagation between the input observations. In this paper, we propose a Graph Tensor Learning based Ensemble Clustering (GTLEC) method to refine multiple connective matrices by the substantial rank recovery and graph tensor learning. Within this framework, each input connective matrix is dexterously refined to approximate a graph structure by obeying the theoretical rank constraint with an adaptive weight coefficient. Further, we stack multiple refined connective matrices into a three-order tensor to extract their higher-order similarities via graph tensor learning, where the mutual information propagation across different graph matrices will also be promoted. Extensive experiments on several challenging datasets have confirmed the superiority of GTLEC compared with the state-of-the-art.
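
As background for the co-association matrix mentioned above, here is a minimal sketch that builds one connective matrix per base clustering and averages them with equal weights. It is not GTLEC: the function names are hypothetical, and the uniform average stands in for the adaptive weighting, rank recovery, and tensor learning described in the abstract.

```python
def connective_matrix(labels):
    """Binary n x n matrix: entry (i, j) is 1 if samples i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def co_association(base_clusterings):
    """Equal-weight average of the connective matrices of all base clusterings."""
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        conn = connective_matrix(labels)
        for i in range(n):
            for j in range(n):
                counts[i][j] += conn[i][j]
    # Entry (i, j) is the fraction of base clusterings that group i with j.
    return [[counts[i][j] / m for j in range(n)] for i in range(n)]


# Three base clusterings of four samples (cluster ids are arbitrary per clustering).
base = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
ca = co_association(base)
```

Samples 0 and 1 are grouped together by all three base clusterings, so their entry is 1, while samples 2 and 3 agree in only two of three.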

Event-guided Frame Interpolation and Dynamic Range Expansion of Single Rolling Shutter Image

  • Guixu Lin
  • Jin Han
  • Mingdeng Cao
  • Zhihang Zhong
  • Yinqiang Zheng

In the presence of abrupt motion, the pushbroom scanning mechanism of a rolling shutter (RS) camera tends to bring undesirable distortion, which has recently been shown to be beneficial for high-speed frame interpolation. Although promising results have been reported by using multiple consecutive RS frames, interpolating intermediate distortion-free frames from a single RS image is still an open question, due to the existence of multiple motions that can account for the recorded distortion. Another limitation of RS cameras in complex dynamic scenarios lies in the dynamic range, since traditional ways of multiple exposure for high dynamic range (HDR) imaging will fail due to alignment issues. To deal with these two challenges simultaneously, we propose to use an event camera for assistance, which has much faster temporal response and wider dynamic range. Since no learning data exist for this brand new imaging setup, we first build a quad-axis imaging system to capture a realistic dataset called REG-HDR, with pairs of a fully aligned RS image and its associated events, as well as their corresponding high-speed HDR GS images. We also propose a flow-based network for frame interpolation, compounded with an attention-based fusion network for dynamic range expansion. Experimental results have verified the effectiveness of our proposed algorithm and the superiority of using realistic data for this challenging dual-purpose enhancement task.

Learnable Graph Filter for Multi-view Clustering

  • Peng Zhou
  • Liang Du

Multi-view clustering is an important machine learning task for multi-media data. Recently, graph filter based multi-view clustering achieves promising performance and attracts much attention. However, the conventional graph filter based methods only use a pre-defined graph filter for each view and the used graph filters ignore the rich information among all views. Different from the conventional methods, in this paper, we aim to tackle a new problem, i.e., instead of using the pre-defined graph filters, how to construct an appropriate consensus graph filter by considering the information in all views. To achieve this, we propose a novel multi-view clustering method with graph filter learning. In our method, we learn an appropriate consensus graph filter from all views of data with multiple graph learning rather than directly pre-defining it. Then, we provide an iterative algorithm to obtain the consensus graph filter and analyze why it can lead to better clustering results. The extensive experiments on benchmark data sets demonstrate the effectiveness and superiority of the proposed method. The codes of this article are released in

Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data

  • Zhuang Qi
  • Lei Meng
  • Zitan Chen
  • Han Hu
  • Hui Lin
  • Xiangxu Meng

Federated Learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner, by leveraging the local models from different clients. Existing solutions focus on either regularizing the objective functions among clients or improving the aggregation mechanism for the improved model generalization capability. However, their performance is typically limited by the dataset biases, such as the heterogeneous data distributions and the missing classes. To address this issue, this paper presents a cross-silo prototypical calibration method (FedCSPC), which takes additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first employs the Data Prototypical Modeling (DPM) module to learn data patterns via clustering to aid calibration. Subsequently, the cross-silo prototypical calibration (CSPC) module develops an augmented contrastive learning method to improve the robustness of the calibration, which can effectively project cross-source features into a consistent space while maintaining clear decision boundaries. Moreover, the CSPC module's ease of implementation and plug-and-play characteristics make it even more remarkable. Experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study, and the results verified that FedCSPC is capable of learning the consistent features across different data sources of the same class under the guidance of calibrated model, which leads to better performance than the state-of-the-art methods. The source codes have been released at

CALM: An Enhanced Encoding and Confidence Evaluating Framework for Trustworthy Multi-view Learning

  • Hai Zhou
  • Zhe Xue
  • Ying Liu
  • Boang Li
  • Junping Du
  • Meiyu Liang
  • Yuankai Qi

Multi-view learning aims to leverage data acquired from multiple sources to achieve better performance compared to using a single view. However, the performance of multi-view learning can be negatively impacted by noisy or corrupted views in certain real-world situations. As a result, it is crucial to assess the confidence of predictions and obtain reliable learning outcomes. In this paper, we introduce CALM, an enhanced encoding and confidence evaluation framework for trustworthy multi-view classification. Our method comprises enhanced multi-view encoding, multi-view confidence-aware fusion, and multi-view classification regularization, enabling the simultaneous evaluation of prediction confidence and the yielding of trustworthy classifications. Enhanced multi-view encoding takes advantage of cross-view consistency and class diversity to improve the efficacy of the learned latent representation, facilitating more reliable classification results. Multi-view confidence-aware fusion utilizes a confidence-aware estimator to evaluate the confidence scores of classification outcomes. The final multi-view classification results are then derived through confidence-aware fusion. To achieve reliable and accurate confidence scores, multivariate Gaussian distributions are employed to model the prediction distribution. The advantage of CALM lies in its ability to evaluate the quality of each view, reducing the influence of low-quality views on the multi-view fusion process and ultimately leading to improved classification performance and confidence evaluation. Comprehensive experimental results demonstrate that our method outperforms other trusted multi-view learning methods in terms of effectiveness, reliability, and robustness.

Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding

  • Houlun Chen
  • Xin Wang
  • Xiaohan Lan
  • Hong Chen
  • Xuguang Duan
  • Jia Jia
  • Wenwu Zhu

Temporal Sentence Grounding aims to retrieve a video moment given a natural language query. Most existing literature merely focuses on visual information in videos without considering the naturally accompanied audio which may contain rich semantics. The few works considering audio simply regard it as an additional modality, overlooking that: i) it's non-trivial to explore consistency and complementarity between audio and visual; ii) such exploration requires handling different levels of information densities and noises in the two modalities. To tackle these challenges, we propose Adaptive Dual-branch Promoted Network (ADPN) to exploit such consistency and complementarity: i) we introduce a dual-branch pipeline capable of jointly training visual-only and audio-visual branches to simultaneously eliminate inter-modal interference; ii) we design Text-Guided Clues Miner (TGCM) to discover crucial locating clues via considering both consistency and complementarity during audio-visual interaction guided by text semantics; iii) we propose a novel curriculum-based denoising optimization strategy, where we adaptively evaluate sample difficulty as a measure of noise intensity in a self-aware fashion. Extensive experiments show the state-of-the-art performance of our method.

Quality-Aware RGBT Tracking via Supervised Reliability Learning and Weighted Residual Guidance

  • Lei Liu
  • Chenglong Li
  • Yun Xiao
  • Jin Tang

RGB and thermal infrared (TIR) data have different visual properties, which make their fusion essential for effective object tracking in diverse environments and scenes. Existing RGBT tracking methods commonly use attention mechanisms to generate reliability weights for multi-modal feature fusion. However, without explicit supervision, these weights may be unreliably estimated, especially in complex scenarios. To address this problem, we propose a novel Quality-Aware RGBT Tracker (QAT) for robust RGBT tracking. QAT learns reliable weights for each modality in a supervised manner and performs weighted residual guidance to extract and leverage useful features from both modalities. We address the issue of the lack of labels for reliability learning by designing an efficient three-branch network that generates reliable pseudo labels, and a simple binary classification scheme that estimates high-accuracy reliability weights, mitigating the effect of noisy pseudo labels. To propagate useful features between modalities while reducing the influence of noisy modal features on the migrated information, we design a weighted residual guidance module based on the estimated weights and residual connections. We evaluate our proposed QAT on five benchmark datasets, including GTOT, RGBT210, RGBT234, LasHeR, and VTUAV, and demonstrate its excellent performance compared to state-of-the-art methods. Experimental results show that QAT outperforms existing RGBT tracking methods in various challenging scenarios, demonstrating its efficacy in improving the reliability and accuracy of RGBT tracking.

Event-Enhanced Multi-Modal Spiking Neural Network for Dynamic Obstacle Avoidance

  • Yang Wang
  • Bo Dong
  • Yuji Zhang
  • Yunduo Zhou
  • Haiyang Mei
  • Ziqi Wei
  • Xin Yang

Autonomous obstacle avoidance is of vital importance for an intelligent agent such as a mobile robot to navigate in its environment. Existing state-of-the-art methods train a spiking neural network (SNN) with deep reinforcement learning (DRL) to achieve energy-efficient and fast inference speed in complex/unknown scenes. These methods typically assume that the environment is static while the obstacles in real-world scenes are often dynamic. The movement of obstacles increases the complexity of the environment and poses a great challenge to the existing methods. In this work, we approach robust dynamic obstacle avoidance twofold. First, we introduce the neuromorphic vision sensor (i.e., event camera) to provide motion cues complementary to the traditional laser depth data for handling dynamic obstacles. Second, we develop a DRL-based event-enhanced multimodal spiking actor network (EEM-SAN) that extracts information from motion events data via unsupervised representation learning and fuses laser and event camera data with learnable thresholding. Experiments demonstrate that our EEM-SAN outperforms state-of-the-art obstacle avoidance methods by a significant margin, especially for dynamic obstacle avoidance.

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

  • Yujun Ma
  • Benjia Zhou
  • Ruili Wang
  • Pichao Wang

RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.

M3R: Masked Token Mixup and Cross-Modal Reconstruction for Zero-Shot Learning

  • Peng Zhao
  • Qiangchang Wang
  • Yilong Yin

In zero-shot learning (ZSL), learned representation spaces are often biased toward seen classes, thus limiting the ability to predict previously unseen classes. In this paper, we propose Masked token Mixup and cross-Modal Reconstruction for zero-shot learning, termed M3R, which can significantly alleviate the bias toward seen classes. M3R mainly consists of Random Token Mixup (RTM), Unseen Class Detection (UCD), and Hard Cross-modal Reconstruction (HCR). Firstly, mappings without proper adaptations to unseen classes would cause the bias toward seen classes. To address this issue, the RTM is introduced to generate diverse unseen class agents, thereby broadening the representation space to cover unknown classes. It is applied at a randomly selected layer in the Vision Transformer, producing smooth low- and high-level representation space boundaries to cover rich attributes. Secondly, it should be noted that unseen class agents generated by the RTM may be mixed with seen class samples. To overcome this challenge, the UCD is designed to generate greater entropy values for unseen classes, thereby distinguishing seen classes from unseen classes. Thirdly, to further mitigate the bias toward seen classes and explore associations between semantics and visual images, the HCR is proposed, which can reconstruct masked pixels based on a few discriminative tokens and attribute embeddings. This approach can enable models to have a deep understanding of image contents and build powerful connections between semantic attributes and visual information. Both qualitative and quantitative results demonstrate the effectiveness and usefulness of our proposed M3R model.
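
The entropy criterion behind unseen class detection can be illustrated with a small sketch. This is not the UCD module from the paper: the function names, the example logits, and the fixed threshold are assumptions chosen for illustration. It only shows that a flat prediction over seen classes has higher entropy than a confident one, which is the signal the abstract describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def looks_unseen(logits, threshold):
    """Flag a sample as likely unseen when its prediction entropy is high.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return entropy(softmax(logits)) > threshold


confident_logits = [8.0, 0.0, 0.0, 0.0]   # peaked on one seen class
uncertain_logits = [0.1, 0.0, 0.1, 0.0]   # nearly uniform over seen classes
```

The maximum possible entropy over four classes is ln 4, reached by a uniform prediction, so any threshold should sit between 0 and that bound.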

Redundancy-aware Transformer for Video Question Answering

  • Yicong Li
  • Xun Yang
  • An Zhang
  • Chun Feng
  • Xiang Wang
  • Tat-Seng Chua

This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, the current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Subsequently, prevailing vision-language fusion designs introduce the cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, thus making a pernicious impact on the answering. To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. Upon these advancements, we find this Redundancy-aware transformer (RaFormer) can achieve state-of-the-art results on multiple VideoQA benchmarks.

Frequency-based Zero-Shot Learning with Phase Augmentation

  • Wanting Yin
  • Hongtao Xie
  • Lei Zhang
  • Jiannan Ge
  • Pandeng Li
  • Chuanbin Liu
  • Yongdong Zhang

Zero-Shot Learning (ZSL) aims to recognize images from seen and unseen classes by aligning visual and semantic knowledge (e.g., attribute descriptions). However, the fine-grained attributes in the RGB domain can be easily affected by background noise (e.g., the grey bird tail blending with the ground), making it difficult to effectively distinguish them. Analyzing the features in the frequency domain assists in better distinguishing the attributes since their patterns remain consistent across different images, unlike noise which may be more variable. Nevertheless, existing ZSL methods typically learn visual features directly from the RGB domain, which can impede the recognition of certain attributes. To overcome this limitation, we propose a novel ZSL method named Frequency-based Phase Augmentation (FPA) network, which learns an effective representation of the attributes in the frequency domain. Specifically, we introduce a Hybrid Phase Augmentation (HPA) module to transform visual features into the frequency domain and augment the phase component for better retention of semantic information of the attributes. The use of phase-augmented features enables FPA to capture more semantic knowledge that can be challenging to distinguish in the RGB domain, suppress noise, and highlight significant attributes. Our extensive experiments show that FPA achieves state-of-the-art performance across four standard datasets.
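
To make the amplitude/phase terminology concrete, the sketch below splits a signal's discrete Fourier transform into amplitude and phase and then recombines them. It is a one-dimensional toy using only the standard library, not the HPA module: the function names and the example signal are hypothetical, and how the paper actually augments the phase is not specified in the abstract.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split(spectrum):
    """Separate a spectrum into per-frequency amplitude and phase."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine(amplitude, phase):
    """Rebuild a complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


signal = [1.0, 2.0, 0.0, -1.0]
amplitude, phase = split(dft(signal))
restored = idft(combine(amplitude, phase))
# Doubling the amplitude while keeping the phase only rescales the signal.
rescaled = idft(combine([2 * a for a in amplitude], phase))
```

Keeping the phase fixed while changing the amplitude preserves the shape of the signal, which is one common intuition for why phase is treated as the carrier of structural information.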

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

  • Shiyuan Yang
  • Xiaodong Chen
  • Jing Liao

Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods. Code is available at

UniNeXt: Exploring A Unified Architecture for Vision Recognition

  • Fangjian Lin
  • Jianlong Yuan
  • Sitong Wu
  • Fan Wang
  • Zhibin Wang

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architecture in which they are first proposed, our UniNeXt architecture can steadily boost the performance of all the spatial token mixers, and narrows the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled due to a suboptimal general architecture, which further shows the importance of the study on the general architecture of vision backbone. Code is available at UniNeXt.

MCG-MNER: A Multi-Granularity Cross-Modality Generative Framework for Multimodal NER with Instruction

  • Junjie Wu
  • Chen Gong
  • Ziqiang Cao
  • Guohong Fu

Multimodal named entity recognition (MNER) is an essential task of vision and language, which aims to locate named entities and classify them into predefined categories using visual scenarios. However, existing MNER studies often suffer from bias issues with fine-grained visual cue fusion, which may produce noisy coarse-grained visual cues for MNER. To accurately capture text-image relations and better refine multimodal representations, we propose a novel instruction-based Multi-granularity Cross-modality Generative framework for MNER, namely MCG-MNER. Concretely, we introduce a multi-granularity relation propagation to infer visual clues relevant to text. Then, we propose a method to inject multi-granularity visual information into cross-modality interaction and fusion to learn a unified representation. Finally, we integrate task-specific instructions and answers for MCG-MNER. Comprehensive experimental results on three benchmark datasets, namely Twitter2015, Twitter2017 and WikiDiverse, demonstrate the superiority of our proposed method over several state-of-the-art MNER methods. We will publicly release our codes for future studies.

U2Net: A General Framework with Spatial-Spectral-Integrated Double U-Net for Image Fusion

  • Siran Peng
  • Chenhao Guo
  • Xiao Wu
  • Liang-Jian Deng

In image fusion tasks, images obtained from different sources exhibit distinct properties. Consequently, treating them uniformly with a single-branch network can lead to inadequate feature extraction. Additionally, numerous works have demonstrated that multi-scaled networks capture information more sufficiently than single-scaled models in pixel-level computer vision problems. Considering these factors, we propose U2Net, a spatial-spectral-integrated double U-shape network for image fusion. The U2Net utilizes a spatial U-Net and a spectral U-Net to extract spatial details and spectral characteristics, which allows for the discriminative and hierarchical learning of features from diverse images. In contrast to most previous works that merely employ concatenation to merge spatial and spectral information, this paper introduces a novel spatial-spectral integration structure called S2Block, which combines feature maps from different sources in a logical and effective way. We conduct a series of experiments on two image fusion tasks, including remote sensing pansharpening and hyperspectral image super-resolution (HISR). The U2Net outperforms representative state-of-the-art (SOTA) approaches in both quantitative and qualitative evaluations, demonstrating the superiority of our method. The code is available at

Modal-aware Visual Prompting for Incomplete Multi-modal Brain Tumor Segmentation

  • Yansheng Qiu
  • Ziyuan Zhao
  • Hongdou Yao
  • Delin Chen
  • Zheng Wang

In the realm of medical imaging, distinct magnetic resonance imaging (MRI) modalities can provide complementary medical insights. However, it is not uncommon for one or more modalities to be absent due to image corruption, artifacts, acquisition protocols, allergies to contrast agents, or cost constraints, posing a significant challenge for perceiving the modality-absent state in incomplete modality segmentation. In this work, we introduce a novel incomplete multi-modal segmentation framework called Modal-aware Visual Prompting (MAVP), which draws inspiration from the widely used pre-training and prompt adjustment protocol employed in natural language processing (NLP). In contrast to previous prompts that typically use textual network embeddings, we utilize embeddings as the prompts generated by a modality state classifier that focuses on the missing modality states. Additionally, we integrate modality state prompts into both the extraction stage of each modality and the modality fusion stage to facilitate intra/inter-modal adaptation. Our approach achieves state-of-the-art performance in various modality-incomplete scenarios compared to incomplete modality-specific solutions.

Where to Find Fascinating Inter-Graph Supervision: Imbalanced Graph Classification with Kernel Information Bottleneck

  • Hui Tang
  • Xun Liang

Imbalanced graph classification is ubiquitous yet challenging in many real-world applications. Existing methods typically follow the same convention of treating graph instances as discrete individuals and exploit graph neural networks (GNNs) to predict graph labels. Despite their success, they only propagate intra-graph information within a single graph while disregarding extra supervision globally derived from other graphs. In fact, inter-graph learning plays a vital role in providing more supervision for minority graphs. However, it is difficult to derive reliable inter-graph supervision because redundant information from majority graphs is introduced during propagation and obscures the representations of minority graphs. To tackle this issue, we propose a novel method that integrates the restricted random walk kernel with the global graph information bottleneck (GIB) to improve imbalanced graph classification. Specifically, the restricted random walk kernel is proposed to perform the inter-graph learning with learnable graph filters and produce kernel outputs. To ensure that the redundant information of majority graphs does not plague kernel outputs, we model the entire kernel learning as a Markovian decision process and employ the global GIB manner to optimize it. Extensive experiments on real-world graph benchmark datasets verify the competitive performance of the proposed method.
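
For readers unfamiliar with random walk kernels, the following is a minimal sketch of the classical fixed-length random walk kernel between two unlabeled graphs, computed on their direct product graph. It is not the restricted, learnable kernel proposed in the paper; the function names and the plain adjacency-list-of-lists representation are assumptions made for illustration.

```python
def kronecker(a, b):
    """Adjacency matrix of the direct product graph (Kronecker product of a and b)."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(y), len(y[0])
    return [
        [sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
        for row in x
    ]


def random_walk_kernel(adj_1, adj_2, steps):
    """Count pairs of walks with `steps` edges, one walk per graph (steps >= 1).

    A walk in the product graph corresponds to a simultaneous walk in both
    input graphs, so summing the entries of the `steps`-th power of the
    product adjacency matrix counts all such matching walk pairs.
    """
    product = kronecker(adj_1, adj_2)
    power = product
    for _ in range(steps - 1):
        power = matmul(power, product)
    return sum(sum(row) for row in power)


# Hypothetical toy graphs: a single edge and a triangle.
edge = [[0, 1], [1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
similarity = random_walk_kernel(edge, triangle, steps=2)
```

A larger kernel value means the two graphs share more walks of the given length, which is the kind of inter-graph signal the abstract refers to.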

pmBQA: Projection-based Blind Point Cloud Quality Assessment via Multimodal Learning

  • Wuyuan Xie
  • Kaimin Wang
  • Yakun Ju
  • Miaohui Wang

With the increasing communication and storage of point cloud data, there is an urgent need for an effective objective method to measure the quality before and after processing. To address this difficulty, we propose a projection-based blind quality indicator via multimodal learning for point cloud data, which can perceive both geometric distortion and texture distortion by using four homogeneous modalities (i.e., texture, normal, depth and roughness). To fully exploit the multimodal information, we further develop a deformable convolution-based alignment module and a graph-based feature fusion module, and investigate a graph node attention-based evaluation method to forecast the quality score. Extensive experimental results on three benchmark databases show that our method achieves more accurate evaluation performance in comparison with 12 competitive methods.

Dropping Pathways Towards Deep Multi-View Graph Subspace Clustering Networks

  • Zihao Zhang
  • Qianqian Wang
  • Zhiqiang Tao
  • Quanxue Gao
  • Wei Feng

Multi-view graph clustering aims to leverage different views to obtain consistent information and improve clustering performance by sharing the graph structure. Existing multi-view graph clustering algorithms generally adopt a single-pathway network reconstruction and consistent feature extraction, building on top of auto-encoders and graph convolutional networks (GCN). Despite their promising results, these single-pathway methods may ignore the significant complementary information between different layers and the rich multi-level context inside. On the other hand, GCN usually employs a shallow network structure (2-3 layers) due to the over-smoothing with the increase of network depth, while few multi-view graph clustering methods explore the performance of deep networks. In this work, we propose a novel Dropping Pathways strategy toward building a deep Multi-view Graph Subspace Clustering network, namely DPMGSC, to fully exploit the deep and multi-level graph network representations. The proposed method implements a multi-pathway self-expressive network to capture pairwise affinities of graph nodes among multiple views. Moreover, we empirically study the impact of a series of dropping methods on deep multi-pathway networks. Extensive experiments demonstrate the effectiveness of the proposed DPMGSC compared with its deep counterpart and state-of-the-art methods.

Multi-view Graph Clustering via Efficient Global-Local Spectral Embedding Fusion

  • Penglei Wang
  • Danyang Wu
  • Rong Wang
  • Feiping Nie

With the proliferation of multimedia applications, data is frequently derived from multiple sources, leading to the accelerated advancement of multi-view clustering (MVC) methods. In this paper, we propose a novel MVC method, termed GLSEF, to handle the inconsistency existing in multiple spectral embeddings. To this end, GLSEF contains a two-level learning mechanism. Specifically, on the global level, GLSEF considers the diversity of features and selectively assigns smooth weights to partial more discriminative features that are conducive to clustering. On the local level, GLSEF resorts to the Grassmann manifold to maintain spatial and topological information and local structure in each view, thereby enhancing its suitability and accuracy for clustering. Moreover, unlike most previous methods that learn a low-dimension embedding and perform the k-means algorithm to obtain the final cluster labels, GLSEF directly acquires the discrete indicator matrix to prevent potential information loss during post-processing. To address the optimization involved in GLSEF, we present an efficient alternating optimization algorithm accompanied by convergence and time complexity analyses. Extensive empirical results on nine real-world datasets demonstrate the effectiveness and efficiency of GLSEF compared to existing state-of-the-art MVC methods.

Debunking Free Fusion Myth: Online Multi-view Anomaly Detection with Disentangled Product-of-Experts Modeling

  • Hao Wang
  • Zhi-Qi Cheng
  • Jingdong Sun
  • Xin Yang
  • Xiao Wu
  • Hongyang Chen
  • Yan Yang

Multi-view or even multi-modal data is appealing yet challenging for real-world applications. Detecting anomalies in multi-view data is a prominent recent research topic. However, most of the existing methods 1) are only suitable for two views or type-specific anomalies, 2) suffer from the issue of fusion disentanglement, and 3) do not support online detection after model deployment. To address these challenges, our main ideas in this paper are three-fold: multi-view learning, disentangled representation learning, and generative model. To this end, we propose dPoE, a novel multi-view variational autoencoder model that involves (1) a Product-of-Experts (PoE) layer in tackling multi-view data, (2) a Total Correlation (TC) discriminator in disentangling view-common and view-specific representations, and (3) a joint loss function in wrapping up all components. In addition, we devise theoretical information bounds to control both view-common and view-specific representations. Extensive experiments on six real-world datasets demonstrate that the proposed dPoE outperforms baselines markedly.
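
As background on the Product-of-Experts idea, the sketch below fuses several one-dimensional Gaussian "experts" (one per view) into a single Gaussian by multiplying their densities. It assumes Gaussian experts and an optional standard normal prior expert, as is common in multi-view variational autoencoders; the abstract does not state these details, and the function name is hypothetical.

```python
def product_of_experts(means, variances, include_prior=True):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) into one Gaussian.

    The product of Gaussian densities is again Gaussian (up to scale):
    precisions (1 / variance) add up, and the fused mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    weighted_means = [m / v for m, v in zip(means, variances)]
    if include_prior:
        # Assumed prior expert N(0, 1): precision 1, weighted mean 0.
        precisions.append(1.0)
        weighted_means.append(0.0)
    fused_variance = 1.0 / sum(precisions)
    fused_mean = sum(weighted_means) * fused_variance
    return fused_mean, fused_variance


# Two hypothetical views that disagree about a latent coordinate.
mean, variance = product_of_experts([1.0, 3.0], [1.0, 1.0], include_prior=False)
```

Because precisions add, a confident view (small variance) dominates the fused estimate, and a missing view can simply be left out of the product.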

Domain-irrelevant Feature Learning for Generalizable Pan-sharpening

  • Yunlong Lin
  • Zhenqi Fu
  • Ge Meng
  • Yingying Wang
  • Yuhang Dong
  • Linyu Fan
  • Hedeng Yu
  • Xinghao Ding

Pan-sharpening aims to spatially enhance the low-resolution multispectral image (LRMS) by transferring high-frequency details from a panchromatic image (PAN) while preserving the spectral characteristics of LRMS. Prior work mainly focuses on how to learn a high-resolution multispectral image (HRMS) under the i.i.d. assumption. However, the distribution of training and testing data often encounters significant shifts in different satellites. To this end, this paper proposes a generalizable pan-sharpening network via domain-irrelevant feature learning. On the one hand, a structural preservation module (STP) is designed to fuse high-frequency information of PAN and LRMS. Our STP is performed on the gradient domain because it consists of structure and texture details that can generalize well on different satellites. On the other hand, to avoid spectral distortion while promoting the generalization ability, a spectral preservation module (SPP) is developed. The key design of SPP is to learn a phase fusion network of PAN and LRMS. The amplitude of LRMS, which contains 'satellite style' information, is directly injected in different fusion stages. Extensive experiments have demonstrated the effectiveness of our method against state-of-the-art methods in both single-satellite and cross-satellite scenarios. Code is available at:
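
The amplitude and phase mentioned above are the two components of a Fourier spectrum. The following minimal sketch, on one-dimensional signals and with a naive DFT, shows how the amplitude of one signal can be recombined with the phase of another. It only illustrates the decomposition; the learned phase fusion network and the staged amplitude injection in the paper are not reproduced, and all function names are assumptions.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def recombine(amplitude_source, phase_source):
    """Build a real signal with the amplitude spectrum of one real input
    and the phase spectrum of the other (inputs must have equal length)."""
    spec_a = dft(amplitude_source)
    spec_p = dft(phase_source)
    mixed = [cmath.rect(abs(a), cmath.phase(p)) for a, p in zip(spec_a, spec_p)]
    # For real inputs the mixed spectrum stays conjugate-symmetric, so the
    # imaginary part of the inverse transform is only rounding noise.
    return [value.real for value in idft(mixed)]


# Hypothetical stand-ins for one row of an LRMS band and a PAN image.
lrms_row = [1.0, 2.0, 0.5, -1.0]
pan_row = [0.0, 1.0, 0.0, 0.0]
fused_row = recombine(lrms_row, pan_row)
```

Recombining a signal with itself returns the original signal, which is a quick sanity check that the decomposition is lossless.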

Depth-aided Camouflaged Object Detection

  • Qingwei Wang
  • Jinyu Yang
  • Xiaosheng Yu
  • Fangyi Wang
  • Peng Chen
  • Feng Zheng

Camouflaged Object Detection (COD) aims to identify and segment objects that blend into their surroundings. Since the color and texture of the camouflaged objects are extremely similar to the surrounding environment, it is extremely challenging for vision models to precisely detect them. Inspired by research on biology and evolution, we introduce depth information as an additional cue to help break camouflage, which can provide spatial information and texture-free separation for foreground and background. To mine clues of camouflaged objects in both RGB and depth modalities, we propose Depth-aided Camouflaged Object Detection (DaCOD), which involves two key components. We first propose the Multi-modal Collaborative Learning (MCL) module, which aims to collaboratively learn deep features from both RGB and depth channels via a hybrid backbone. Then, we propose a novel Cross-modal Asymmetric Fusion (CAF) strategy, which asymmetrically fuses RGB and depth information for complementary depth feature enhancement to produce accurate predictions. We conducted numerous experiments of the proposed DaCOD on three widely-used challenging COD benchmark datasets, in which DaCOD outperforms the current state-of-the-art methods by a large margin. All resources are available at

SemanticRT: A Large-Scale Dataset and Method for Robust Semantic Segmentation in Multispectral Images

  • Wei Ji
  • Jingjing Li
  • Cheng Bian
  • Zhicheng Zhang
  • Li Cheng

Growing interests in multispectral semantic segmentation (MSS) have been witnessed in recent years, thanks to the unique advantages of combining RGB and thermal infrared images to tackle challenging scenarios with adverse conditions. However, unlike traditional RGB-only semantic segmentation, the lack of a large-scale MSS dataset has become a hindrance to the progress of this field. To address this issue, we introduce a SemanticRT dataset - the largest MSS dataset to date, comprising 11,371 high-quality, pixel-level annotated RGB-thermal image pairs. It is 7 times larger than the existing MFNet dataset, and covers a wide variety of challenging scenarios in adverse lighting conditions such as low-light and pitch black. Further, a novel Explicit Complement Modeling (ECM) framework is developed to extract modality-specific information, which is propagated through a robust cross-modal feature encoding and fusion process. Extensive experiments demonstrate the advantages of our approach and dataset over the existing counterparts. Our new dataset may also facilitate further development and evaluation of existing and new MSS algorithms.

MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid

  • Zhuo Chen
  • Jiaoyan Chen
  • Wen Zhang
  • Lingbing Guo
  • Yin Fang
  • Yufeng Huang
  • Yichi Zhang
  • Yuxia Geng
  • Jeff Z. Pan
  • Wenting Song
  • Huajun Chen

Multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs (KGs) whose entities are associated with relevant images. However, current MMEA algorithms rely on KG-level modality fusion strategies for multi-modal entity representation, which ignores the variations of modality preferences of different entities, thus compromising robustness against noise in modalities such as blurry images and relations. This paper introduces MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, which dynamically predicts the mutual correlation coefficients among modalities for more fine-grained entity-level modality fusion and alignment. Experimental results demonstrate that our model not only achieves SOTA performance in multiple training scenarios, including supervised, unsupervised, iterative, and low-resource settings, but also has a limited number of parameters, efficient runtime, and interpretability. Our code is available at

Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing

  • Jiayi Zhang
  • Weixin Li

The weakly supervised audio-visual video parsing (AVVP) task aims to parse a video into a set of modality-wise events (i.e., audible, visible, or both), recognize categories of these events, and localize their temporal boundaries. Given the prevalence of audio-visual synchronous and asynchronous contents in multi-modal videos, it is crucial to capture and integrate the contextual events occurring at different moments and temporal scales. Although some researchers have made preliminary attempts at modeling event semantics with various temporal lengths, they mostly only perform a late fusion of multi-scale features across modalities. A comprehensive cross-modal and multi-scale temporal fusion strategy remains largely unexplored in the literature. To address this gap, we propose a novel framework named Audio-Visual Fusion Architecture Search (AVFAS) that can automatically find the optimal multi-scale temporal fusion strategy within and between modalities. Our framework generates a set of audio and visual features with distinct temporal scales and employs three modality-wise modules to search multi-scale feature selection and fusion strategies, jointly modeling modality-specific discriminative information. Furthermore, to enhance the alignment of audio-visual asynchrony, we introduce a Position- and Length-Adaptive Temporal Attention (PLATA) mechanism for cross-modal feature fusion. Extensive quantitative and qualitative experimental results demonstrate the effectiveness and efficiency of our framework.

Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning

  • Jiaqi Li
  • Guilin Qi
  • Chuanyi Zhang
  • Yongrui Chen
  • Yiming Tan
  • Chenlong Xia
  • Ye Tian

Multimodal movie genre classification has always been regarded as a demanding multi-label classification task due to the diversity of multimodal data such as posters, plot summaries, trailers and metadata. Although existing works have made great progress in modeling and combining each modality, they still face three issues: 1) unutilized group relations in metadata, 2) unreliable attention allocation, and 3) indiscriminative fused features. Given that the knowledge graph has been proven to contain rich information, we present a novel framework that exploits the knowledge graph from various perspectives to address the above problems. As a preparation, the metadata is processed into a domain knowledge graph. A translate model for knowledge graph embedding is adopted to capture the relations between entities. First, we retrieve the relevant embedding from the knowledge graph by utilizing group relations in metadata and then integrate it with other modalities. Next, we introduce an Attention Teacher module for reliable attention allocation based on self-supervised learning. It learns the distribution of the knowledge graph and produces rational attention weights. Finally, a Genre-Centroid Anchored Contrastive Learning module is proposed to strengthen the discriminative ability of fused features. The embedding space of anchors is initialized from the genre entities in the knowledge graph. To verify the effectiveness of our framework, we collect MM-IMDb 2.0, a larger and more challenging dataset than the MM-IMDb dataset. The experimental results on two datasets demonstrate that our model is superior to the state-of-the-art methods. Our code and dataset are available at

Multi-scale Spatial-Spectral Attention Guided Fusion Network for Pansharpening

  • Yong Yang
  • Mengzhen Li
  • Shuying Huang
  • Hangyuan Lu
  • Wei Tu
  • Weiguo Wan

Pansharpening fuses high-resolution panchromatic (PAN) images with low-resolution multispectral (LR-MS) images to generate high-resolution multispectral (HR-MS) images. Most of the deep learning-based pansharpening methods did not consider the inconsistency of the PAN and LR-MS images and used simple concatenation to fuse the source images, which may cause spectral and spatial distortion in the fused results. To address this problem, a multi-scale spatial-spectral attention guided fusion network for pansharpening is proposed. First, the spatial features from the PAN image and spectral features from the LR-MS image are independently extracted to obtain the shallow features. Then, a spatial-spectral attention feature fusion module (SAFFM) is constructed to guide the reconstruction of spatial-spectral features by generating a guidance map to achieve the fusion of reconstructed features at different scales. In SAFFM, the guidance map is designed to ensure the spatial-spectral consistency of the reconstructed features. Finally, considering the difference between multi-scale features, a multi-level feature integration scheme is proposed to progressively achieve fusion of multi-scale features from different SAFFMs. Extensive experiments validate the effectiveness of the proposed network against other state-of-the-art (SOTA) pansharpening methods in both quantitative and qualitative assessments. The source code will be released at

Modality Profile - A New Critical Aspect to be Considered When Generating RGB-D Salient Object Detection Training Set

  • Xuehao Wang
  • Shuai Li
  • Chenglizhao Chen
  • Aimin Hao
  • Hong Qin

It is widely acknowledged that selecting appropriate training data is crucial for obtaining good results in real-world testing, more so than utilizing complex network architectures. However, in the field of RGB-D SOD research, researchers have primarily focused on enhancing network architectures and have given less consideration to the choice of training and testing datasets, which may not translate well in practical applications. This paper aims to address an existing issue - how can we automatically generate a data-driven RGB-D SOD training dataset? We propose that in addition to scene similarity, the concept of "modality profile" should be taken into account. The term "modality profile" refers to the complementary status of modalities within a given dataset. A training dataset with a modality profile similar to the test dataset can significantly improve performance. To address this, we present a viable solution for automatically generating a training dataset with any desired modality profile in a weakly supervised manner. Our method also provides high-quality pseudo-GTs for all RGB-D images obtained from the web, making it suitable for training RGB-D SOD models. Extensive quantitative evaluations demonstrate the significance of the proposed "modality profile" and confirm the superiority of the newly constructed training set guided by our "modality profile". All codes, datasets, and results are available at this link.

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

  • Meng Liu
  • Ke Liang
  • Dayu Hu
  • Hao Yu
  • Yue Liu
  • Lingyuan Meng
  • Wenxuan Tu
  • Sihang Zhou
  • Xinwang Liu

Audiovisual data is everywhere in this digital age, which raises higher requirements for the deep learning models developed on them. Handling the information in multi-modal data well is the key to a better audiovisual model. We observe that these audiovisual data naturally have temporal attributes, such as the time information for each frame in the video. More concretely, such data is inherently multi-modal according to both audio and visual cues, which proceed in a strict chronological order. It indicates that temporal information is important in multi-modal acoustic event modeling, both within and across modalities. However, existing methods deal with each modal feature independently and simply fuse them together, which neglects the mining of temporal relations and thus leads to sub-optimal performance. With this motivation, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, by modeling such temporal information via graph learning techniques. In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments. Each segment can be considered as a node, and the temporal relationships between nodes can be considered as timestamps on their edges. In this case, we can smoothly capture the dynamic information both within and across modalities. Several experiments are conducted to demonstrate that TMac outperforms other SOTA models in performance. Our code is available at
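
The temporal graph described above (segments as nodes, timestamps on edges) can be sketched as follows. The abstract does not specify which node pairs are connected, so the wiring here is an assumption: consecutive segments of the same modality are linked, and audio and video segments covering the same time span are linked. The function name and the equal-length segmentation are likewise hypothetical.

```python
def build_temporal_graph(num_segments, segment_seconds):
    """Return (nodes, edges) for one audiovisual event.

    Nodes are (modality, segment_index) pairs. Each edge is a tuple
    (source_node, target_node, timestamp_in_seconds).
    """
    modalities = ("audio", "video")
    nodes = [(m, i) for m in modalities for i in range(num_segments)]
    edges = []
    for i in range(num_segments):
        start = i * segment_seconds
        # Inter-modal edge: audio and video segments of the same time span,
        # stamped with the start time of that span.
        edges.append((("audio", i), ("video", i), start))
        if i + 1 < num_segments:
            next_start = (i + 1) * segment_seconds
            # Intra-modal edges: each segment points to its successor,
            # stamped with the time at which the successor begins.
            for m in modalities:
                edges.append(((m, i), (m, i + 1), next_start))
    return nodes, edges


# A hypothetical 6-second event cut into three 2-second segments.
nodes, edges = build_temporal_graph(num_segments=3, segment_seconds=2.0)
```

Edges come out in chronological order, so a temporal graph learner can consume them as an event stream.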

FOLT: Fast Multiple Object Tracking from UAV-captured Videos Based on Optical Flow

  • Mufeng Yao
  • Jiaqi Wang
  • Jinlong Peng
  • Mingmin Chi
  • Chao Liu

Multiple object tracking (MOT) has been successfully investigated in computer vision. However, MOT for the videos captured by unmanned aerial vehicles (UAV) is still challenging due to small object size, blurred object appearance, and very large and/or irregular motion in both ground objects and UAV platforms. In this paper, we propose FOLT to mitigate these problems and reach fast and accurate MOT in UAV view. Aiming at speed-accuracy trade-off, FOLT adopts a modern detector and light-weight optical flow extractor to extract object detection features and motion features at a minimum cost. Given the extracted flow, the flow-guided feature augmentation is designed to augment the object detection feature based on its optical flow, which improves the detection of small objects. Then the flow-guided motion prediction is also proposed to predict the object's position in the next frame, which improves the tracking performance of objects with very large displacements between adjacent frames. Finally, the tracker matches the detected objects and predicted objects using a spatially matching scheme to generate tracks for every object. Experiments on Visdrone and UAVDT datasets show that our proposed model can successfully track small objects with large and irregular motion and outperform existing state-of-the-art methods in UAV-MOT tasks.
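
To make the flow-guided prediction and spatial matching steps concrete, here is a minimal sketch under stated assumptions: the next-frame box is obtained by shifting the current box by the mean optical flow inside it, and predicted and detected boxes are compared by intersection over union (IoU). The abstract does not give these details, so both choices, and all names below, are illustrative rather than the FOLT implementation.

```python
def predict_next_box(box, flow):
    """Shift an integer pixel box (x1, y1, x2, y2) by the mean flow inside it.

    `flow[y][x]` is a (dx, dy) displacement for pixel (x, y); the box is
    half-open, covering columns x1..x2-1 and rows y1..y2-1.
    """
    x1, y1, x2, y2 = box
    vectors = [flow[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    mean_dx = sum(v[0] for v in vectors) / len(vectors)
    mean_dy = sum(v[1] for v in vectors) / len(vectors)
    return (x1 + mean_dx, y1 + mean_dy, x2 + mean_dx, y2 + mean_dy)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical 6x6 flow field in which everything moves right 2 px and up 1 px.
uniform_flow = [[(2.0, -1.0) for _ in range(6)] for _ in range(6)]
predicted = predict_next_box((1, 2, 3, 4), uniform_flow)
```

A tracker could then assign each detection in the next frame to the track whose predicted box has the highest IoU with it, which keeps fast-moving objects matched even when consecutive detections barely overlap.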

ScribbleVC: Scribble-supervised Medical Image Segmentation with Vision-Class Embedding

  • Zihan Li
  • Yuan Zheng
  • Xiangde Luo
  • Dandan Shan
  • Qingqi Hong

Medical image segmentation plays a critical role in clinical decision-making, treatment planning, and disease monitoring. However, accurate segmentation of medical images is challenging due to several factors, such as the lack of high-quality annotation, imaging noise, and anatomical differences across patients. In addition, there is still a considerable gap in performance between the existing label-efficient methods and fully-supervised methods. To address the above challenges, we propose ScribbleVC, a novel framework for scribble-supervised medical image segmentation that leverages vision and class embeddings via the multimodal information enhancement mechanism. In addition, ScribbleVC uniformly utilizes the CNN features and Transformer features to achieve better visual feature extraction. The proposed method combines a scribble-based approach with a segmentation network and a class-embedding module to produce accurate segmentation masks. We evaluate ScribbleVC on three benchmark datasets and compare it with state-of-the-art methods. The experimental results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency. The datasets and code are released on GitHub.

Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation

  • Jiaqing Fan
  • Tiankang Su
  • Kaihua Zhang
  • Bo Liu
  • Qingshan Liu

Spatial-temporal structural details of targets in video (e.g. varying edges, textures over time) are essential to accurate Unsupervised Video Object Segmentation (UVOS). The vanilla multi-head self-attention in the Transformer-based UVOS methods usually concentrates on learning the general low-frequency information (e.g. illumination, color), while neglecting the high-frequency texture details, leading to unsatisfying segmentation results. To address this issue, this paper presents a Temporally efficient Gabor Transformer (TGFormer) for UVOS. The TGFormer jointly models the spatial dependencies and temporal coherence intra- and inter-frames, which can fully capture the rich structural details for accurate UVOS. Concretely, we first propose an effective learnable Gabor filtering Transformer to mine the structural texture details of the object for accurate UVOS. Then, to adaptively store the redundant neighboring historical information, we present an efficient dynamic neighboring frame selection module to automatically choose the useful temporal information, which simultaneously relieves the blurry frame and reduces the computation burden. Finally, we make the UVOS model a fully Transformer architecture, meanwhile aggregating the information from space, Gabor and time domains, yielding a strong representation with rich structure details. Extensive experiments on five mainstream UVOS benchmarks (DAVIS2016, FBMS, DAVSOD, ViSal, and MCL) demonstrate the superiority of the presented solution to state-of-the-art methods.
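
For reference, a Gabor filter is a sinusoidal carrier modulated by a Gaussian envelope, which makes it selective for texture at a given orientation and frequency. The sketch below builds a standard real-valued 2-D Gabor kernel with fixed parameters; in a learnable variant such as the one the abstract describes, parameters like these would presumably be trained, but that is an assumption and the function name is hypothetical.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real Gabor kernel of shape size x size (size should be odd).

    theta: orientation in radians, wavelength: period of the carrier,
    sigma: width of the Gaussian envelope, gamma: spatial aspect ratio,
    psi: phase offset of the carrier.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot**2 + (gamma * y_rot) ** 2) / (2 * sigma**2))
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# A hypothetical 5x5 kernel tuned to patterns that vary along the x axis.
kernel = gabor_kernel(size=5, wavelength=4.0, theta=0.0, sigma=2.0)
```

Convolving an image with a bank of such kernels at several orientations and wavelengths emphasizes the high-frequency edges and textures that plain self-attention tends to smooth over.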

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

  • Haowei Wang
  • Jiji Tang
  • Jiayi Ji
  • Xiaoshuai Sun
  • Rongsheng Zhang
  • Yiwei Ma
  • Minda Zhao
  • Lincheng Li
  • Zeng Zhao
  • Tangjie Lv
  • Rongrong Ji

In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and achieves an improvement of up to 6.5% accuracy on PointNet++ in top-1 accuracy for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at

Hierarchical Visual Attribute Learning in the Wild

  • Kongming Liang
  • Xinran Wang
  • Haiwen Zhang
  • Zhanyu Ma
  • Jun Guo

Observing objects' attributes at different levels of detail is a fundamental aspect of how humans perceive and understand the world around them. Existing studies focused on attribute prediction in a flat way, but they overlook the underlying attribute hierarchy, e.g., navy blue is a subcategory of blue. In recent years, large language models (LLMs), e.g., ChatGPT, have emerged with the ability to perform an extensive range of natural language processing tasks like text generation and classification. The factual knowledge learned by LLMs can help us build the hierarchical relations of visual attributes in the wild. Based on that, we propose a model called the object-specific attribute relation net, which takes advantage of three types of relations among attributes - positive, negative, and hierarchical - to better facilitate attribute recognition in images. Guided by the extracted hierarchical relations, our model can predict attributes from coarse to fine. Additionally, we introduce several evaluation metrics for attribute hierarchy to comprehensively assess the model's ability to comprehend hierarchical relations. Our extensive experiments demonstrate that our proposed hierarchical annotation brings improvements to the model's understanding of hierarchical relations of attributes, and the object-specific attribute relation net can recognize visual attributes more accurately.

Hierarchical Semantic Enhancement Network for Multimodal Fake News Detection

  • Qiang Zhang
  • Jiawei Liu
  • Fanrui Zhang
  • Jingyi Xie
  • Zheng-Jun Zha

The explosion of multimodal fake news content on social media has sparked widespread concern. Existing multimodal fake news detection methods have made significant contributions to the development of this field, but fail to adequately exploit the potential semantic information of images and ignore the noise embedded in news entities, which severely limits the performance of the models. In this paper, we propose a novel Hierarchical Semantic Enhancement Network (HSEN) for multimodal fake news detection by learning text-related image semantic and precise news high-order knowledge semantic information. Specifically, to complement the image semantic information, HSEN utilizes textual entities as the prompt subject vocabulary and applies reinforcement learning to discover the optimal prompt format for generating image captions specific to the corresponding textual entities, which contain multi-level cross-modal correlation information. Moreover, HSEN extracts visual and textual entities from image and text, and identifies additional visual entities from image captions to extend image semantic knowledge. Based on that, HSEN exploits an adaptive hard attention mechanism to automatically select strongly related news entities and remove irrelevant noise entities to obtain precise high-order knowledge semantic information, while generating attention mask for guiding cross-modal knowledge interaction. Extensive experiments show that our method outperforms state-of-the-art methods.

Towards Balanced Active Learning for Multimodal Classification

  • Meng Shen
  • Yizheng Huang
  • Jianxiong Yin
  • Heqing Zou
  • Deepu Rajan
  • Simon See

Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, a novel approach is proposed to achieve more fair data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification.

Learning Event-Specific Localization Preferences for Audio-Visual Event Localization

  • Shiping Ge
  • Zhiwei Jiang
  • Yafeng Yin
  • Cong Wang
  • Zifeng Cheng
  • Qing Gu

Audio-Visual Event Localization (AVEL) aims to locate events that are both visible and audible in a video. Existing AVEL methods primarily focus on learning generic localization patterns that are applicable to all events. However, events often exhibit modality biases, such as visual-dominated, audio-dominated, or modality-balanced, which can lead to different localization preferences. These preferences may be overlooked by existing methods, resulting in unsatisfactory localization performance. To address this issue, this paper proposes a novel event-aware localization paradigm, which first identifies the event category and then leverages localization preferences specific to that event for improved event localization. To achieve this, we introduce a memory-assisted metric learning framework, which utilizes historic segments as anchors to adjust the unified representation space for both event classification and event localization. To provide sufficient information for this metric learning, we design a spatial-temporal audio-visual fusion encoder to capture the spatial and temporal interaction between audio and visual modalities. Extensive experiments on the public AVE dataset in both fully-supervised and weakly-supervised settings demonstrate the effectiveness of our approach. Code will be released at

Object Segmentation by Mining Cross-Modal Semantics

  • Zongwei Wu
  • Jingjing Wang
  • Zhuyun Zhou
  • Zhaochong An
  • Qiuping Jiang
  • Cédric Demonceaux
  • Guolei Sun
  • Radu Timofte

Multi-sensor clues have shown promise for object segmentation, but inherent noise in each sensor, as well as the calibration error in practice, may bias the segmentation accuracy. In this paper, we propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features, with the aim of controlling the modal contribution based on relative entropy. We explore semantics among the multimodal inputs in two aspects: the modality-shared consistency and the modality-specific variation. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision. On the one hand, the AF block explicitly dissociates the shared and specific representation and learns to weight the modal contribution by adjusting the proportion, region, and pattern, depending upon the quality. On the other hand, our CFD initially decodes the shared feature and then refines the output through specificity-aware querying. Further, we enforce semantic consistency across the decoding layers to enable interaction across network hierarchies, improving feature discriminability. Exhaustive comparisons on eleven datasets with depth or thermal clues, and on two challenging tasks, namely salient and camouflage object segmentation, validate the effectiveness of our method in terms of both performance and robustness. The source code is publicly available at

PSNEA: Pseudo-Siamese Network for Entity Alignment between Multi-modal Knowledge Graphs

  • Wenxin Ni
  • Qianqian Xu
  • Yangbangyan Jiang
  • Zongsheng Cao
  • Xiaochun Cao
  • Qingming Huang

Multi-modal entity alignment aims to identify entities that refer to the same concept in the real world across a plethora of multi-modal knowledge graphs (MMKGs). Most existing methods focus on reducing the embedding differences between multiple modalities while neglecting the following challenges: 1) they cannot handle the heterogeneity across graphs, and 2) they suffer from the scarcity of pre-aligned data (a.k.a. initial seeds). To tackle these issues, we propose a Pseudo-Siamese Network for multi-modal Entity Alignment (PSNEA). It consists of two modules to extract various information and generate holistic embeddings. Specifically, the first module, PSN, is designed with two parallel branches to learn the representations for different MMKGs, thus effectively bridging the graph heterogeneity. On top of this, we introduce an Incremental Alignment Pool (IAP) to alleviate the scarcity of initial seeds by labeling likely alignments. IAP reduces erroneous labels through data swapping and sample re-weighting strategies. To the best of our knowledge, PSNEA is the first model that tackles graph heterogeneity and scarcity of initial seeds in one unified framework. Extensive experiments demonstrate that our model achieves the best performance on both cross-lingual and cross-graph datasets. The source code is available at

Federated Deep Multi-View Clustering with Global Self-Supervision

  • Xinyue Chen
  • Jie Xu
  • Yazhou Ren
  • Xiaorong Pu
  • Ce Zhu
  • Xiaofeng Zhu
  • Zhifeng Hao
  • Lifang He

Federated multi-view clustering has the potential to learn a global clustering model from data distributed across multiple devices. In this setting, label information is unknown and data privacy must be preserved, leading to two major challenges. First, views on different clients often have feature heterogeneity, and mining their complementary cluster information is not trivial. Second, the storage and usage of data from multiple clients in a distributed environment can lead to incompleteness of multi-view data. To address these challenges, we propose a novel federated deep multi-view clustering method that can mine complementary cluster structures from multiple clients, while dealing with data incompleteness and privacy concerns. Specifically, in the server environment, we propose sample alignment and data extension techniques to explore the complementary cluster structures of multiple views. The server then distributes global prototypes and global pseudo-labels to each client as global self-supervised information. In the client environment, multiple clients use the global self-supervised information and deep autoencoders to learn view-specific cluster assignments and embedded features, which are then uploaded to the server for refining the global self-supervised information. Finally, the results of our extensive experiments demonstrate that our proposed method exhibits superior performance in addressing the challenges of incomplete multi-view data in distributed environments.

Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

  • Sung Jin Um
  • Dongjin Kim
  • Jung Uk Kim

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio in an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at:

Hypergraph-Enhanced Hashing for Unsupervised Cross-Modal Retrieval via Robust Similarity Guidance

  • Fangming Zhong
  • Chenglong Chu
  • Zijie Zhu
  • Zhikui Chen

Unsupervised cross-modal hashing retrieval across image and text modalities is a challenging task because of the suboptimality of similarity guidance, i.e., the joint similarity matrix constructed by existing methods does not possess clear enough guiding significance. Constructing a more robust similarity matrix is the key to solving this problem. Graph-based unsupervised cross-modal retrieval methods perform well in mining semantic information from input samples, but graph hashing based on a traditional affinity graph cannot effectively capture the high-order semantic information of input samples. To overcome these limitations, this paper presents a novel hypergraph-based approach for unsupervised cross-modal retrieval that differs from previous works in two significant ways. Firstly, to address the ubiquitous redundant information present in current methods, this paper introduces a robust method for constructing the similarity matrix. Secondly, we propose a novel hypergraph-enhanced module that produces embedding vectors for input data by hypergraph convolution and an attention mechanism, capturing important high-order semantics. Our approach is evaluated on the NUS-WIDE and MIRFlickr datasets, and yields state-of-the-art performance for unsupervised cross-modal retrieval.
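The abstract does not describe how hash codes are consumed at retrieval time. As a generic illustration of cross-modal hashing retrieval (not the authors' hypergraph model), the following minimal sketch binarizes real-valued embeddings with a sign threshold and ranks database items by Hamming distance. The function names `binarize`, `hamming`, and `retrieve`, and the toy embeddings, are assumptions made for this example.

```python
from typing import List, Sequence


def binarize(embedding: Sequence[float]) -> List[int]:
    """Turn a real-valued embedding into a binary code (1 if value >= 0, else 0)."""
    return [1 if value >= 0 else 0 for value in embedding]


def hamming(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def retrieve(query_code: Sequence[int],
             database_codes: Sequence[Sequence[int]],
             top_k: int = 2) -> List[int]:
    """Return indices of the top_k database codes closest to the query."""
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:top_k]


# Hypothetical text-query embedding and image-database embeddings.
text_query = binarize([0.9, -0.2, 0.4, -0.7])
image_codes = [binarize(e) for e in ([0.3, -0.1, 0.8, -0.5],
                                     [-0.6, 0.2, -0.4, 0.9],
                                     [0.7, -0.3, -0.2, -0.1])]
nearest = retrieve(text_query, image_codes, top_k=2)
```

Because both modalities are mapped into the same binary space, a text query can be compared against image codes using only bit comparisons, which is what makes hashing attractive for large-scale retrieval.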

Reinforcement Graph Clustering with Unknown Cluster Number

  • Yue Liu
  • Ke Liang
  • Jun Xia
  • Xihong Yang
  • Sihang Zhou
  • Meng Liu
  • Xinwang Liu
  • Stan Z. Li

Deep graph clustering, which aims to group nodes into disjoint clusters by neural networks in an unsupervised manner, has attracted great attention in recent years. Although performance has improved considerably, existing methods rely heavily on an accurately predefined cluster number, which is not always available in real-world scenarios. To enable deep graph clustering algorithms to work without the guidance of a predefined cluster number, we propose a new deep graph clustering method termed Reinforcement Graph Clustering (RGC). In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework by the reinforcement learning mechanism. Concretely, the discriminative node representations are first learned with the contrastive pretext task. Then, to capture the clustering state accurately with both local and global information in the graph, both node and cluster states are considered. Subsequently, at each state, the qualities of different cluster numbers are evaluated by the quality network, and the greedy action is executed to determine the cluster number. In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method. The source code of RGC is shared at and a collection (papers, code, and datasets) of deep graph clustering is shared at on GitHub.

Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset

  • Jingyu Wu
  • Shi Chen
  • Shuyu Gan
  • Weijun Li
  • Changyuan Yang
  • Lingyun Sun

Co-speech gesture generation is essential for multimodal chatbots and agents. Previous research extensively studies the relationship between text, audio, and gesture. Meanwhile, to enhance cross-culture communication, culture-specific gestures are crucial for chatbots to learn cultural differences and incorporate cultural cues. However, culture-specific gesture generation faces two challenges: lack of large-scale, high-quality gesture datasets that include diverse cultural groups, and lack of generalization across different cultures. Therefore, in this paper, we first introduce a Multiple Culture Gesture Dataset (MCGD), the largest freely available gesture dataset to date. It consists of ten different cultures, over 200 speakers, and 10,000 segmented sequences. We further propose a Cultural Self-adaptive Gesture Generation Network (CSGN) that takes multimodal relationships into consideration while generating gestures using a cascade architecture and learnable dynamic weight. The CSGN adaptively generates gestures with different cultural characteristics without the need to retrain a new network. It extracts cultural features from the multimodal inputs or a cultural style embedding space with a designated culture. We broadly evaluate our method across four large-scale benchmark datasets. Empirical results show that our method achieves multiple cultural gesture generation and improves comprehensiveness of multimodal inputs. Our method improves the state-of-the-art average FGD from 53.7 to 48.0 and culture deception rate (CDR) from 33.63% to 39.87%.

DPNET: Dynamic Poly-attention Network for Trustworthy Multi-modal Classification

  • Xin Zou
  • Chang Tang
  • Xiao Zheng
  • Zhenglai Li
  • Xiao He
  • Shan An
  • Xinwang Liu

With advances in sensing technology, multi-modal data collected from different sources are increasingly available. Multi-modal classification aims to integrate complementary information from multi-modal data to improve model classification performance. However, existing multi-modal classification methods are basically weak in integrating global structural information and providing trustworthy multi-modal fusion, especially in safety-sensitive practical applications (e.g., medical diagnosis). In this paper, we propose a novel Dynamic Poly-attention Network (DPNET) for trustworthy multi-modal classification. Specifically, DPNET has four merits: (i) To capture the intrinsic modality-specific structural information, we design a structure-aware feature aggregation module to learn the corresponding structure-preserved global compact feature representation. (ii) A transparent fusion strategy based on the modality confidence estimation strategy is induced to track information variation within different modalities for dynamical fusion. (iii) To facilitate more effective and efficient multi-modal fusion, we introduce a cross-modal low-rank fusion module to reduce the complexity of tensor-based fusion and activate the implication of different rank-wise features via a rank attention mechanism. (iv) A label confidence estimation module is devised to drive the network to generate more credible confidence. An intra-class attention loss is introduced to supervise the network training. Extensive experiments on four real-world multi-modal biomedical datasets demonstrate that the proposed method achieves competitive performance compared to other state-of-the-art ones.

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

  • Zhiahao Zhang
  • Yiwei Chen
  • Weizhan Zhang
  • Caixia Yan
  • Qinghua Zheng
  • Qi Wang
  • Wangdu Chen

Viewport prediction is a crucial aspect of tile-based 360° video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the process of information construction and fusion between different modality inputs, leading to the error accumulation problem. In this paper, we propose a tile classification based viewport prediction method with Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract the long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two categories, user-interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Compared with predicting head trajectories, choosing the future viewport based on the tiles' binary classification results exhibits better robustness and interpretability. To evaluate our proposed MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, and also presents competitive computational efficiency.
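The final selection step described above, choosing the region that contains the most user-interested tiles, can be sketched as a sliding-window count over a binary tile map. This is a minimal sketch under stated assumptions, not the MFTR implementation: the function name `select_viewport`, the rectangular viewport measured in whole tiles, and the horizontal wrap-around (to reflect the 360° panorama) are all choices made for this example.

```python
from typing import List, Tuple


def select_viewport(tile_labels: List[List[int]],
                    vp_rows: int,
                    vp_cols: int) -> Tuple[Tuple[int, int], int]:
    """Pick the viewport origin (top row, left column) covering the most
    tiles labeled 1 (user-interested).

    Assumption: columns wrap around horizontally, rows do not.
    """
    rows, cols = len(tile_labels), len(tile_labels[0])
    best_count, best_origin = -1, (0, 0)
    for top in range(rows - vp_rows + 1):
        for left in range(cols):
            # Count interested tiles inside the candidate window.
            count = sum(tile_labels[top + dr][(left + dc) % cols]
                        for dr in range(vp_rows)
                        for dc in range(vp_cols))
            if count > best_count:
                best_count, best_origin = count, (top, left)
    return best_origin, best_count


# Hypothetical classifier output for a 3 x 4 tile grid.
labels = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
origin, covered = select_viewport(labels, vp_rows=2, vp_cols=2)
```

In this toy grid the interested tiles sit at the left and right edges, so only a window that wraps across the seam covers all four of them.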

Semantic-based Selection, Synthesis, and Supervision for Few-shot Learning

  • Jinda Lu
  • Shuo Wang
  • Xinyu Zhang
  • Yanbin Hao
  • Xiangnan He

Few-shot learning (FSL) is designed to explore the distribution of novel categories from a few samples. It is a challenging task since the classifier is usually susceptible to over-fitting when learning from limited training samples. To alleviate this phenomenon, a common solution is to obtain more training samples using a generic generation strategy in visual space. However, this solution has limitations, because a feature extractor trained on base samples (known knowledge) tends to focus on the textures and structures of the objects it learns, which is inadequate for describing novel samples. To solve these issues, we introduce semantics and propose a Semantic-based Selection, Synthesis, and Supervision (4S) method, where semantics provide more diverse and informative supervision for recognizing novel objects. Specifically, we first utilize semantic knowledge to explore the correlation of categories in the textual space and select base categories related to the given novel category. This process can improve the efficiency of subsequent operations (synthesis and supervision). Then, we analyze the semantic knowledge to hallucinate the training samples by selectively synthesizing the contents from base and support samples. This operation not only increases the number of training samples but also takes advantage of the contents of the base categories to enhance the description of support samples. Finally, we also employ semantic knowledge as both soft and hard supervision to enrich the supervision for the fine-tuning procedure. Empirical studies on four FSL benchmarks demonstrate the effectiveness of 4S.

Exploring Universal Principles for Graph Contrastive Learning: A Statistical Perspective

  • Jinyong Wen
  • Shiming Xiang
  • Chunhong Pan

Although recent advances have driven rapid progress in graph contrastive learning, research on universal principles for model design and desirable properties of latent representations is still inadequate. From a statistical perspective, this paper proposes two principles for guidance and constructs a general self-supervised framework for negative-free graph contrastive learning. Reformulating data augmentation as a mixture process, the first one, termed the consistency principle, lays stress on exploring and mapping cross-view common information to consistent and essence-revealing representations. For the purpose of instantiation, four statistical indicators are employed to estimate and maximize the correlation between representations from various views, whose accordant variation trend during training implies the extraction of common content. Because the consistency principle alone is insufficient, suffering from degenerated and coupled solutions, a decorrelation principle is put forward to encourage diverse and informative representations. Accordingly, two specific strategies, operating in representation space and eigen spectral space, respectively, are proposed to decouple various representation channels. Under the two principles, various combinations of concrete implementations derive a family of methods. Comparison experiments with current state-of-the-art methods demonstrate the effectiveness and sufficiency of the two principles for high-quality graph representations. Furthermore, visual studies reveal how certain principles affect learned representations.

Text-to-Audio Generation using Instruction Guided Latent Diffusion Model

  • Deepanway Ghosal
  • Navonil Majumder
  • Ambuj Mehrish
  • Soujanya Poria

The immense scale of recent large language models (LLMs) allows many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as T5. Consequently, our latent diffusion model (LDM)-based approach (Tango) outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.
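The abstract does not give the mixing formula. One plausible form of pressure level-based mixing, sketched here purely as an assumption, weights each clip inversely to its loudness so that neither source drowns out the other, then rescales the sum. In this sketch the pressure level is approximated by the RMS level in decibels; the function names and the toy signals are hypothetical, and equal-length inputs are assumed.

```python
import math
from typing import List, Sequence


def pressure_level_db(signal: Sequence[float], eps: float = 1e-12) -> float:
    """Approximate the pressure level as the RMS level in dB (a simplification)."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    return 20.0 * math.log10(max(rms, eps))


def mixing_weight(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Weight p for x1 (and 1 - p for x2): the louder clip gets the smaller weight."""
    g1, g2 = pressure_level_db(x1), pressure_level_db(x2)
    return 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0))


def mix(x1: Sequence[float], x2: Sequence[float]) -> List[float]:
    """Mix two equal-length signals with loudness-aware weights, then rescale."""
    p = mixing_weight(x1, x2)
    norm = math.sqrt(p * p + (1.0 - p) * (1.0 - p))
    return [(p * a + (1.0 - p) * b) / norm for a, b in zip(x1, x2)]


# Toy signals: `loud` has twice the RMS amplitude of `quiet`.
loud = [2.0, -2.0, 2.0, -2.0]
quiet = [1.0, 1.0, -1.0, -1.0]
p = mixing_weight(loud, quiet)
mixed = mix(loud, quiet)
```

Here the louder clip receives weight 1/3 and the quieter one 2/3, so both contribute the same amplitude to the mixture, in contrast to a random mix where one source may dominate.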

DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking

  • Shangyu Xing
  • Fei Zhao
  • Zhen Wu
  • Chunhui Li
  • Jianbing Zhang
  • Xinyu Dai

Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base. Recent methods for MEL adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and entity respectively, and then compute the similarity between them to predict the correct entity. However, these methods still suffer from two limitations: first, as they fuse the features of text and image before matching, they cannot fully exploit the fine-grained alignment relations between the mention and entity. Second, their alignment is static, leading to low performance when dealing with complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach. Our code and datasets are publicly available.

MVCIR-net: Multi-view Clustering Information Reinforcement Network

  • Shaokui Gu
  • Xu Yuan
  • Liang Zhao
  • Zhenjiao Liu
  • Yan Hu
  • Zhikui Chen

Multi-view clustering (MVC) integrates information from different views to improve clustering performance compared to single-view clustering. However, the raw multi-view data in the feature space often contains irrelevant information to the clustering task, which is difficult to separate using existing methods. This irrelevant information is processed equally with clustering information, negatively impacting the final clustering performance. In this paper, we propose a new framework for multi-view clustering information reinforcement network (MVCIR-net) to alleviate these problems. Our method gives practical clustering meaning to the clustering distribution layer by contrastive learning. Then, the trusted neighbor instances distribution of the normalized graph is debias aggregated to form the clustering information propensity distribution, and the clustering information distribution is made to fit this distribution. In addition, the coupling degree of the clustering information distribution in different views on the same sample should be enhanced. Through the aforementioned strategies, the raw data is fuzzy mapped into clustering information, and the network's ability to recognize clustering information is strengthened. Finally, the fuzzy mapping data is input into the network and reconstructed to evaluate the quality of the extracted clustering information. Extensive experiments on public multi-view datasets show that MVCIR-net achieves superior clustering effectiveness and the ability to identify clustering information.

Preserving Local and Global Information: An Effective Metric-based Subspace Clustering

  • Yixi Liu
  • Yuze Tan
  • Hongjie Wu
  • Shudong Huang
  • Yazhou Ren
  • Jiancheng Lv

Subspace clustering, which recovers the subspace representation in the form of an affinity graph, has drawn considerable attention due to its effectiveness in various clustering tasks. However, existing subspace clustering methods are usually fed with raw data, which may lead to a suboptimal result since it is difficult to directly and accurately depict the inherent relation between data points. In this paper, we propose a novel subspace clustering method by holistically utilizing the pairwise similarity and graph geometric structure. Our model first constructs an initial subspace representation by means of self-expression, which is able to depict the global structure of data. Then, we use an effective metric to recover an intrinsic matrix with pairwise similarity based on the obtained representation, which further preserves the local structure. Besides, we propose to facilitate the downstream subspace learning task by searching for a smooth representation of the original data, which is obtained by applying a low-pass filter to retain the graph geometric features. By leveraging the subtasks of learning the smooth representation, performing the subspace learning, and recovering the intrinsic similarity matrix in a unified learning framework, each subtask can be alternately boosted. Experiments on several benchmark data sets have been conducted to verify the proposed method.
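The abstract does not specify which low-pass filter is used. A common choice in graph signal processing, assumed here for illustration only, is the filter H = I - L/2, where L is the symmetric normalized graph Laplacian; applying H repeatedly averages each node's features with those of its neighbors. The function name `low_pass_filter` and the toy graph are hypothetical.

```python
import math
from typing import List


def low_pass_filter(adjacency: List[List[float]],
                    features: List[List[float]],
                    order: int = 1) -> List[List[float]]:
    """Smooth node features with H = I - L/2 applied `order` times, where
    L = I - D^{-1/2} A D^{-1/2}, so H = (I + D^{-1/2} A D^{-1/2}) / 2."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scale = math.sqrt(degree[i] * degree[j])
            normalized = adjacency[i][j] / scale if scale > 0 else 0.0
            h[i][j] = 0.5 * ((1.0 if i == j else 0.0) + normalized)
    smoothed = [row[:] for row in features]
    dim = len(features[0])
    for _ in range(order):
        # One filtering pass: multiply the feature matrix by H.
        smoothed = [[sum(h[i][k] * smoothed[k][d] for k in range(n))
                     for d in range(dim)]
                    for i in range(n)]
    return smoothed


# Two connected nodes with very different one-dimensional features.
adjacency = [[0.0, 1.0],
             [1.0, 0.0]]
features = [[0.0], [2.0]]
smooth = low_pass_filter(adjacency, features, order=1)
```

After one pass, the two connected nodes share the same feature value, which is the kind of graph-aware smoothing that makes neighboring points easier to group.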

FeaCo: Reaching Robust Feature-Level Consensus in Noisy Pose Conditions

  • Jiaming Gu
  • Jingyu Zhang
  • Muyang Zhang
  • Weiliang Meng
  • Shibiao Xu
  • Jiguang Zhang
  • Xiaopeng Zhang

Collaborative perception offers a promising solution to overcome challenges such as occlusion and long-range data processing. However, limited sensor accuracy leads to noisy poses that misalign observations among vehicles. To address this problem, we propose FeaCo, which achieves robust Feature-level Consensus among collaborating agents in noisy pose conditions without additional training. We design an efficient Pose-error Rectification Module (PRM) to align derived feature maps from different vehicles, reducing the adverse effect of noisy pose and bandwidth requirements. We also provide an effective multi-scale Cross-level Attention Module (CAM) to enhance information aggregation and interaction between various scales. FeaCo outperforms all other localization rectification methods, as validated on both the collaborative perception simulation dataset OPV2V and the real-world dataset V2V4Real, reducing heading error and enhancing localization accuracy across various error levels. Our code is available at:

Cross-Lingual Transfer of Large Language Model by Visually-Derived Supervision Toward Low-Resource Languages

  • Masayasu Muraoka
  • Bishwaranjan Bhattacharjee
  • Michele Merler
  • Graeme Blackwood
  • Yulong Li
  • Yang Zhao

Recent progress in vision and language research has shown that visual supervision improves the performance of large language models (LLMs) in various natural language processing (NLP) tasks. In particular, the Vokenization approach [65] initiated a new way of incorporating visual information into LLM training, demonstrating the potential of visual supervision for NLP tasks in a monolingual (i.e., English) setting. Given the effectiveness of visual information in human communication among people who speak different languages, we tackle an ambitious question in this paper: can we expect visual supervision to contribute to cross-lingual transfer learning from a high-resource language to low-resource languages in NLP tasks? To study this hypothesis, we build a cross-lingual Vokenization model and train a cross-lingual LLM on three languages, English, Urdu, and Swahili, of which the last two are considered low-resource languages. The experimental results demonstrate that our visually-supervised cross-lingual transfer learning method significantly improves the LLM performance in multiple cross-lingual NLP tasks such as XNLI, NER, and TyDiQA for low-resource languages. We also qualitatively and quantitatively demonstrate that the benefit of our approach increases as the linguistic distance between low- and high-resource languages grows larger.

ALEX: Towards Effective Graph Transfer Learning with Noisy Labels

  • Jingyang Yuan
  • Xiao Luo
  • Yifang Qin
  • Zhengyang Mao
  • Wei Ju
  • Ming Zhang

Graph Neural Networks (GNNs) have garnered considerable interest due to their exceptional performance in a wide range of graph machine learning tasks. Nevertheless, the majority of GNN-based approaches have been examined using well-annotated benchmark datasets, leading to suboptimal performance in real-world graph learning scenarios. To bridge this gap, the present paper investigates the problem of graph transfer learning in the presence of label noise, which transfers knowledge from a noisy source graph to an unlabeled target graph. We introduce a novel technique termed Balance Alignment and Information-aware Examination (ALEX) to address this challenge. ALEX first employs singular value decomposition to generate different views with crucial structural semantics, which help provide robust node representations using graph contrastive learning. To mitigate both label shift and domain shift, we estimate a prior distribution to build subgraphs with balanced label distributions. Building on this foundation, an adversarial domain discriminator is incorporated for the implicit domain alignment of complex multi-modal distributions. Furthermore, we project node representations into a different space, optimizing the mutual information between the projected features and labels. Subsequently, the inconsistency of similarity structures is evaluated to identify noisy samples with potential overfitting. Comprehensive experiments on various benchmark datasets substantiate the outstanding superiority of the proposed ALEX in different settings.

Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition

  • Chenwei Zhang
  • Yuxuan Hu
  • Min Yang
  • Chengming Li
  • Xiping Hu

Action recognition research has gained significant attention with two dominant unimodal approaches: skeleton-based and RGB video-based. While the former is known for its robustness in complex backgrounds, the latter provides rich environmental information useful for context-based analysis. However, the fusion of these two modalities remains an open challenge. In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition by constructing two modules: Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and Selective Kernel Temporal Convolution (SK-TC). The RE-DMT captures global spatial features, while the dynamic mask strategy and reranking strategy reduce redundancy. The SK-TC captures both long-term and short-term temporal features and enables adaptive fusion. Furthermore, in two phases, we propose a Homogeneous-Heterogeneous Multimodal Network (HHMNet) for multi-modal action recognition. In the first phase, contrastive learning is employed to achieve implicit semantic fusion within the four homogeneous skeletal modalities (joint, bone, etc.). In the second phase, the fusion of heterogeneous modalities (skeleton & RGB video) is carried out at three levels: model, feature, and decision. At the model level, the powerful skeleton-based model from the previous phase provides explicit attention guidance to the RGB video-based model. At the feature level, multi-part contrastive learning enables semantic distillation between heterogeneous modalities. At the decision level, ensemble learning combines outputs for final action recognition. We evaluate our proposed ST&ST guided HHMNet on NTU RGB+D 60 & 120 and NW-UCLA datasets and demonstrate that it achieves state-of-the-art performance in both skeleton-based and multi-modal action recognition tasks.

Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

  • Zhong Chen
  • Zhizhong Zhang
  • Xin Tan
  • Yanyun Qu
  • Yuan Xie

Large-scale Vision-Language Pre-training (VLP) models, e.g., CLIP, have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions afford us greater semantic monitoring insights while not requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) by taking full advantage of the visual-text representation ability of CLIP. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that allow for subsequent unsupervised training. This description complements the rigid pseudo-labels and provides an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, in which we first find the correlated cross-modal prototypes with the Hungarian matching method and then swap the prototype pairs in the memory. Thus, typical contrastive learning can associate the cross-modal information without any modification. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we reach 54.0% Rank-1 accuracy, an improvement of over 9% over state-of-the-art approaches. Code is available at
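The matching-then-swapping step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`match_prototypes`, `swap_memory`), the use of cosine similarity, and the exact swap rule are assumptions made for illustration. For brevity, the optimal one-to-one assignment is found by exhaustive search over permutations, which returns the same assignment a Hungarian solver would for small inputs but does not scale beyond a handful of prototypes.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_prototypes(visible, infrared):
    """Return perm such that visible[i] is matched to infrared[perm[i]].

    The assignment maximizes total cosine similarity. Exhaustive search
    stands in for the Hungarian method here (an assumption for brevity).
    """
    n = len(visible)
    if n != len(infrared):
        raise ValueError("both modalities need the same number of prototypes")
    best_perm, best_score = None, -math.inf
    for perm in permutations(range(n)):
        score = sum(cosine(visible[i], infrared[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)


def swap_memory(visible, infrared, perm):
    """Swap each matched prototype pair between the two memories.

    Hypothetical swap rule: slot i of the visible memory receives its
    matched infrared prototype, and the matched infrared slot receives
    the visible prototype.
    """
    new_visible = [infrared[j] for j in perm]
    new_infrared = list(infrared)
    for i, j in enumerate(perm):
        new_infrared[j] = visible[i]
    return new_visible, new_infrared


if __name__ == "__main__":
    vis = [[1.0, 0.0], [0.0, 1.0]]
    ir = [[0.1, 0.9], [0.9, 0.1]]
    perm = match_prototypes(vis, ir)
    print(perm, swap_memory(vis, ir, perm))
```

After the swap, an unchanged contrastive loss that pulls a visible feature toward "its" memory slot is in effect pulling it toward the matched infrared prototype, which is one way to read the abstract's claim that no change to the contrastive objective is needed.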

DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

  • Haowen Wang
  • Zhipeng Fan
  • Zhen Zhao
  • Zhengping Che
  • Zhiyuan Xu
  • Dong Liu
  • Feifei Feng
  • Yakun Huang
  • Xiuquan Qiao
  • Jian Tang

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but the static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose the DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the fields to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.

Text-Only Training for Visual Storytelling

  • Yuechen Wang
  • Wengang Zhou
  • Zhenbo Lu
  • Houqiang Li

Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.

Saliency Prototype for RGB-D and RGB-T Salient Object Detection

  • Zihao Zhang
  • Jie Wang
  • Yahong Han

Most of the existing bi-modal (RGB-D or RGB-T) salient object detection methods attempt to integrate multimodality information through various fusion strategies. However, existing methods lack a clear definition of salient regions before feature fusion, which results in poor model robustness. To tackle this problem, we propose a novel prototype, the saliency prototype, which captures common characteristic information among salient objects. A prototype contains inherent characteristics information of multiple salient objects, which can be used for feature enhancement of various salient objects. By utilizing the saliency prototype, we provide a clearer definition of salient regions and enable the model to focus on these regions before feature fusion, avoiding the influence of complex backgrounds during the feature fusion stage. Additionally, we utilize the saliency prototypes to address the quality issue of auxiliary modality. Firstly, we apply the saliency prototypes obtained by the primary modality to perform semantic enhancement of the auxiliary modality. Secondly, we dynamically allocate weights for the auxiliary modality during the feature fusion stage in proportion to its quality. Thus, we develop a new bi-modal salient detection architecture Saliency Prototype Network (SPNet), which can be used for both RGB-D and RGB-T SOD. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate the effectiveness of the proposed approach against the state-of-the-art. Our code is available at

PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation

  • Zhu Liu
  • Jinyuan Liu
  • Benzhuang Zhang
  • Long Ma
  • Xin Fan
  • Risheng Liu

Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods show remarkable performance but suffer from an inherent vulnerability to adversarial attacks, which causes a significant decrease in accuracy. In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. We first conduct systematic analyses of the components of image fusion, investigating their correlation with segmentation robustness under adversarial perturbations. Based on these analyses, we propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. We also propose an adaptive learning strategy to improve the parameter robustness of image fusion, which can learn effective feature extraction under diverse adversarial perturbations. Thus, the goals of image fusion (i.e., extracting complementary features from source modalities and defending against attacks) can be realized from the perspectives of both architecture and learning strategy. Extensive experimental results demonstrate that our scheme substantially enhances robustness, with gains of 15.3% mIoU for segmentation in the adversarial scene compared with advanced competitors. The source code is available at

Cross-Modal Graph Attention Network for Entity Alignment

  • Baogui Xu
  • Chengjin Xu
  • Bing Su

The increasing popularity of multi-modal knowledge graphs (MMKGs) has led to a need for efficient entity alignment techniques that can exploit multi-modal information to integrate knowledge from different sources. GNN-based multi-modal entity alignment (MMEA) methods have achieved significant progress in entity alignment (EA). However, these methods rely only on Graph Neural Networks (GNNs) to encode structural information while ignoring visual and semantic modalities, which may lead to incomplete representations. How to integrate visual and semantic information into GNN-based EA methods thus remains unexplored. In light of our insight that using the message-passing mechanism of GNNs to integrate multi-modal information is essential for fully exploiting their graph representation capability, we propose a novel Cross-modal Graph attention network for Entity Alignment (XGEA) that enables visual knowledge to interact with other views of the entity, including structural and literal information. We leverage the information from one modality as complementary relation information to compute the attention of another modality in the graph attention layers, enabling the learning of entity embeddings that integrate multiple modalities. Moreover, the quantity of labeled data plays a crucial role in model performance, yet obtaining sufficient training data is expensive. To mitigate this issue, we use visual and semantic information to generate pseudo-pairs and propose a soft pseudo-labeling method for entity alignment that assigns weights to the augmented training data to balance its quantity and quality. Extensive experiments show that XGEA consistently achieves superior performance over the state-of-the-art MMEA baselines.

Intra- and Inter-Modal Curriculum for Multimodal Learning

  • Yuwei Zhou
  • Xin Wang
  • Hong Chen
  • Xuguang Duan
  • Wenwu Zhu

Multimodal learning has been widely studied and applied due to its improvement over previous unimodal tasks and its effectiveness on emerging multimodal challenges. However, it has been reported that modal encoders are under-optimized in multimodal learning in contrast to unimodal learning, especially when some modalities are dominant over others. Existing solutions to this problem suffer from two limitations: i) they merely focus on inter-modal balance, failing to consider the influence of intra-modal data on each modality; ii) their implementations heavily rely on unimodal performances or losses, thus being suboptimal for the tasks requiring modal interactions (e.g., visual question answering). To tackle these limitations, we propose I2MCL, a generic Intra- and Inter-Modal Curriculum Learning framework which simultaneously considers both data difficulty and modality balance for multimodal learning. In the intra-modal curriculum, we adopt a pretrained teacher model to obtain knowledge distillation loss as the difficulty measurer, which determines the data weights within the corresponding modality. In the inter-modal curriculum, we utilize a Pareto optimization strategy to measure and compare the gradients from distillation loss and task loss across modalities, capable of determining whether a modality should learn from the task or its teacher. Empirical experiments on various tasks including multimodal classification, visual question answering and visual entailment demonstrate that our proposed I2MCL is able to tackle the under-optimized modality problem and bring consistent improvement to multimodal learning.

Graph based Spatial-temporal Fusion for Multi-modal Person Re-identification

  • Yaobin Zhang
  • Jianming Lv
  • Chen Liu
  • Hongmin Cai

As a challenging task, unsupervised person re-identification (Re-ID) aims to optimize the pedestrian matching model based on the unlabeled image frames from surveillance videos. Recently, fusion with the spatio-temporal clues of pedestrians has been proven effective in improving classification performance. However, most of these methods adopt hard combination approaches that multiply the visual scores with the spatio-temporal scores, which are sensitive to the noise caused by imprecise estimation of the spatio-temporal patterns in unlabeled datasets and limit the advantage of the fusion model. In this paper, we propose a Graph based Spatio-Temporal Fusion model for high-performance multi-modal person Re-ID, namely G-Fusion, to mitigate the impact of noise. In particular, we construct a graph of pedestrian images by selecting neighboring nodes based on the visual information and the transition time between cameras. Then we use a randomly initialized two-layer GraphSAGE model to obtain the multi-modal affinity matrix between images, and deploy distillation learning to optimize the visual model by learning the affinity between the nodes. Finally, a graph-based multi-modal re-ranking method is deployed to make the decision in the testing phase for precise person Re-ID. Comprehensive experiments are conducted on two large-scale Re-ID datasets, and the results show that our method achieves a significant performance improvement when combined with SOTA unsupervised person Re-ID methods. Specifically, the mAP scores reach 92.2% and 80.4% on the Market-1501 and MSMT17 datasets, respectively.
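The "hard combination" baseline that this abstract criticizes, multiplying a visual score by a spatio-temporal score, can be shown in a few lines. The sketch below illustrates that baseline only, not G-Fusion itself; the function names and the toy scores are assumptions chosen to show how one poorly estimated spatio-temporal score can overturn a confident visual match.

```python
def hard_fusion(visual_scores, st_scores):
    """Fuse per-candidate scores by element-wise multiplication."""
    if len(visual_scores) != len(st_scores):
        raise ValueError("score lists must have the same length")
    return [v * s for v, s in zip(visual_scores, st_scores)]


def rank(scores):
    """Return candidate indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical gallery of three candidates; candidate 0 is the true match.
    visual = [0.9, 0.6, 0.3]
    # Spatio-temporal scores estimated without labels; the estimate for
    # candidate 0 is assumed to be badly underestimated.
    spatio_temporal = [0.05, 0.5, 0.4]

    print("visual only:", rank(visual))
    print("hard fusion:", rank(hard_fusion(visual, spatio_temporal)))
```

Because the product is driven to near zero by any near-zero factor, a single noisy spatio-temporal estimate is enough to push the visually best candidate to the bottom of the ranking, which is the sensitivity the abstract attributes to hard combination.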

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

  • Yuanbin Wang
  • Shaofei Huang
  • Yulu Gao
  • Zhen Wang
  • Rui Wang
  • Kehua Sheng
  • Bo Zhang
  • Si Liu

Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set, which limits their application in real-world scenarios due to the lack of generalization ability. Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in zero-shot 2D vision tasks, but still cannot be applied to 3D semantic segmentation directly. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline that transfers the visual-linguistic knowledge implied in CLIP to a point cloud encoder at both the feature and output levels. Both feature-level and output-level alignments are conducted between 2D and 3D encoders for effective knowledge transfer. Concretely, a Multi-granularity Cross-modal Feature Alignment (MCFA) module is proposed to align 2D and 3D features from global semantic and local position perspectives for feature-level alignment. For the output level, per-pixel pseudo labels of unseen classes are extracted using the pre-trained CLIP model as supervision for the 3D segmentation model to mimic the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular benchmarks of point cloud segmentation. Our method significantly outperforms previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on SemanticKITTI and 31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.

Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

  • Zhaojian Li
  • Bin Zhao
  • Yuan Yuan

Audiovisual self-supervised representation learning has made significant strides in various audiovisual tasks. Existing methods mostly focus on single representation modeling between audio and visual modalities, ignoring the complex correspondence between them, resulting in the inability to execute cross-modal understanding in a more natural audiovisual scene. Several biological studies have shown that human learning is influenced by multi-layered synchronization of perception. To this end, inspired by biology, we argue to exploit the naturally existing relationships in audio and visual modalities to learn audiovisual representations under multilayer perceptual integration. Firstly, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Secondly, we propose a self-supervised audiovisual multi-representation learning approach, which simultaneously learns the perceptual relationship between visual and audio modalities at semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, an audiovisual object detection module is proposed, which detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss to learn a subspace-orthogonal representation space that makes representation relations more discriminative. Finally, experimental results demonstrate that collectively understanding the semantic, temporal, and spatial correspondence between audiovisual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.

DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

  • Junyin Wang
  • Chenghu Du
  • Hui Li
  • Shengwu Xiong

Approaches that transform surround-view camera images into 3D feature space via depth estimation and fuse them with point cloud features are highly regarded. The transformation of 2D features into 3D feature space by means of predefined sampling points and depth distribution happens throughout the scene, and this process generates a large number of redundant features. In addition, multimodal feature fusion unified in 3D space often happens in the step immediately before the downstream task, ignoring the interactive fusion between different scales. To this end, we design a new framework that gives 3D geometric perception information to images and unifies them into voxel space to accomplish multi-scale interactive fusion, and we mitigate feature alignment between modal features by geometric relationships between voxel features. The method has two main designs. First, a Segmentation-guided Image View Transformation module is used to accurately transform the pixel region containing the object into a 3D pseudo-point voxel space with the help of a depth distribution. This allows subsequent feature fusion to be performed in a unified voxel feature space. Second, a Voxel-centric Consistent Fusion module is used to alleviate the errors caused by depth estimation, as well as to achieve better feature fusion between unified modalities. Through extensive experiments on the KITTI and nuScenes datasets, we validate the effectiveness of our camera-LiDAR fusion method. Our proposed approach shows competitive performance on both datasets and outperforms state-of-the-art methods in certain classes of 3D object detection benchmarks.

Automatic Network Architecture Search for RGB-D Semantic Segmentation

  • Wenna Wang
  • Tao Zhuo
  • Xiuwei Zhang
  • Mingjun Sun
  • Hanlin Yin
  • Yinghui Xing
  • Yanning Zhang

Recent RGB-D semantic segmentation networks are usually manually designed. However, due to limited human effort and time costs, their performance might be inferior in complex scenarios. To address this issue, we propose the first Neural Architecture Search (NAS) method that designs the network automatically. Specifically, the target network consists of an encoder and a decoder. The encoder is designed with two independent branches, which specialize in extracting features from RGB and depth images, respectively. The decoder fuses the features and generates the final segmentation result. For automatic network design, we design a grid-like network-level search space combined with a hierarchical cell-level search space. By further developing an effective gradient-based search strategy, the network structure with hierarchical cell architectures is discovered. Extensive results on two datasets show that the proposed method outperforms the state-of-the-art approaches, achieving an mIoU score of 55.1% on the NYU-Depth v2 dataset and 50.3% on the SUN-RGBD dataset.

Attentive Alignment Network for Multispectral Pedestrian Detection

  • Nuo Chen
  • Jin Xie
  • Jing Nie
  • Jiale Cao
  • Zhuang Shao
  • Yanwei Pang

Multispectral pedestrian detection is of great importance in various around-the-clock applications, e.g., self-driving and video surveillance. Fusing the features from RGB images and thermal infrared (TIR) images to explore the complementary information between different modalities is one of the most effective ways to improve multispectral pedestrian detection performance. However, the misalignment between different modalities in spatial dimension and modality reliability would introduce harmful information during feature fusion, limiting the performance of multispectral pedestrian detection. To address the above issues, we propose an attentive alignment network, consisting of an attentive position alignment (APA) module and an attentive modality alignment (AMA) module. Our APA module emphasizes pedestrian regions while aligning the pedestrian regions between different modalities. Our AMA module utilizes a channel-wise attention mechanism with illumination guidance to eliminate the imbalance between different modalities. The experiments are conducted on two widely used multispectral detection datasets, KAIST and CVC-14. Our approach surpasses the current state-of-the-art performance on both datasets.

FedAA: Using Non-sensitive Modalities to Improve Federated Learning while Preserving Image Privacy

  • Dong Chen
  • Siliang Tang
  • Zijin Shen
  • Guoming Wang
  • Jun Xiao
  • Yueting Zhuang
  • Carl Yang

Federated learning aims to train a better global model without sharing the sensitive training samples (usually images) of local clients. Since the sample distributions in local clients tend to be different from each other (i.e., non-IID), one of the major challenges for federated learning is to alleviate model degradation when aggregating local models. The degradation can be attributed to the weight divergence that quantifies the difference of local models from different training processes. Furthermore, non-IID data also results in feature space heterogeneity during local training, making neurons of local models in the same location have different functions and further exacerbating weight divergence. In this paper, we demonstrate that the problem can be solved by sharing information from the non-sensitive modality (e.g., metadata, non-sensitive descriptions, etc.) while keeping the sensitive information of images protected. In particular, we propose Federated Learning with Adversarial Example and Adversarial Identifier (FedAA), which trains adversarial examples based on the shared non-sensitive modality to fine-tune local models before global aggregation. The training of local models is enhanced by client identifiers that discriminate the source of inputs to force different local models to get similar outputs and be more homogeneous during the local training. Experiments show that FedAA significantly outperforms recent non-IID federated learning algorithms while preserving image privacy by sharing information from non-sensitive modalities.

Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning

  • Mengze Li
  • Haoyu Zhang
  • Juncheng Li
  • Zhou Zhao
  • Wenqiao Zhang
  • Shengyu Zhang
  • Shiliang Pu
  • Yueting Zhuang
  • Fei Wu

This paper addresses Unsupervised Domain Adaptation (UDA) for a dense frame prediction task, Video Object Grounding (VOG). This investigation springs from the recognition of the limited generalization capabilities of data-driven approaches when confronted with unseen test scenarios. We set the goal of enhancing the adaptability of the source-dominated model from a labeled domain to the unlabeled target domain through re-training on pseudo-labels (i.e., predicted boxes of language-described objects). Given the potential for source-domain biases in the pseudo-label generation, we decompose the labeling refinement into two cascaded debiasing subroutines: (1) We develop a discarded training strategy to correct the Biased Proposal Selection by filtering out the examples with uncertain proposals selected from the proposal (candidate box) set. The identifier of these uncertain examples is the discordance between the predictions of the source-dominated model and those of a target-domain clustered classifier, which remains free from the source-domain bias. (2) With the refined proposals as a foundation, we measure the Grounding Coordinate Offset based on the semantic distance of the model's prediction across domains, based on which we alleviate source-domain bias in the target model through adversarial learning. To verify the superiority of the proposed method, we collected two UDA-VOG datasets called I2O-VOG and R2M-VOG by manually dividing and combining the well-known VOG datasets. Extensive experiments on them show that our model outperforms SOTA methods by a large margin.

RAHNet: Retrieval Augmented Hybrid Network for Long-tailed Graph Classification

  • Zhengyang Mao
  • Wei Ju
  • Yifang Qin
  • Xiao Luo
  • Ming Zhang

Graph classification is a crucial task in many real-world multimedia applications, where graphs can represent various multimedia data types such as images, videos, and social networks. Previous efforts have applied graph neural networks (GNNs) in balanced situations where the class distribution is balanced. However, real-world data typically exhibit long-tailed class distributions, resulting in a bias towards the head classes when using GNNs and limited generalization ability over the tail classes. Recent approaches mainly focus on re-balancing different classes during model training, which fails to explicitly introduce new knowledge and sacrifices the performance of the head classes. To address these drawbacks, we propose a novel framework called Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature extractor and an unbiased classifier in a decoupled manner. In the feature extractor training stage, we develop a graph retrieval module to search for relevant graphs that directly enrich the intra-class diversity for the tail classes. Moreover, we innovatively optimize a category-centered supervised contrastive loss to obtain discriminative representations, which is more suitable for long-tailed scenarios. In the classifier fine-tuning stage, we balance the classifier weights with two weight regularization techniques, i.e., Max-norm and weight decay. Experiments on various popular benchmarks verify the superiority of the proposed method against state-of-the-art approaches.
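The two classifier regularizers named in this abstract, Max-norm and weight decay, can be sketched for a single class weight vector. This is a generic illustration rather than the RAHNet code: the function names, the plain SGD update, and the hyperparameter values are assumptions. Weight decay adds a shrinkage term to the gradient step, while Max-norm projects the weight vector back onto an L2 ball whenever its norm exceeds a threshold.

```python
import math


def weight_decay_step(weights, grads, lr, decay):
    """One SGD step with L2 weight decay: w <- w - lr * (g + decay * w)."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]


def max_norm_project(weights, max_norm):
    """Rescale the vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


if __name__ == "__main__":
    # Hypothetical head-class weight vector that has grown large.
    head_class_w = [3.0, 4.0]
    grads = [0.0, 0.0]  # zero gradient isolates the effect of the decay term

    stepped = weight_decay_step(head_class_w, grads, lr=0.1, decay=0.5)
    projected = max_norm_project(stepped, max_norm=1.0)
    print(stepped, projected)
```

Applying both to each class's weight vector limits how far head-class weights can outgrow tail-class weights, which is one common motivation for using these regularizers when balancing a long-tailed classifier.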

That's What I Said: Fully-Controllable Talking Face Generation

  • Youngjoon Jang
  • Kyeongha Rho
  • Jongbin Woo
  • Hyeongkeun Lee
  • Jihwan Park
  • Youshin Lim
  • Byeong-Yeol Kim
  • Joon Son Chung

The goal of this paper is to synthesise talking faces with controllable facial motions. To achieve this goal, we propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities. The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information. To disentangle identity and motion, we introduce an orthogonality constraint between the two different latent spaces. From this, our method can generate natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in terms of both visual quality and lip-sync score. To the best of our knowledge, we are the first to develop a talking face generation framework that can accurately manifest full target facial motions including lip, head pose, and eye movements in the generated video without any additional supervision beyond RGB video with audio.

Event-Diffusion: Event-Based Image Reconstruction and Restoration with Diffusion Models

  • Quanmin Liang
  • Xiawu Zheng
  • Kai Huang
  • Yan Zhang
  • Jie Chen
  • Yonghong Tian

Event cameras offer the advantages of low latency, high temporal resolution and HDR compared to conventional cameras. Due to the asynchronous and sparse nature of events, many existing algorithms cannot be directly applied, necessitating the reconstruction of intensity frames. However, existing reconstruction methods often result in artifacts and edge blurring due to noise and event accumulation. In this paper, we argue that the key to event-based image reconstruction is to enhance the edge information of objects and remove the artifacts in the reconstructed images. Edge information is one of the most important features in the event stream, providing information on the shape and contour of objects. Considering the extraordinary capabilities of Denoising Diffusion Probabilistic Models (DDPMs) in image generation, reconstruction, and restoration, we propose a new framework that incorporates them into the reconstruction pipeline to obtain high-quality results, effectively removing artifacts and blur in reconstructed images. Specifically, we first extract edge information from the event stream using the proposed event-based denoising method, which employs the contrast maximization framework to remove noise from the event stream and extract clear object edge information. The edge information is then fed into our diffusion model, which is used to enhance the edges of objects in the reconstructed images, thus improving the restoration effect. Experimental results show that our method achieves significant improvements in the mean squared error (MSE), the structural similarity (SSIM), and the perceptual similarity (LPIPS) metrics, with average improvements of 40%, 15%, and 25%, respectively, compared to previous state-of-the-art models, and has good generalization performance.

Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

  • Han Fang
  • Zhifei Yang
  • Xianghao Zang
  • Chao Ban
  • Zhongjiang He
  • Hao Sun
  • Lanxiang Zhou

Recently, masked video modeling has been widely explored and has improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT) based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantics information. The recovery mechanism is achieved by aligning the masked content with the unmasked visual regions and corresponding textual context, which makes the model capture more text-related details at a patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts to ignore regions with low-informed masks. Furthermore, we design co-learning to incorporate video cues under different masks and learn more aligned representations. Our MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks, including MSR-VTT, LSMDC, ActivityNet, and DiDeMo.

Self-Contrastive Graph Diffusion Network

  • Yixuan Ma
  • Kun Zhan

Augmentation techniques and sampling strategies are crucial in contrastive learning, but in most existing works, augmentation techniques require careful design, and their sampling strategies can only capture a small amount of intrinsic supervision information. Additionally, the existing methods require complex designs to obtain two different representations of the data. To overcome these limitations, we propose a novel framework called the Self-Contrastive Graph Diffusion Network (SCGDN). Our framework consists of two main components: the Attentional Module (AttM) and the Diffusion Module (DiFM). AttM aggregates higher-order structure and feature information to obtain a high-quality embedding, while DiFM balances the state of each node in the graph through Laplacian diffusion learning and allows the cooperative evolution of adjacency and feature information in the graph. Unlike existing methodologies, SCGDN is an augmentation-free approach that avoids "sampling bias" and semantic drift, without the need for pre-training. We sample high-quality positive and negative pairs based on structure and feature information. If two nodes are neighbors, they are considered positive samples of each other. If two disconnected nodes are also unrelated on the kNN graph, they are considered negative samples for each other. The contrastive objective makes use of our proposed sampling strategies, and the redundancy reduction term minimizes redundant information in the embedding while retaining more discriminative information. Within this framework, the graph self-contrastive learning paradigm proves highly effective. The results show that SCGDN consistently outperforms both the contrastive methods and the classical methods. The source code is available at
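
The sampling rule stated in this abstract (neighbors are positives; pairs that are disconnected and also unrelated on the kNN graph are negatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_pairs`, the edge-list inputs, and the treatment of all remaining pairs as neither positive nor negative are assumptions made for illustration.

```python
from itertools import combinations


def sample_pairs(num_nodes, edges, knn_edges):
    """Return (positives, negatives) as sets of node pairs (u, v) with u < v.

    Hypothetical helper: `edges` is the graph's adjacency as an edge list and
    `knn_edges` is an edge list of a kNN graph built from node features.
    """
    adjacency = {frozenset(e) for e in edges}
    knn = {frozenset(e) for e in knn_edges}
    positives, negatives = set(), set()
    for u, v in combinations(range(num_nodes), 2):
        pair = frozenset((u, v))
        if pair in adjacency:
            # Neighbors in the graph are positive samples of each other.
            positives.add((u, v))
        elif pair not in knn:
            # Disconnected and also unrelated on the kNN graph: negative.
            negatives.add((u, v))
        # Disconnected but kNN-related pairs are left unassigned here.
    return positives, negatives


positives, negatives = sample_pairs(
    num_nodes=4,
    edges=[(0, 1), (1, 2)],
    knn_edges=[(0, 2)],
)
```

In this toy graph, nodes 0 and 2 are not adjacent but are kNN-related, so the pair is excluded from both sets.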

Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation

  • Yiyang Chen
  • Shanshan Zhao
  • Changxing Ding
  • Liyao Tang
  • Chaoyue Wang
  • Dacheng Tao

In recent years, cross-modal domain adaptation has been studied on the paired 2D image and 3D LiDAR data to ease the labeling costs for 3D LiDAR semantic segmentation (3DLSS) in the target domain. However, in such a setting the paired 2D and 3D data in the source domain are still collected with additional effort. Since the 2D-3D projections can enable the 3D model to learn semantic information from the 2D counterpart, we ask whether we could further remove the need for source 3D data and only rely on the source 2D images. To answer this question, this paper studies a new 3DLSS setting where a 2D dataset (source) with semantic annotations and paired but unannotated 2D image and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL). Specifically, our CoMoDaL aims at modeling 1) inter-modal cross-domain distillation between the unpaired source 2D image and target 3D LiDAR data, and 2) the intra-domain cross-modal guidance between the target 2D image and 3D LiDAR data pair. In CoMoDaL, we propose to apply several constraints, such as point-to-pixel and prototype-to-pixel alignments, to associate the semantics in different modalities and domains by constructing mixed samples in two modalities. The experimental results on several datasets show that in the proposed setting, the developed CoMoDaL can achieve segmentation without the supervision of labeled LiDAR data. Ablations are also conducted to provide more analysis. Code will be available publicly.

Multi-View Representation Learning via View-Aware Modulation

  • Ren Wang
  • Haoliang Sun
  • Xiushan Nie
  • Yuxiu Lin
  • Xiaoming Xi
  • Yilong Yin

Multi-view (representation) learning derives an entity's representation from its multiple observable views to facilitate various downstream tasks. The most challenging topic is how to model unobserved entities and their relationships to specific views. To this end, this work proposes a novel multi-view learning method using a View-Aware parameter Modulation mechanism, termed VAM. The key idea is to use trainable parameters as proxies for unobserved entities and views, such that modeling entity-view relationships is converted into modeling the relationship between proxy parameters. Specifically, we first build a set of trainable parameters to learn a mapping from multi-view data to the unified representation as the entity proxy. Then we learn a prototype for each view and design a Modulation Parameter Generator (MPG) that learns a set of view-aware scale and shift parameters from prototypes to modulate the entity proxy and obtain view proxies. The parameters are updated alternately by constraining the representativeness, uniqueness, and simplicity of the proxies and by applying a proposed entity-view contrastive loss. We end up with a set of discriminative prototypes, view proxies, and an entity proxy that are flexible enough to yield robust representations for out-of-sample entities. Extensive experiments on five datasets show that our VAM outperforms existing methods in both classification and clustering tasks.
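
As a rough illustration of scale-and-shift modulation as described in this abstract, the sketch below derives per-view scale and shift vectors from a view prototype and applies them element-wise to an entity proxy. The abstract does not specify the generator's form, so the single linear map per output, the `1.0 +` offset on the scale, and the names `generate_modulation` and `modulate` are all assumptions for illustration.

```python
def generate_modulation(prototype, w_scale, w_shift):
    """Hypothetical generator: map a view prototype to (scale, shift).

    Assumes one linear map for each output; the scale is offset by 1.0 so
    that zero weights leave the entity proxy unchanged.
    """
    scale = [1.0 + sum(w * p for w, p in zip(row, prototype)) for row in w_scale]
    shift = [sum(w * p for w, p in zip(row, prototype)) for row in w_shift]
    return scale, shift


def modulate(entity_proxy, scale, shift):
    """Element-wise modulation: view_proxy = scale * entity_proxy + shift."""
    return [s * e + b for e, s, b in zip(entity_proxy, scale, shift)]


prototype = [1.0, 0.0]
scale, shift = generate_modulation(
    prototype,
    w_scale=[[0.5, 0.0], [0.0, 0.5]],
    w_shift=[[1.0, 0.0], [0.0, 1.0]],
)
view_proxy = modulate([2.0, 4.0], scale, shift)
```

With these toy weights the first coordinate of the entity proxy is rescaled and shifted while the second is left unchanged, giving a view-specific variant of the shared proxy.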

Uni-Dual: A Generic Unified Dual-Task Medical Self-Supervised Learning Framework

  • Boxiang Yun
  • Xingran Xie
  • Qingli Li
  • Yan Wang

RGB images and medical hyperspectral images (MHSIs) are two widely used modalities in computational pathology. The former is cheap, easy, and fast to obtain but lacks pathological information such as physiochemical state. The latter is an emerging modality that captures the interaction between electromagnetic radiation and matter but suffers from problems such as high time cost and low spatial resolution. In this paper, we propose a unified dual-task multi-modality self-supervised learning (SSL) framework, called Uni-Dual, which makes the most of both paired and unpaired RGB-MHSIs. Concretely, we design a unified SSL paradigm for RGB images and MHSIs. Two tasks are proposed: (1) a discrimination learning task, which learns high-level semantics by mining the cross-correlation across unpaired RGB-MHSIs, and (2) a reconstruction learning task, which models low-level stochastic variations by furthering the interaction across RGB-MHSI pairs. Our Uni-Dual enjoys the following benefits: (1) it is a unified model that can be easily transferred to different downstream tasks on various modality combinations; (2) it considers multi-constituent and structured information learning from MHSIs and RGB images for low-cost, high-precision clinical purposes. Experiments conducted on various downstream tasks with different modalities show that the proposed Uni-Dual substantially outperforms other competitive SSL methods.

Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

  • Yifan Dong
  • Suhang Wu
  • Fandong Meng
  • Jie Zhou
  • Xiaoli Wang
  • Jianxin Lin
  • Jinsong Su

Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair. In this regard, dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, there are still two main drawbacks: 1) only a limited number of sources, such as image captions, can be utilized to provide auxiliary information, and these may not be sufficient for the subsequent keyphrase generation; 2) the input text and image are often not perfectly matched, and thus the image may introduce noise into the model. To address these limitations, in this paper, we propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise. First, we introduce external visual entities of the image as the supplementary input to the model, which benefits the cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. In particular, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the previously mentioned correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves state-of-the-art performance. Our code is available on

Multi-modal Social Bot Detection: Learning Homophilic and Heterophilic Connections Adaptively

  • Shilong Li
  • Boyu Qiao
  • Kun Li
  • Qianqian Lu
  • Meng Lin
  • Wei Zhou

The detection of social bots has become a critical task in maintaining the integrity of social media. As social bots evolve continually, they primarily evade detection by imitating human features and engaging in interactions with humans. To reduce the impact of social bots imitating human features, also known as feature camouflage, existing methods mainly utilize multi-modal user information for detection, especially GNN-based methods that utilize additional topological structure information. However, these methods ignore relation camouflage, which involves disguising through interactions with humans. We find that relation camouflage results in both homophilic connections formed by nodes of the same type and heterophilic connections formed by nodes of different types in social networks. The existing GNN-based detection methods assume all connections are homophilic while ignoring the difference among neighbors in heterophilic connections, which leads to poor detection performance for bots with relation camouflage. To address this, we propose a multi-modal social bot detection method that adaptively learns homophilic and heterophilic connections (BothH for short). Specifically, we first determine whether each connection is homophilic or heterophilic with a connection classifier, and then design a novel message propagation strategy that can learn the homophilic and heterophilic connections adaptively. We conduct experiments on the mainstream datasets and the results show that our model is superior to state-of-the-art methods.
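
The definitions of homophilic and heterophilic connections used in this abstract can be made concrete with a small sketch. It partitions edges using known node labels purely to illustrate the definitions; the abstract describes a learned connection classifier for this step, which is not reproduced here. The function names and the toy accounts are hypothetical.

```python
def split_connections(edges, labels):
    """Partition edges by whether both endpoints share the same label."""
    homophilic, heterophilic = [], []
    for u, v in edges:
        if labels[u] == labels[v]:
            homophilic.append((u, v))    # same type, e.g. bot-bot
        else:
            heterophilic.append((u, v))  # different types, e.g. bot-human
    return homophilic, heterophilic


def homophily_ratio(edges, labels):
    """Fraction of edges that are homophilic (0.0 for an empty edge list)."""
    homophilic, _ = split_connections(edges, labels)
    return len(homophilic) / len(edges) if edges else 0.0


labels = {"a": "bot", "b": "human", "c": "bot"}
edges = [("a", "b"), ("a", "c"), ("b", "c")]
homophilic, heterophilic = split_connections(edges, labels)
```

A bot that mostly interacts with humans has many heterophilic edges, which is the case that message passing under an all-homophilic assumption handles poorly.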

CPU: Codebook Lookup Transformer with Knowledge Distillation for Point Cloud Upsampling

  • Weibing Zhao
  • Haiming Zhang
  • Chaoda Zheng
  • Xu Yan
  • Shuguang Cui
  • Zhen Li

Point clouds produced by 3D scanning are typically sparse, non-uniform, and noisy. Existing upsampling techniques directly learn the mapping from a sparse point set to a dense point set, which is often under-determined and ill-posed. To reduce the uncertainty and ambiguity of the upsampling mapping, this paper proposes a generic three-stage vector-quantization framework, which incorporates a Codebook lookup Transformer and knowledge distillation for Point Cloud Upsampling, named CPU. The proposed CPU reformulates the upsampling task into a relatively determinate code prediction task within a small, discrete proxy space. Since the traditional vector-quantization methods cannot be directly applied to point cloud upsampling scenarios, we introduce a knowledge distillation training scheme that facilitates efficient codebook learning and ensures full utilization of codebook entries. Specifically, we adopt a teacher-student training paradigm to avoid model collapse during codebook learning. In the first stage, we pre-train a vanilla auto-encoder of the dense point set as the teacher model, which provides rich guidance features to ensure sufficient codebook learning. In the second stage, we train a vector-quantized auto-encoder as a student model to capture high-fidelity geometric priors into a learned codebook with the aid of distillation. In the third stage, we propose a Codebook Lookup Transformer to model the global context of the sparse point set and predict the code indices. Then the coarse features of the sparse point set can be quantized and substituted by looking up the indices in the learned codebook. Benefiting from the expressive codebook priors and the distillation training scheme, the proposed CPU outperforms state-of-the-art methods quantitatively and qualitatively.
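
To illustrate what a codebook lookup means in a vector-quantization setting, the sketch below maps feature vectors to the indices of their nearest codebook entries and then substitutes each feature with the entry at that index. Nearest-neighbor assignment is the standard vector-quantization rule and is used here only for illustration; per the abstract, CPU's third stage predicts the code indices with a Transformer instead. All names and the toy codebook are assumptions.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def lookup(indices, codebook):
    """Replace each code index with its codebook entry."""
    return [codebook[i] for i in indices]


def quantize(features, codebook):
    """Assign each feature to its nearest code, then substitute the entries."""
    indices = [nearest_code(f, codebook) for f in features]
    return indices, lookup(indices, codebook)


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
indices, quantized = quantize([[0.1, -0.1], [1.9, 2.2]], codebook)
```

Because the output is restricted to a small discrete set of entries, predicting an index is a more constrained problem than regressing dense coordinates directly, which is the motivation the abstract gives for the proxy space.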

Your tone speaks louder than your face! Modality Order Infused Multi-modal Sarcasm Detection

  • Mohit Tomar
  • Abhisek Tiwari
  • Tulika Saha
  • Sriparna Saha

Figurative language is an essential component of human communication, and detecting sarcasm in text has become a challenging yet highly popular task in natural language processing. As humans, we rely on a combination of visual and auditory cues, such as facial expressions and tone of voice, to comprehend a message. Our brains are implicitly trained to integrate information from multiple senses to form a complete understanding of the message being conveyed, a process known as multi-sensory integration. The combination of different modalities not only provides additional information but also amplifies the information conveyed by each modality in relation to the others. Thus, the infusion order of different modalities also plays a significant role in multimodal processing. In this paper, we investigate the impact of different modality infusion orders for identifying sarcasm in dialogues. We propose a modality order-driven module integrated into a transformer network, MO-Sarcation, that fuses modalities in an ordered manner. Our model outperforms several state-of-the-art models by 1-3% across various metrics, demonstrating the crucial role of modality order in sarcasm detection. The obtained improvements and detailed analysis show that audio tone should be infused with textual content first, followed by visual information, to identify sarcasm efficiently. The code and dataset are available at

Fine-Grained Multimodal Named Entity Recognition and Grounding with a Generative Framework

  • Jieming Wang
  • Ziyan Li
  • Jianfei Yu
  • Li Yang
  • Rui Xia

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a pair of text and image. However, most previous MNER works focus on extracting entities in the form of text but fail to ground text symbols to their corresponding visual objects. Moreover, existing MNER studies primarily classify entities into four coarse-grained entity types, which are often insufficient to map them to their real-world referents. To address these limitations, we introduce a task named Fine-grained Multimodal Named Entity Recognition and Grounding (FMNERG) in this paper, which aims to simultaneously extract named entities in text, their fine-grained entity types, and their grounded visual objects in the image. Moreover, we construct a Twitter dataset for the FMNERG task, and further propose a T5-based multImodal GEneration fRamework (TIGER), which formulates FMNERG as a generation problem by converting all the entity-type-object triples into a target sequence and adapts a pre-trained sequence-to-sequence model T5 to directly generate the target sequence from an image-text input pair. Experimental results demonstrate that TIGER performs significantly better than a number of baseline systems on the annotated Twitter dataset. Our dataset annotation and source code are publicly released at

SkipStreaming: Pinpointing User-Perceived Redundancy in Correlated Web Video Streaming through the Lens of Scenes

  • Wei Liu
  • Xinlei Yang
  • Zhenhua Li
  • Feng Qian

When streaming over the web, correlated videos (e.g., a series of TV episodes) appear to bear considerable redundant clips, mostly included in the intros, outros, recaps, and commercial breaks, leading to a waste of network traffic and playback time. Mainstream video content providers have taken various measures to identify these clips, but these measures often result in unexpected and undesirable user experiences. In this paper, we conduct a large-scale, crowdsourced study to demystify the root causes of poor experiences. Driven by the findings, we propose to reconsider the problem from a novel perspective of scenes without going through the excessive video frames, paying special attention to how the contents of correlated videos are organized during video production. To enable this idea, we design efficient approaches to the separation of video scenes and the identification of visual redundancy. We build an open-source system to embody our design, which achieves fast (e.g., taking ~38 seconds to process a 45-minute video using a common commodity server) and accurate (incurring only 770-ms deviation on average) redundancy recognition on representative workloads.

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

  • Zhao Yang
  • Bing Su
  • Ji-Rong Wen

Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at

Layout Sequence Prediction From Noisy Mobile Modality

  • Haichao Zhang
  • Yi Xu
  • Hongsheng Lu
  • Takayuki Shimizu
  • Yun Fu

Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics. Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities. Nevertheless, real-world situations often involve obstructed cameras, missed objects, or objects out of sight due to environmental factors, leading to incomplete or noisy trajectories. To overcome these limitations, we propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories. LTrajDiff utilizes sensor data from mobile phones to surmount out-of-sight constraints, albeit introducing new challenges such as modality fusion, noisy data, and the absence of spatial layout and object size information. We employ a denoising diffusion model to predict precise layout sequences from noisy mobile data using a coarse-to-fine diffusion strategy, incorporating the Random Mask Strategy, Siamese Masked Encoding Module, and Modality Fusion Module. Our model predicts layout sequences by implicitly inferring object size and projection status from a single reference timestamp or significantly obstructed sequences. Achieving state-of-the-art results in randomly obstructed experiments, our model outperforms other baselines in extremely short input experiments, illustrating the effectiveness of leveraging noisy mobile data for layout sequence prediction. In summary, our approach offers a promising solution to the challenges faced by layout sequence and trajectory prediction models in real-world settings, paving the way for utilizing sensor data from mobile phones to accurately predict pedestrian bounding box trajectories. 
To the best of our knowledge, this is the first work that addresses severely obstructed and extremely short layout sequences by combining vision with noisy mobile modality, making it the pioneering work in the field of layout sequence trajectory prediction.

Graph-Based Video-Language Learning with Multi-Grained Audio-Visual Alignment

  • Chenyang Lyu
  • Wenxi Li
  • Tianbo Ji
  • Longyue Wang
  • Liting Zhou
  • Cathal Gurrin
  • Linyi Yang
  • Yi Yu
  • Yvette Graham
  • Jennifer Foster

Video-language learning has attracted significant attention in the fields of multimedia, computer vision and natural language processing in recent years. One of the key challenges in this area is how to effectively integrate visual and linguistic information to enable machines to understand video content and query information. In this work, we leverage graph-based representations and multi-grained audio-visual alignment to address this challenge. First, our approach transforms video and query inputs into visual-scene graphs and semantic role graphs using a visual-scene parser and a semantic role labeler, respectively. These graphs are then encoded using graph neural networks to obtain enriched representations and combined to obtain a video-query joint representation that enhances the semantic expressivity of the inputs. Second, to achieve accurate matching of relevant parts of audio and visual features, we propose a multi-grained alignment module that aligns the audio and visual features at multiple scales. This enables us to effectively fuse the audio and visual information in a way that is consistent with the semantic-level information captured by the graph-based representations. Experiments on five representative datasets collected for Video Retrieval and Video Question Answering tasks show that our approach outperforms prior work on several metrics. Our extensive ablation studies demonstrate the effectiveness of graph-based representation and multi-grained audio-visual alignment.

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

  • Meng Liu
  • Fenglei Zhang
  • Xin Luo
  • Fan Liu
  • Yinwei Wei
  • Liqiang Nie

Video question answering is an increasingly vital research field, spurred by the rapid proliferation of video content online and the urgent need for intelligent systems that can comprehend and interact with this content. Existing methodologies often lean towards video understanding and cross-modal information interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a groundbreaking framework emphasizing nuanced question understanding. Our approach begins by extracting object, appearance, and motion features from videos. Subsequently, we harness multi-layer outputs from a pre-trained language model, ensuring a thorough grasp of the question. Integrating object data into appearance is guided by global question and frame representation, facilitating the adaptive acquisition of appearance and motion-enhanced question representation. By amalgamating multi-modal question insights, our methodology adeptly determines answers to questions. Experimental results conducted on three benchmarks demonstrate the superiority of our tailored approach, underscoring the importance of advanced question comprehension in VideoQA.

Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

  • Wenrui Li
  • Xi-Le Zhao
  • Zhengyu Ma
  • Xingtao Wang
  • Xiaopeng Fan
  • Yonghong Tian

Audio-visual zero-shot learning (ZSL) has attracted broad attention, as it could classify video data from classes that are not observed during training. However, most of the existing methods suffer from background scene bias and capture fewer motion details because they employ a single-stream network to process scene and motion information as a unified entity. In this paper, we address this challenge by proposing a novel dual-stream architecture, Motion-Decoupled Spiking Transformer (MDFT), to explicitly decouple the contextual semantic information and highly sparse dynamic motion information. Specifically, the Recurrent Joint Learning Unit (RJLU) could extract contextual semantic information effectively and understand the environment in which actions occur by capturing joint knowledge between different modalities. By converting RGB images to events, our approach effectively captures motion information while mitigating the influence of background scene biases, leading to more accurate classification results. We utilize the inherent strengths of Spiking Neural Networks (SNNs) to process highly sparse event data efficiently. Additionally, we introduce a Discrepancy Analysis Block (DAB) to model the audio motion features. To enhance the efficiency of SNNs in extracting dynamic temporal and motion information, we dynamically adjust the threshold of Leaky Integrate-and-Fire (LIF) neurons based on the statistical cues of global motion and contextual semantic information. Our experiments demonstrate the effectiveness of MDFT, which consistently outperforms state-of-the-art methods across mainstream benchmarks. Moreover, we find that motion information serves as a powerful regularization for video networks, where using it improves the accuracy of HM and ZSL by 19.1% and 38.4%, respectively.
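
For readers unfamiliar with Leaky Integrate-and-Fire neurons, the following minimal sketch shows a common discrete-time LIF update in which the firing threshold can differ at every time step. It does not reproduce how MDFT derives its thresholds from motion and semantic statistics; the threshold sequence is simply passed in. The decay factor, the hard reset to zero, and the name `lif_run` are assumptions for illustration.

```python
def lif_run(inputs, thresholds, decay=0.5):
    """Simulate one LIF neuron over a sequence of input currents.

    At each step the membrane potential leaks by `decay`, integrates the
    input, and emits a spike (then resets to 0) once it reaches that step's
    threshold. `thresholds` holds one value per step, so it can vary in time.
    """
    potential = 0.0
    spikes = []
    for current, threshold in zip(inputs, thresholds):
        potential = decay * potential + current  # leak, then integrate
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


inputs = [0.6, 0.6, 0.6, 0.6]
sparse = lif_run(inputs, thresholds=[1.0] * 4)
dense = lif_run(inputs, thresholds=[0.5] * 4)
```

Lowering the threshold makes the same input produce more spikes, which is why adapting it changes how much temporal detail the neuron passes on.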

Multimodal Color Recommendation in Vector Graphic Documents

  • Qianru Qiu
  • Xueting Wang
  • Mayu Otani

Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable for another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, as well as full palette generation methods concerning color diversity and similarity to the ground truth palettes.

Open-Vocabulary Object Detection via Scene Graph Discovery

  • Hengcan Shi
  • Munawar Hayat
  • Jianfei Cai

In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, while these data usually contain much more information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. Firstly, a scene-graph-based decoder (SGDecoder) including sparse scene-graph-guided attention (SSGA) is presented. It captures scene graphs and leverages them to discover OV objects. Secondly, we propose scene-graph-based prediction (SGPred), where we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Thirdly, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show the ability of our model for OV scene graph detection, while previous OV scene graph generation methods cannot tackle this task.

Universal Domain Adaptive Network Embedding for Node Classification

  • Jushuo Chen
  • Feifei Dai
  • Xiaoyan Gu
  • Jiang Zhou
  • Bo Li
  • Weiping Wang

Cross-network node classification aims to leverage the abundant knowledge from a labeled source network to help classify the nodes in an unlabeled target network. However, existing methods assume that label sets are identical across domains, which is easily violated in practice. Hence, we attempt to integrate network embedding with universal domain adaptation, which transfers valuable knowledge across domains without assumptions on the label sets, to assist in node classification. Nonetheless, the complex network relationships between nodes increase the difficulty of this universal domain adaptive node classification task. In this work, we propose a novel Universal Domain Adaptive Network Embedding (UDANE) framework, which learns transferable node representations across networks to succeed in such a task. Technically, we first adopt the cross-network node embedding component to model comprehensive node information of both networks. Then we employ the inter-domain adaptive alignment component to exploit and relate knowledge across domains, learning domain-invariant representations for knowledge transfer. In addition, the intra-domain contrastive alignment component is proposed to learn discriminative representations beneficial for classification by sufficiently utilizing unlabeled data in the target domain. Extensive experiments have been conducted on real-world datasets, demonstrating that the proposed UDANE model outperforms the state-of-the-art baselines by a large margin.

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

  • Chenyu Yang
  • Mengxi Chen
  • Yanfeng Wang
  • Yu Wang

Audio-visual speaker diarization refers to the task of identifying "who spoke when" by using both audio and video data. Although previous fusion-based approaches have shown exceptional performance over audio-only methods, they have mainly focused on high-quality data and have not accounted for the impacts of acoustic noise or missing faces. To address these limitations, we propose a novel uncertainty-aware end-to-end audio-visual speaker diarization (UAV-SD) approach in this paper. Our approach leverages both framewise inter- and intra-modal confidence to achieve more effective and robust speaker diarization. By taking into account the uncertainty of the data, UAV-SD can achieve better diarization performance even in noisy or low-quality recordings. Additionally, our approach is compatible with multi-channel audio signals without the need to retrain the model, making it a more versatile solution. To evaluate the effectiveness of our approach, we conduct extensive experiments on the Multi-modal Information Based Speech Processing (MISP) 2022 Challenge datasets which consist of far-field audio and video data. The results show that UAV-SD is able to yield significant performance gains compared to baseline methods for both single and multi-channel data, demonstrating its effectiveness in real-world scenarios.

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

  • Tianyu Liu
  • Peng Zhang
  • Wei Huang
  • Yufei Zha
  • Tao You
  • Yanning Zhang

Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, insufficient attention to the heterogeneity of features across modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of visual and audio modalities, the discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Substantial experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at

HELIOS: Hyper-Relational Schema Modeling from Knowledge Graphs

  • Yuhuan Lu
  • Bangchao Deng
  • Weijian Yu
  • Dingqi Yang

Knowledge graph (KG) schema, which prescribes a high-level structure and semantics of a KG, is significantly helpful for KG completion and reasoning problems. Despite its usefulness, open-domain KGs do not practically have a unified and fixed schema. Existing approaches usually extract schema information using entity types from a KG where each entity e can be associated with a set of types {Te}, by either heuristically taking one type for each entity or exhaustively combining the types of all entities in a fact (to get entity-typed tuples, (h_type, r, t_type) for example). However, these two approaches either overlook the role of multiple types of a single entity across different facts or introduce non-negligible noise as not all the type combinations actually support the fact, thus failing to capture the sophisticated schema information. Against this background, we study the problem of modeling hyper-relational schema, which is formulated as mixed hyper-relational tuples ({Th}, r, {Tt}, k, {Tv1},...) with two-fold hyper-relations: each type set T may contain multiple types and each schema tuple may contain multiple key-type set pairs (k, Tv). To address this problem, we propose HELIOS, a hyper-relational schema model designed to subtly learn from such hyper-relational schema tuples by capturing not only the correlation between multiple types of a single entity, but also the correlation between types of different entities and relations in a schema tuple. We evaluate HELIOS on three real-world KG datasets in different schema prediction tasks. Results show that HELIOS consistently outperforms state-of-the-art hyper-relational link prediction techniques by 20.0-29.7%, and is also much more robust than baselines in predicting types and relations across different positions in a hyper-relational schema tuple.
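
As a rough illustration of the two formulations contrasted in this abstract, the following minimal Python sketch builds the exhaustively combined (h_type, r, t_type) tuples and, separately, a single hyper-relational schema tuple that keeps each type set intact. The function names, the relation, and the type names are hypothetical and are not taken from HELIOS; the sketch only shows the data shapes, not the model.

```python
from itertools import product


def exhaustive_typed_tuples(head_types, relation, tail_types):
    """Baseline-style expansion: one (h_type, r, t_type) tuple per type pair."""
    return [(h, relation, t)
            for h, t in product(sorted(head_types), sorted(tail_types))]


def schema_tuple(head_types, relation, tail_types, qualifiers=()):
    """Hyper-relational form: type sets stay whole, followed by (key, type-set) pairs."""
    return (
        frozenset(head_types),
        relation,
        frozenset(tail_types),
        tuple((key, frozenset(types)) for key, types in qualifiers),
    )


# Hypothetical fact; all names below are illustrative assumptions.
head = {"Person", "Scientist"}
tail = {"Organization", "University"}

flat = exhaustive_typed_tuples(head, "educated_at", tail)
hyper = schema_tuple(head, "educated_at", tail,
                     [("degree", {"AcademicDegree"})])
```

With two types on each side, the exhaustive expansion already yields four flat tuples, not all of which need to support the fact, whereas the hyper-relational form stays a single tuple.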

Breaking the Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA

  • Zhongfan Sun
  • Yongli Hu
  • Qingqing Gao
  • Huajie Jiang
  • Junbin Gao
  • Yanfeng Sun
  • Baocai Yin

Considerable performance gains have been achieved for knowledge-based visual question answering thanks to visual-language pre-training models with the pre-training-then-fine-tuning paradigm. However, because the targets of the pre-training and fine-tuning stages are different, there is an evident barrier that prevents the cross-modal comprehension ability developed in the pre-training stage from fully benefiting the fine-tuning task. To break this barrier, in this paper, we propose a novel hybrid prompting model for knowledge-based VQA, which inherits and incorporates the pre-training and fine-tuning tasks with a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense retrieval manner. Additionally, a dynamic knowledge prompt is learned from retrieved knowledge, which not only alleviates the length constraint on inputs for visual-language pre-trained models but also assists in providing answer features via fine-tuning. Combining and unifying the aims of the two stages could fully exploit the abilities of pre-training and fine-tuning to predict the answer. We evaluate the proposed model on the OKVQA dataset, and the result shows that our model outperforms the state-of-the-art methods based on visual-language pre-training models with a noticeable performance gap and even exceeds the large-scale language model GPT-3, which demonstrates the benefits of the hybrid prompts and the advantages of unifying pre-training and fine-tuning.

OccluBEV: Occlusion Aware Spatiotemporal Modeling for Multi-view 3D Object Detection

  • Ziteng Wen
  • Hai Xu
  • Chenyu Liu
  • Tao Guo
  • Jinshui Hu
  • Xuming He
  • Fengren Wang
  • Shun Lou
  • Haibo Fan

Bird's-Eye-View (BEV) based 3D visual perception, which formulates a unified space for multi-view representation, has received wide attention in autonomous driving due to its scalability for downstream tasks. However, the view transform in transformer-based BEV methods is agnostic of 3D occlusion relationships, resulting in model degradation. To construct a higher-quality BEV space, this paper analyzes the mutual occlusion problems in the view transform process and proposes a new transformer-based method named OccluBEV. OccluBEV alleviates the occlusion issue via point cloud information distillation in both the image and BEV space. Specifically, in the image space, we perform depth estimation for each pixel and utilize it to guide image feature mapping. Further, since predicting depth directly from a monocular image is ill-posed and ignores stereo information such as multi-view and temporal cues, this paper introduces a voxel visibility segmentation task in 3D BEV space. The task explicitly predicts whether each voxel in the 3D BEV grid is occupied or not. In addition, to alleviate the overfitting problem in BEV feature learning under a single task, we design a multi-head learning framework which jointly models multiple strongly-correlated tasks in a unified BEV space. The effectiveness of the proposed method is fully validated on the nuScenes dataset, achieving a competitive NDS/mAP score of 57.5/47.9 on the nuScenes test leaderboard using a ResNet101 backbone, which is superior to state-of-the-art camera-based solutions.

SESSION: Poster Session III: Understanding Multimedia Content -- Vision and Language

Semantics-Enriched Cross-Modal Alignment for Complex-Query Video Moment Retrieval

  • Xingyu Shen
  • Xiang Zhang
  • Xun Yang
  • Yibing Zhan
  • Long Lan
  • Jianfeng Dong
  • Hongzhou Wu

Video moment retrieval (VMR) aims to search for a video segment that matches the search intent in a query sentence, and has received increasing attention in recent years due to its practical value in various fields. Existing efforts devoted to this interesting yet challenging task typically encode the query sentence and video segments into unstructured global representations for cross-modal interaction and fusion, which may fail to accurately capture the search intent in complex queries with multi-granularity semantics.

To fill the research gap, this paper presents a novel solution termed semantics-enriched video moment retrieval method (SVMR), which can effectively and explicitly model the hierarchical multi-granularity semantics of complex textual queries. Specifically, we first explore cross-token relations to offer multiple granularity query representations with hierarchical semantic contexts of semantically associated tokens for fine-grained cross-modal interaction and fusion, which contributes to mining rich visual motion cues semantically related to different activities and entities in complex queries. Furthermore, to fully leverage fine-grained cross-modal cues for moment retrieval, we design a specific temporal boundary reasoning module by explicitly generating start and end time-aware filter kernels with visual cues to perceive the moment boundaries. Extensive experiments and analyses on three public benchmarks clearly demonstrate the advantage of our proposed SVMR over existing state-of-the-art approaches, especially in retrieving complex query-based video moments.

NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer

  • Yun Liu
  • Zhongsheng Yan
  • Sixiang Chen
  • Tian Ye
  • Wenqi Ren
  • Erkang Chen

Nighttime image dehazing is a challenging task due to the presence of multiple types of adverse degrading effects including glow, haze, blur, noise, color distortion, and so on. However, most previous studies mainly focus on daytime image dehazing or on only some of the degradations present in nighttime hazy scenes, which may lead to unsatisfactory restoration results. In this paper, we propose an end-to-end transformer-based framework for nighttime haze removal, called NightHazeFormer. Our proposed approach consists of two stages: supervised pre-training and semi-supervised fine-tuning. During the pre-training stage, we introduce two powerful priors into the transformer decoder to generate the non-learnable prior queries, which guide the model to extract specific degradations. For the fine-tuning, we combine the generated pseudo ground truths with input real-world nighttime hazy images as paired images and feed them into the synthetic domain to fine-tune the pre-trained model. This semi-supervised fine-tuning paradigm helps improve the generalization to the real domain. In addition, we also propose a large-scale synthetic dataset called UNREAL-NH, to simulate the real-world nighttime haze scenarios comprehensively. Extensive experiments on several synthetic and real-world datasets demonstrate the superiority of our NightHazeFormer over state-of-the-art nighttime haze removal methods both visually and quantitatively.

FSNet: Frequency Domain Guided Superpixel Segmentation Network for Complex Scenes

  • Hua Li
  • Junyan Liang
  • Wenjie Li
  • Wenhui Wu

Existing superpixel segmentation algorithms mainly focus on high-quality natural images, while neglecting the inevitable environmental constraints in complex scenes. In this paper, we propose an end-to-end frequency domain guided superpixel segmentation network (FSNet) to generate superpixels with sharp boundary adherence for complex scenes by fusing the deep features in spatial and frequency domains. To utilize the frequency domain information of the image, an improved frequency information extractor (IFIE) is proposed to extract the frequency domain information with sharp boundary features. Moreover, considering that over-sharp features may damage the semantic information of superpixels, we further design a dense hybrid atrous convolution (DHAC) block to preserve semantic information by capturing wider and deeper semantic information in the spatial domain. Finally, the extracted deep features in spatial and frequency domains are fused to generate semantic perceptual superpixels with sharp boundary adherence. Extensive experiments on multiple challenging datasets with complex boundaries demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively, and we further verify the superiority of the proposed method when applied to salient object detection.

Zero-Shot Learning by Harnessing Adversarial Samples

  • Zhi Chen
  • Pengfei Zhang
  • Jingjing Li
  • Sen Wang
  • Zi Huang

Zero-Shot Learning (ZSL) aims to recognize unseen classes by generalizing the knowledge, i.e., visual and semantic relationships, obtained from seen classes, where image augmentation techniques are commonly applied to improve the generalization ability of a model. However, this approach can also cause adverse effects on ZSL since conventional augmentation techniques that solely depend on single-label supervision are not able to maintain semantic information and consequently result in the semantic distortion issue. In other words, image augmentation may falsify the semantic (e.g., attribute) information of an image. To take advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). HAS advances ZSL through adversarial training which takes into account three crucial aspects: (1) robust generation by enforcing augmentations to be similar to negative classes, while maintaining correct labels, (2) reliable generation by introducing a latent space constraint to avert significant deviations from the original data manifold, and (3) diverse generation by incorporating attribute-based perturbation by adjusting images according to each semantic attribute's localization. Through comprehensive experiments on three prominent zero-shot benchmark datasets, we demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios. Our source code is available at

Sequential Affinity Learning for Video Restoration

  • Tian Ye
  • Sixiang Chen
  • Yun Liu
  • Wenhao Chai
  • Jinbin Bai
  • Wenbin Zou
  • Yunchen Zhang
  • Mingchao Jiang
  • Erkang Chen
  • Chenghao Xue

Video restoration networks aim to restore high-quality frame sequences from degraded ones. However, traditional video restoration methods heavily rely on temporal modeling operators or optical flow estimation, which limits their versatility. The aim of this work is to present a novel approach for video restoration that eliminates inefficient temporal modeling operators and pixel-level feature alignment in the network architecture. The proposed method, Sequential Affinity Learning Network (SALN), is designed based on an affinity mechanism that establishes direct correspondences between the Query frame, degraded sequence, and restored frames in latent space. This unique perspective allows for more accurate and effective restoration of video content without relying on temporal modeling operators or optical flow estimation techniques. Moreover, we enhanced the design of the channel-wise self-attention block to improve the decoder's performance for video restoration. Our method outperformed previous state-of-the-art methods by a significant margin in several classic video tasks, including video deraining, video dehazing, and video waterdrop removal, demonstrating excellent efficiency. As a novel network that differs significantly from previous video restoration methods, SALN aims to provide innovative ideas and directions for video restoration. Our contributions include proposing a novel affinity-based approach for video restoration, enhancing the design of the channel-wise self-attention block, and achieving state-of-the-art performance on several classic video tasks.

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

  • Yiwei Ma
  • Xiaoshuai Sun
  • Jiayi Ji
  • Guannan Jiang
  • Weilin Zhuang
  • Rongrong Ji

Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these modalities continue to pose a challenge. Previous methods have attempted to align text and image samples in a modal-shared space, but they face uncertainties in optimization directions due to the movable features of both modalities and the failure to account for one-to-many relationships of image-text pairs in TPR datasets. To address this issue, we propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample, thus mitigating the optimization problem. Additionally, this embedding scheme generates multiple features for each sample without introducing trainable parameters, making it easier to align with several positive samples. Based on this paradigm, we propose a novel Bi-directional one-to-many Embedding Alignment (Beat) model to address the TPR task. Our experimental results demonstrate that the proposed Beat model achieves state-of-the-art performance on three popular TPR datasets, including CUHK-PEDES (65.61 R@1), ICFG-PEDES (58.25 R@1), and RSTPReID (48.10 R@1). Furthermore, additional experiments on MS-COCO, CUB, and Flowers datasets further demonstrate the potential of Beat to be applied to other image-text retrieval tasks.
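
The R@1 figures quoted in this abstract are rank-1 recall scores. As a minimal sketch of how such a score is commonly computed when one text query may match several gallery images of the same identity, the Python below counts a query as a hit if any of its top-k ranked images shares its identity. The function name, the similarity values, and the identity labels are illustrative assumptions, not the evaluation code of Beat.

```python
def recall_at_k(similarity, query_ids, gallery_ids, k=1):
    """Percentage of queries whose top-k gallery items contain a matching identity.

    similarity[i][j] is the score between query i and gallery item j.
    """
    hits = 0
    for row, qid in zip(similarity, query_ids):
        # Indices of the k highest-scoring gallery items for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if any(gallery_ids[j] == qid for j in ranked):
            hits += 1
    return 100.0 * hits / len(query_ids)


# Toy data (hypothetical): three text queries, three gallery images.
similarity = [
    [0.9, 0.2, 0.4],
    [0.1, 0.3, 0.8],
    [0.5, 0.3, 0.2],
]
query_ids = [7, 8, 9]
gallery_ids = [7, 9, 8]

r1 = recall_at_k(similarity, query_ids, gallery_ids, k=1)
r2 = recall_at_k(similarity, query_ids, gallery_ids, k=2)
```

In this toy case the third query ranks a wrong identity first and its correct match second, so it counts as a miss at k=1 and as a hit at k=2.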

Transformer-based Point Cloud Generation Network

  • Rui Xu
  • Le Hui
  • Yuehui Han
  • Jianjun Qian
  • Jin Xie

Point cloud generation is an important research topic in 3D computer vision, which can provide high-quality datasets for various downstream tasks. However, efficiently capturing the geometry of point clouds remains a challenging problem due to their irregularities. In this paper, we propose a novel transformer-based 3D point cloud generation network to generate realistic point clouds. Specifically, we first develop a transformer-based interpolation module that utilizes k-nearest neighbors at different scales to learn global and local information about point clouds in the feature space. Based on geometric information, we interpolate new point features to upsample the point cloud features. Then, the upsampled features are used to generate a coarse point cloud with spatial coordinate information. We construct a transformer-based refinement module to enhance the upsampled features in feature space with geometric information in coordinate space. Finally, we use a multi-layer perceptron on the upsampled features to generate the final point cloud. Extensive experiments on ShapeNet and ModelNet demonstrate the effectiveness of our proposed method.

Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks

  • Jun Guo
  • Xingyu Zheng
  • Aishan Liu
  • Siyuan Liang
  • Yisong Xiao
  • Yichao Wu
  • Xianglong Liu

Despite the broad application of Machine Learning models as a Service (MLaaS), they are vulnerable to model stealing attacks. These attacks can replicate the model functionality by using the black-box query process without any prior knowledge of the target victim model. Existing stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers. However, these defenses suffer from high inference computational overheads and unfavorable trade-offs between benign accuracy and stealing robustness, which challenges the feasibility of deployed models in practice. To address these problems, this paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses. Instead of deploying auxiliary defense modules that introduce redundant inference time, InI directly trains a defensive model by isolating the adversary's training gradient from the expected gradient, which can effectively reduce the inference computational cost. In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries, which can induce the adversary to extract little useful knowledge from victim models with minimal impact on the benign performance. Extensive experiments on several visual classification datasets (e.g., MNIST and CIFAR10) demonstrate the superior robustness (up to 48% reduction on stealing accuracy) and speed (up to 25.4× faster) of our InI over other state-of-the-art methods. Our codes can be found in

Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval

  • Daizong Liu
  • Xiaoye Qu
  • Jianfeng Dong
  • Guoshun Nan
  • Pan Zhou
  • Zichuan Xu
  • Lixing Chen
  • He Yan
  • Yu Cheng

This paper addresses the challenging task of language-driven moment retrieval. Previous methods are typically trained to localize the target moment corresponding to a single sentence query in a complicated video. However, this specific moment generally delivers richer contents than the query, i.e., the semantics of one query may miss certain object details or actions in the complex foreground-background visual contents. Such information imbalance between two modalities makes it difficult to finely align their representations. To this end, instead of training with a single query, we propose to utilize the diversity and complementarity among different queries corresponding to the same video moment for enriching the textual semantics. Specifically, we develop a Teacher-Student Moment Retrieval (TSMR) framework to fill this cross-modal information gap. A teacher model is trained to not only encode a certain query but also capture extra complementary queries to aggregate contextual semantics for obtaining more comprehensive moment-related query representations. Since the additional queries are inaccessible during inference, we further introduce an adaptive knowledge distillation mechanism to train a student model with a single query input by selectively absorbing the knowledge from the teacher model. In this manner, the student model is more robust to the cross-modal information gap during the moment retrieval guided by a single query. Experimental results on two benchmarks demonstrate the effectiveness of our proposed method.
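
The teacher-student transfer described in this abstract can be pictured with a generic temperature-scaled distillation loss. The Python sketch below is the standard KL-divergence formulation with an optional per-sample weight standing in for selective absorption; it is an assumption for illustration, not the adaptive mechanism of TSMR, and the function names and default values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0, weight=1.0):
    """Weighted KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher (multi-query) prediction
    q = softmax(student_logits, temperature)  # student (single-query) prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor is the usual rescaling that keeps gradient magnitudes comparable.
    return weight * temperature ** 2 * kl


# Hypothetical logits over three candidate moments.
loss_same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Setting `weight` below 1 for samples where the teacher is unreliable is one simple way to make the transfer selective; how the paper actually chooses what to absorb is not specified here.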

Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

  • Zhibo Tian
  • Xiaolin Zhang
  • Peng Zhang
  • Kun Zhan

Semi-supervised semantic segmentation (SSS) is an important task that utilizes both labeled and unlabeled data to reduce expenses on labeling training examples. However, the effectiveness of SSS algorithms is limited by the difficulty of fully exploiting the potential of unlabeled data. To address this, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. By aligning positive pairs with a pixel-wise contrastive loss using strong augmented views in both low-level image space and high-level feature space, the proposed DSSN is designed to maximize the utilization of available unlabeled data. Additionally, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that do not perform selection or apply a predefined threshold for all classes. Specifically, our strategy selects the top high-confidence prediction of the weak view for each class to generate pseudo labels that supervise the strong augmented views. This strategy is capable of taking into account the class imbalance and improving the performance of long-tailed classes. Our proposed method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, outperforming other SSS algorithms by a significant margin. The source code is available at
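
The class-aware selection described in this abstract can be sketched as follows: for each class, keep only the most confident weak-view predictions of that class as pseudo labels and ignore the rest. The Python below is a minimal illustration under assumed details (a flat list of per-pixel probabilities, a hypothetical `top_k` parameter, and `-1` as the ignore label); it is not the DSSN implementation.

```python
def select_pseudo_labels(probs, top_k=1, ignore_index=-1):
    """Per-class top-k selection of pseudo labels from weak-view probabilities.

    probs[i][c] is the predicted probability of class c at pixel i.
    """
    # Predicted class per pixel (argmax over classes).
    preds = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    labels = [ignore_index] * len(probs)
    num_classes = len(probs[0])
    for c in range(num_classes):
        # Pixels predicted as class c, most confident first.
        candidates = [i for i, y in enumerate(preds) if y == c]
        candidates.sort(key=lambda i: probs[i][c], reverse=True)
        for i in candidates[:top_k]:
            labels[i] = c
    return labels


# Toy weak-view output (hypothetical): four pixels, two classes.
probs = [[0.9, 0.1], [0.6, 0.4], [0.45, 0.55], [0.2, 0.8]]
labels = select_pseudo_labels(probs, top_k=1)
```

On this toy input a single global threshold of 0.85 would keep only the first pixel and discard class 1 entirely, whereas the per-class selection retains one pseudo label for each class, which is the intuition behind handling class imbalance.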

Focusing on Flexible Masks: A Novel Framework for Panoptic Scene Graph Generation with Relation Constraints

  • Jiarui Yang
  • Chuan Wang
  • Zeming Liu
  • Jiahong Wu
  • Dongsheng Wang
  • Liang Yang
  • Xiaochun Cao

Panoptic Scene Graph Generation (PSG) presents pixel-wise instance detection and localization, leading to comprehensive and precise scene graphs. Current methods employ conventional Scene Graph Generation (SGG) frameworks to solve the PSG problem, neglecting the fundamental differences between bounding boxes and masks, i.e., bounding boxes are allowed to overlap but masks are not. Since segmentation from the panoptic head has deviations, non-overlapping masks may not afford complete instance information. Consequently, in the training phase, incompletely segmented instances may not be well-aligned to annotated ones, causing mismatched relations and insufficient training. During the inference phase, incomplete segmentation leads to incomplete scene graph prediction. To alleviate these problems, we construct a novel two-stage framework for the PSG problem. In the training phase, we design a proposal matching strategy, which replaces deterministic segmentation results with proposals extracted from the off-the-shelf panoptic head for label alignment, thereby ensuring the all-matching of training samples. In the inference phase, we present an innovative concept of employing relation predictions to constrain segmentation and design a relation-constrained segmentation algorithm. By reconstructing the process of generating segmentation results from proposals using predicted relation results, the algorithm recovers more valid instances and predicts more complete scene graphs. The experimental results show overall superiority, effectiveness, and robustness against adversarial attacks.

CCMB: A Large-scale Chinese Cross-modal Benchmark

  • Chunyu Xie
  • Heng Cai
  • Jincheng Li
  • Fanjing Kong
  • Xiaoyu Wu
  • Jianfei Song
  • Henrique Morimitsu
  • Lin Yao
  • Dexin Wang
  • Xiangzheng Zhang
  • Dawei Leng
  • Baochang Zhang
  • Xiangyang Ji
  • Yafeng Deng

Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. In contrast to plenty of available benchmarks with English corpus, large-scale pre-training datasets and downstream datasets with Chinese corpus remain largely unexplored. In this work, we build a large-scale high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset Zero and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, plus two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with the CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With the Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks including image-text retrieval, image-text matching, image caption, text-to-image generation, and zero-shot image classification. The datasets, models, and codes are available at

CPLFormer: Cross-scale Prototype Learning Transformer for Image Snow Removal

  • Sixiang Chen
  • Tian Ye
  • Yun Liu
  • Jinbin Bai
  • Haoyu Chen
  • Yunlong Lin
  • Jun Shi
  • Erkang Chen

Removing snow from a single image poses a significant challenge within the image restoration domain, as the effects of snowfall appear at various scales and in various forms. Existing methods have tried to tackle this issue by using multi-scale approaches, but their reliance on targeted design for handling each single-scale feature has resulted in unsatisfactory performance. This is primarily due to a lack of cross-scale knowledge, making it difficult to effectively handle degradations. To this end, we propose a novel approach, CPLFormer, which uses snow prototypes to gain a comprehensive understanding of the clean scene by learning from cross-scale features, outperforming convolutional network and vanilla transformer-based solutions. CPLFormer has several advantages: firstly, learnable snow prototypes learn global context information from multiple scales to uncover hidden clean cues; secondly, prototypes can propagate cross-scale information to each patch through cross-attention to assist with clean patch reconstruction; thirdly, CPLFormer surpasses advanced state-of-the-art desnowing networks and the prevalent universal image restoration transformers on six synthetic and real-world benchmark tests.

Video Entailment via Reaching a Structure-Aware Cross-modal Consensus

  • Xuan Yao
  • Junyu Gao
  • Mengyuan Chen
  • Changsheng Xu

This paper targets the task of video entailment, which aims to achieve a thorough comprehension and draw inferences on whether a natural language statement entails or contradicts a given multi-modal video. Despite the recent progress, most existing methods focus on designing a vision-language encoder for multi-modal feature extraction in video entailment and ignore the underlying consensus knowledge between the two modalities, which hinders reasoning performance. As human beings, we make sense of the world by synthesizing information from different sense perceptions, acquiring consensus among multiple modalities to form a more thorough and coherent representation of the surroundings, as well as to perform complicated understanding tasks. In this paper, we attempt to recreate this ability to infer the truthfulness of a given statement in the context of video entailment. To this end, we propose a unified structure-aware cross-modal consensus method to excavate the consensus semantics shared between video and language modalities and incorporate them into video entailment as statement-related clues. Specifically, the consensus information is obtained by filtering away redundant information using the global information from one modality and the local complementary information from the other. Moreover, a consensus-guided graph reasoning method is designed to explore inter-modality consistency and emphasize the significant features related to the judged statement, generating the inference results. Extensive experiments on two benchmarks demonstrate the accurate and robust performance of our approach compared to state-of-the-art methods. Code is available at

Cerebrovascular Segmentation in TOF-MRA with Topology Regularization Adversarial Model

  • Cheng Chen
  • Yunqing Chen
  • Shuang Song
  • Jianan Wang
  • Huansheng Ning
  • Ruoxiu Xiao

Time-of-flight magnetic resonance angiography (TOF-MRA) is a common cerebrovascular imaging technique. Accurate and automatic cerebrovascular segmentation in TOF-MRA images is an important auxiliary method in clinical practice. Due to complex semantics and noise interference, existing segmentation methods often fail to pay attention to topological correlation, resulting in the neglect of branch vessels and the destruction of vascular topology. In this paper, we propose a topology regularization adversarial model for cerebrovascular segmentation in TOF-MRA images. Firstly, we train a self-supervised model to learn the spatial semantic layout in TOF-MRA images by image context restoration. Subsequently, we exploit initialization based on the self-supervised model and construct an adversarial model to accomplish parameter optimization. Considering the limitations caused by the uneven distribution of cerebrovascular classes, we introduce skeleton structures as discriminative features to enhance vessel topological strength. We compare our method against several recent models on two datasets. Results show that the proposed model attains the highest score. Therefore, our method can obtain accurate connectivity information and higher graph similarity, leading to more meaningful clinical utility.

Hierarchical Reasoning Network with Contrastive Learning for Few-Shot Human-Object Interaction Recognition

  • Jiale Yu
  • Baopeng Zhang
  • Qirui Li
  • Haoyang Chen
  • Zhu Teng

Few-shot learning (FSL) for human-object interaction aims at classifying samples of new unseen HOI classes with only a few labeled samples available. Although progress has been made in few-shot human-object interaction, most of the existing methods encounter two issues in handling fine-grained interactions: the inability to capture more subtle interactive clues and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, we propose a hierarchical reasoning network to integrate multi-level interactive clues (from coarse to fine-grained) for strengthening HOI representations. The hierarchical relation module mainly captures and aggregates more discriminative relation information among human parts at multiple levels (including the human instance, action region, and body part levels) and objects via a unified graph and exploits a language-guided attentive fusion way to highlight informative features of each interaction level. To address the second issue, we introduce a contrastive learning mechanism to alleviate the inter-class variance. Compared with the previous ProtoNet-based methods, our model generates more discriminative representations for low inter-class variance data, since it makes full use of potential contrastive pairs in each training episode. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art FS-HOI methods.

Uncertainty-Driven Dynamic Degradation Perceiving and Background Modeling for Efficient Single Image Desnowing

  • Sixiang Chen
  • Tian Ye
  • Chenghao Xue
  • Haoyu Chen
  • Yun Liu
  • Erkang Chen
  • Lei Zhu

Single-image snow removal aims to restore clean images from heterogeneous and irregular snow degradations. Recent methods utilize neural networks to remove various degradations directly. However, these approaches suffer from the limited ability to flexibly perceive complicated snow degradation patterns and insufficient representation of background structure information. To further improve the performance and generalization ability of snow removal, this paper aims to develop a novel and efficient paradigm from the perspective of degradation perceiving and background modeling.

For this purpose, we first analyze two critical properties in real snow images, namely local-region heterogeneity and axial anisotropy. Inspired by them, we propose Dynamic Perceiving for Degraded Regions and Axial-Pooling Attention for Background Structure Modeling, which together couple a new network architecture, dubbed as D2P-BMNet. Our proposed D2P-BMNet offers several key advantages: (i) It can effectively segment regions under the uncertainty map's guidance, and dynamically perceives heterogeneous degradations within various regions. (ii) By utilizing linear attention solely along a horizontal axis, it can effectively model clean scene information that is buried beneath the snow. (iii) D2P-BMNet significantly improves over prior methods across all benchmarks and maintains excellent inference speeds.

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

  • Chenpeng Du
  • Qi Chen
  • Tianyu He
  • Xu Tan
  • Xie Chen
  • Kai Yu
  • Sheng Zhao
  • Jiang Bian

While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM-based image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. During inference, DAE-Talker first predicts the latents from speech and then generates the video frames with the image decoder in DAE from the predicted latents. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM-based image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.

Spatio-Temporal Branching for Motion Prediction using Motion Increments

  • Jiexin Wang
  • Yujie Zhou
  • Wenwen Qiang
  • Ying Ba
  • Bing Su
  • Ji-Rong Wen

Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved success by learning spatio-temporal representations of motion, but these models often overlook the reliability of motion data. Additionally, the temporal and spatial dependencies of skeleton nodes are distinct. The temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between different nodes. In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our approach effectively reduces noise interference and provides more expressive information for characterizing motion by separately extracting temporal and spatial features. We evaluate our approach on standard HMP benchmarks and outperform state-of-the-art methods in terms of prediction accuracy. Code is available at

Generative Neutral Features-Disentangled Learning for Facial Expression Recognition

  • Zhenqian Wu
  • Yazhou Ren
  • Xiaorong Pu
  • Zhifeng Hao
  • Lifang He

Facial expression recognition (FER) plays a critical role in human-computer interaction and affective computing. Traditional FER methods typically rely on comparing the difference between an examined facial expression and a neutral face of the same person to extract the motion of facial features and filter out expression-irrelevant information. With the extensive use of deep learning, the performance of FER has been further improved. However, existing deep learning-based methods rarely utilize neutral faces. To address this gap, we propose a novel deep learning-based FER method called Generative Neutral Features-Disentangled Learning (GNDL), which draws inspiration from the facial feature manifold. Our approach integrates a neutral feature generator (NFG) that generates neutral features in scenarios where the neutral face of the same subject is not available. The NFG uses fine-grained features from examined images as input and produces corresponding neutral features with the same identity. We train the NFG using a neutral feature reconstruction loss to ensure that the generative neutral features are consistent with the actual neutral features. We then disentangle the generative neutral features from the examined features to remove disturbance features and generate an expression deviation embedding for classification. Extensive experimental results on three popular databases (CK+, Oulu-CASIA, and MMI) demonstrate that our proposed GNDL method outperforms state-of-the-art FER methods.

Deep Algorithm Unrolling with Registration Embedding for Pansharpening

  • Tingting Wang
  • Yongxu Ye
  • Faming Fang
  • Guixu Zhang
  • Ming Xu

Pansharpening aims to sharpen low-resolution (LR) multispectral (MS) images with the help of corresponding high-resolution (HR) panchromatic (PAN) images to obtain HRMS images. Model-based pansharpening methods manually design objective functions via an observation model and hand-crafted priors. However, performance inevitably degrades when the prior is invalid. Although many deep learning-based end-to-end pansharpening methods have been proposed recently, they still need to be improved because HRMS-related domain knowledge has been insufficiently studied. Besides, existing pansharpening methods rarely consider the misalignments between MS and PAN images, leading to poor performance. To tackle these issues, this paper proposes to unroll the observation model with registration embedding for pansharpening. Inspired by optical flow estimation, we embed the registration operation into the observation model to reconstruct the pansharpening function with the help of a deep prior of HRMS images, and then unroll the iterative solution into a novel deep convolutional network. Apart from the single HRMS supervision, we also introduce a consistency loss to supervise the two degradation processes. The use of the consistency loss enables the degradation sub-networks to learn more realistic degradation. Experimental results at reduced resolution and full resolution are reported to demonstrate the superiority of the proposed method over other state-of-the-art pansharpening methods. On the GaoFen-2 dataset, our method achieves 1.2 dB higher PSNR than SOTA techniques.

DAOT: Domain-Agnostically Aligned Optimal Transport for Domain-Adaptive Crowd Counting

  • Huilin Zhu
  • Jingling Yuan
  • Xian Zhong
  • Zhengwei Yang
  • Zheng Wang
  • Shengfeng He

Domain adaptation is commonly employed in crowd counting to bridge the domain gaps between different datasets. However, existing domain adaptation methods tend to focus on inter-dataset differences while overlooking the intra-differences within the same dataset, leading to additional learning ambiguities. These domain-agnostic factors, e.g., density, surveillance perspective, and scale, can cause significant in-domain variations, and the misalignment of these factors across domains can lead to a drop in performance in cross-domain crowd counting. To address this issue, we propose a Domain-agnostically Aligned Optimal Transport (DAOT) strategy that aligns domain-agnostic factors between domains. The DAOT consists of three steps. First, individual-level differences in domain-agnostic factors are measured using structural similarity (SSIM). Second, the optimal transport (OT) strategy is employed to smooth out these differences and find the optimal domain-to-domain misalignment, with outlier individuals removed via a virtual "dustbin" column. Third, knowledge is transferred based on the aligned domain-agnostic factors, and the model is retrained for domain adaptation to bridge the gap across domains. We conduct extensive experiments on five standard crowd-counting benchmarks and demonstrate that the proposed method has strong generalizability across diverse datasets. Our code will be available at:

Partial Annotation-based Video Moment Retrieval via Iterative Learning

  • Wei Ji
  • Renjie Liang
  • Lizi Liao
  • Hao Fei
  • Fuli Feng

Given a descriptive language query, Video Moment Retrieval (VMR) aims to seek the corresponding semantic-consistent moment clip in the video, which is represented as a pair of the start and end timestamps. Although current methods have achieved satisfying performance, training these models heavily relies on the fully-annotated VMR datasets. Nonetheless, precise video temporal annotations are extremely labor-intensive and ambiguous due to the diverse preferences of different annotators.

Although several works have explored weakly supervised VMR tasks with scattered annotated frames as labels, there is still much room to improve in terms of accuracy. Therefore, we design a new setting of VMR where users can easily point to small segments of non-controversial video moments and our proposed method can automatically fill in the remaining parts based on the video and query semantics. To support this, we propose a new framework named Video Moment Retrieval via Iterative Learning (VMRIL). It treats the partial temporal region as the seed, then expands the pseudo label by iterative training. To restrict the expansion to reasonable boundaries, we utilize a pretrained video action localization model to provide coarse guidance on potential video segments. Compared with other VMR methods, our VMRIL achieves a trade-off between satisfying performance and annotation efficiency. Experimental results show that our proposed method achieves SOTA performance in the weakly supervised VMR setting, and is even comparable with some fully-supervised VMR methods at a much lower annotation cost.

Style Transfer Meets Super-Resolution: Advancing Unpaired Infrared-to-Visible Image Translation with Detail Enhancement

  • Yirui Shen
  • Jingxuan Kang
  • Shuang Li
  • Zhenjie Yu
  • Shuigen Wang

The problem of unpaired infrared-to-visible image translation has gained significant attention due to its ability to generate visible images with color information from low-detail grayscale infrared inputs. However, current methodologies often depend on conventional style transfer techniques, which constrain the spatial resolution of the visible output to be equivalent to that of the input infrared image. The fixed generation pattern results in blurry generated results when translating low-resolution infrared inputs, and utilizing high-resolution infrared inputs as a solution necessitates greater computational resources. This spurs us to investigate the challenging unpaired image translation from low-resolution infrared inputs to high-resolution visible outputs, with the ultimate goal of enhancing image details while reducing computational costs. Therefore, we propose a unified framework that integrates the super-resolution process into our unpaired infrared-to-visible image transfer, yielding realistic and high-resolution results. Specifically, we propose the Detail Consistency Loss to establish a connection between the two aforementioned modules, thereby enhancing the quality of visual detail in style transfer results through the super-resolution module. Furthermore, our Texture Perceptual Loss is designed to ensure that the generator generates high-quality visual details accurately and reliably. Experimental results indicate that our method outperforms other comparative approaches when utilizing low-resolution infrared inputs. Remarkably, our approach even surpasses techniques that use high-resolution infrared inputs to generate visible images. Last but equally important, we propose a new and challenging dataset, dubbed InfraredCity-HD, which comprises 512×512 resolution images, to advance research on high-resolution infrared-related fields.

Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes

  • Chongyang Zhao
  • Yuankai Qi
  • Qi Wu

Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction. Unlike existing methods focused on predicting a more accurate action at each step in navigation, in this paper, we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) on four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. The high OSR indicates that the robot agent passes the target location, while the low SR suggests that the agent ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from a trajectory given by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model for learning a compact discriminative trajectory viewpoint representation, which is used to predict the confidence of being a target location as described in the instruction. The proposed method is evaluated on three widely-adopted datasets: R2R, REVERIE and NDH, and shows promising results, demonstrating the potential for more future research.
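
The SR/OSR gap described above can be illustrated with a minimal sketch. It assumes each episode is a list of the agent's distances to the target at every visited viewpoint; the 3 m success radius, the function names, and the example confidence scores are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

# Assumed success radius in metres (an illustrative choice, not from the abstract).
SUCCESS_RADIUS_M = 3.0


def success(distances: List[float], radius: float = SUCCESS_RADIUS_M) -> bool:
    """SR criterion: the agent's final viewpoint lies within `radius` of the target."""
    return distances[-1] <= radius


def oracle_success(distances: List[float], radius: float = SUCCESS_RADIUS_M) -> bool:
    """OSR criterion: any visited viewpoint lies within `radius` of the target."""
    return min(distances) <= radius


def rates(episodes: List[List[float]]) -> Tuple[float, float]:
    """Return (SR, OSR) averaged over all episodes."""
    n = len(episodes)
    sr = sum(success(e) for e in episodes) / n
    osr = sum(oracle_success(e) for e in episodes) / n
    return sr, osr


def best_stop_index(confidences: List[float]) -> int:
    """Pick the viewpoint with the highest (hypothetical) target confidence."""
    return max(range(len(confidences)), key=confidences.__getitem__)


# The first episode passes the target (2.5 m) but stops too far away (5.0 m),
# so it counts towards OSR but not towards SR.
episodes = [[10.0, 6.0, 2.5, 5.0], [8.0, 4.0, 1.0], [9.0, 7.0, 6.0]]
sr, osr = rates(episodes)
print(f"SR={sr:.2f} OSR={osr:.2f}")  # prints "SR=0.33 OSR=0.67"
```

Re-selecting the stopping viewpoint along an existing trajectory, as `best_stop_index` does with made-up scores, is one way such a gap could be narrowed without changing the underlying navigation policy.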

Feature-Suppressed Contrast for Self-Supervised Food Pre-training

  • Xinda Liu
  • Yaohui Zhu
  • Linhu Liu
  • Jiang Tian
  • Lili Wang

Most previous approaches for analyzing food images have relied on extensively annotated datasets, resulting in significant human labeling expenses due to the varied and intricate nature of such images. Inspired by the effectiveness of contrastive self-supervised methods in utilizing unlabelled data, we explore leveraging these techniques on unlabelled food images. In contrastive self-supervised methods, two views are randomly generated from an image by data augmentations. However, for food images, the two views tend to contain similar informative contents, causing large mutual information, which impedes the efficacy of contrastive self-supervised learning. To address this problem, we propose Feature Suppressed Contrast (FeaSC) to reduce mutual information between views. As the similar contents of the two views are salient or highly responsive in the feature map, the proposed FeaSC uses a response-aware scheme to localize salient features in an unsupervised manner. By suppressing some salient features in one view while leaving the other contrast view unchanged, the mutual information between the two views is reduced, thereby enhancing the effectiveness of contrastive learning for self-supervised food pre-training. As a plug-and-play module, the proposed method consistently improves BYOL and SimSiam by 1.70% to 6.69% in classification accuracy on four publicly available food recognition datasets. Superior results have also been achieved on downstream segmentation tasks, demonstrating the effectiveness of the proposed method.
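
As a rough illustration of suppressing salient features in one view only, the sketch below zeroes the k strongest responses of a flattened feature vector. Treating "salient" as "largest activation", the top-k rule, and the function name `suppress_salient` are all assumptions made for this example; the abstract does not specify how the response-aware scheme is implemented.

```python
from typing import List


def suppress_salient(features: List[float], k: int) -> List[float]:
    """Return a copy of `features` with the k largest responses set to zero.

    Assumption: saliency is approximated by activation magnitude.
    """
    ranked = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    salient = set(ranked[:k])
    return [0.0 if i in salient else v for i, v in enumerate(features)]


# One view is suppressed; the other contrast view is left unchanged.
view_a = [0.1, 0.9, 0.3, 0.7]
view_b = [0.2, 0.8, 0.4, 0.6]
suppressed_a = suppress_salient(view_a, k=2)
print(suppressed_a)  # prints [0.1, 0.0, 0.3, 0.0]
```

Because the two strongest responses are removed from one view, the pair shares less of its most obvious content, which is the intuition behind reducing mutual information between views.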

Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection

  • Yuchen Zhou
  • Guang Tan
  • Mengtang Li
  • Chao Gou

Human-object interaction (HOI) detection aims to interpret the interactions of human-object pairs. Existing methods adopt a one-step reasoning paradigm that simultaneously outputs multi-label results for all HOI pairs without distinguishing difficulties. However, there are significant variations among HOI pairs in the same image, which degrades the performance of these methods in challenging situations. In this paper, we argue that the model should prioritize hard samples after inferring easy ones, and that hard samples can benefit from easy ones. To this end, we propose a novel Multi-step Reasoning Network that progressively learns from easy to hard samples. In particular, an Easy-to-Hard Learning Block is introduced to enhance the representation of hard HOI pairs by prior associations. Additionally, we propose a Multi-step Reasoning Probability Transfer mechanism to enhance multi-label interaction classifications, which leverages cognitive associations and semantic dependencies. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods on two challenging benchmark datasets.

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

  • Chengyang Fang
  • Jiangnan Li
  • Liang Li
  • Can Ma
  • Dayong Hu

Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets. Compared with the state-of-the-art method pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on TextVQA and ST-VQA. Our code and models will be released at

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

  • Yunshi Lan
  • Xiang Li
  • Xin Liu
  • Yang Li
  • Wei Qin
  • Weining Qian

Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capability of systems in the absence of training data. Recently, by converting images into captions, information across modalities is bridged and Large Language Models (LLMs) can apply their strong zero-shot generalization capability to unseen questions. To design ideal prompts for solving VQA via LLMs, several studies have explored different strategies to select or generate question-answer pairs as exemplar prompts, which guide LLMs to answer the current questions effectively. However, they entirely ignore the role of question prompts. The original questions in VQA tasks often contain ellipses and ambiguity, which require intermediate reasoning. To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. Specifically, for each question, we first generate self-contained questions as reasoning question prompts via an unsupervised question editing module that considers sentence fluency, semantic integrity and syntactic invariance. Each reasoning question prompt clearly indicates the intent of the original question. This results in a set of candidate answers. Then, the candidate answers, together with their confidence scores acting as answer heuristics, are fed into LLMs to produce the final answer. We evaluate reasoning question prompts on three VQA challenges. Experimental results demonstrate that they can significantly improve the results of LLMs in the zero-shot setting and outperform existing state-of-the-art zero-shot methods on three out of four data sets. Our source code is publicly released at

Adaptive Decoupled Pose Knowledge Distillation

  • Jie Xu
  • Shanshan Zhang
  • Jian Yang

Existing state-of-the-art human pose estimation approaches require heavy computational resources for accurate prediction. One promising technique to obtain an accurate yet lightweight pose estimator is Knowledge Distillation (KD), which distills the pose knowledge from a powerful teacher model to a lightweight student model. However, existing human pose KD methods focus more on designing paired student and teacher network architectures, yet ignore the mechanism of pose knowledge distillation. In this work, we reformulate the human pose KD to a coarse to fine process and decouple the classical KD loss into three terms: Binary Keypoint vs. Non-Keypoint Distillation (BiKD), Keypoint Area Distillation (KAD) and Non-keypoint Area Distillation (NAD). Observing the decoupled formulation, we point out an important limitation of the classical pose KD, i.e. the bias between different loss terms limits the performance gain of the student network. To address the biased knowledge distillation problem, we present a novel KD method named Adaptive Decoupled Pose knowledge Distillation (ADPD), enabling BiKD, KAD and NAD to play their roles more effectively and flexibly. Extensive experiments on two standard human pose datasets, MPII and MS COCO, demonstrate that our proposed method outperforms previous KD methods and is generalizable to different teacher-student pairs. The code will be available at

Biased-Predicate Annotation Identification via Unbiased Visual Predicate Representation

  • Li Li
  • Chenwei Wang
  • You Qin
  • Wei Ji
  • Renjie Liang

Panoptic Scene Graph Generation (PSG) translates visual scenes to structured linguistic descriptions, i.e., mapping visual instances to subjects/objects, and their relationships to predicates. However, the annotators' preferences and semantic overlaps between predicates inevitably lead to the semantic mappings of multiple predicates to one relationship, i.e., biased-predicate annotations. As a result, with the contradictory mapping between the visual and linguistic modalities, PSG models struggle to construct clear decision planes among predicates, which causes their current poor performance. It is therefore essential for the PSG task to tackle this multi-modal contradiction. To this end, we propose a novel method that utilizes unbiased visual predicate representations for Biased-Annotation Identification (BAI) as a fundamental step for PSG/SGG tasks. Our BAI includes three main steps: predicate representation extraction, predicate representation debiasing, and biased-annotation identification. With flexible biased annotation processing methods, our BAI can act as a fundamental step of dataset debiasing. Experimental results demonstrate that our proposed BAI achieves state-of-the-art performance and, combined with suitable biased annotation processing methods, improves the performance of benchmark models to various degrees. Furthermore, our BAI shows great generalization and effectiveness on multiple datasets. Our codes are released at

Zero-Shot Object Detection by Semantics-Aware DETR with Adaptive Contrastive Loss

  • Huan Liu
  • Lu Zhang
  • Jihong Guan
  • Shuigeng Zhou

Zero-shot object detection (ZSD) aims to localize and recognize unseen objects in unconstrained images by leveraging semantic descriptions. Existing ZSD methods typically suffer from two drawbacks: 1) Due to the lack of data on unseen categories during the training phase, the model inevitably has a bias towards the seen categories, i.e., it prefers to subsume objects of unseen categories to seen categories; 2) It is usually very tricky for the feature extractor trained on data of seen categories to learn discriminative features that are good enough to help the model transfer the knowledge learned from data of seen categories to unseen categories. To tackle these problems, this paper proposes a novel zero-shot detection method based on a semantics-aware DETR and a class-wise adaptive contrastive loss. Concretely, to address the first problem, we develop a novel semantics-aware attention mechanism to mitigate the bias towards seen categories and integrate it into DETR, which results in a new end-to-end zero-shot object detection approach. Furthermore, to handle the second problem, a novel class-wise adaptive contrastive loss is proposed, which considers the relevance between each pair of categories according to their semantic description in order to learn separable features for better visual-semantic alignment. Extensive experiments and ablation studies on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.

Rethinking Missing Modality Learning from a Decoding Perspective

  • Tao Jin
  • Xize Cheng
  • Linjun Li
  • Wang Lin
  • Ye Wang
  • Zhou Zhao

The conventional pipeline of multimodal learning consists of three stages: encoding, fusion, and decoding. Most existing methods under the missing modality condition focus on the first stage and aim to learn a modality-invariant representation or reconstruct missing features. However, these methods rely on strong assumptions (i.e., all the pre-defined modalities are available for each input sample during training and the number of modalities is fixed). To solve this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both the training and inference phases, including unseen testing conditions. Different from previous methods, we improve the decoding stage. Concretely, IPD jointly learns the common and modality-specific task prototypes. Considering that the number of missing modality conditions scales exponentially with the number of modalities, O(2^n), and that different conditions may have implicit interaction, the low-rank partial prototype decomposition, supported by theoretical analysis, is employed for modality-specific components to reduce the complexity. The decomposition can also promote generalization to unseen conditions using the modality factors of existing conditions. To simulate the low-rank setup, we further constrain the explicit interaction of specific modality conditions by employing disentangled contrastive constraints. Extensive results on the newly-created benchmarks of multiple tasks illustrate the effectiveness of our proposed model.
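
The exponential growth in missing-modality conditions can be checked with a small sketch. It enumerates every non-empty subset of available modalities, assuming a "condition" means exactly such a subset; the modality names and the function name are illustrative only.

```python
from itertools import combinations
from typing import List, Tuple


def modality_conditions(modalities: List[str]) -> List[Tuple[str, ...]]:
    """Enumerate every non-empty subset of available modalities."""
    conditions: List[Tuple[str, ...]] = []
    for size in range(1, len(modalities) + 1):
        conditions.extend(combinations(modalities, size))
    return conditions


# With n modalities there are 2^n - 1 non-empty availability patterns.
conditions = modality_conditions(["audio", "video", "text"])
print(len(conditions))  # prints 7
```

Learning an independent prototype per condition would therefore require a number of parameter sets that doubles with each added modality, which is the cost that a shared, low-rank parameterization is meant to avoid.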

Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

  • Zhijin Ge
  • Fanhua Shang
  • Hongying Liu
  • Yuanyuan Liu
  • Liang Wan
  • Wei Feng
  • Xiaosen Wang

Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations to clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, among which input transformation is one of the most effective. In this work, we notice that existing input transformation-based works mainly adopt transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve transferability using data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images, with random noise added, with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that, compared with state-of-the-art input transformation-based attacks, our proposed method significantly improves adversarial transferability on both normally trained and adversarially trained models. Code is available at:

Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement

  • Xin Wang
  • Zihao Wu
  • Hong Chen
  • Xiaohan Lan
  • Wenwu Zhu

Video Grounding (VG) has drawn widespread attention over the past few years, and numerous studies have been devoted to improving performance on various VG benchmarks. Nevertheless, the label annotation procedures in VG produce imbalanced query-moment-label distributions in the datasets, which severely deteriorate the learning model's capability of truly understanding the video contents. Existing works on debiased VG focus either on adjusting the learning model or on conducting video-level augmentation, failing to handle the temporal bias issue caused by imbalanced query-moment-label distributions. In this paper, we propose a Disentangled Feature Mixup (DFM) framework for debiased VG, which is capable of performing unbiased grounding to tackle the temporal bias issue. Specifically, a feature-mixup augmentation strategy is designed to generate new (text, location) pairs with diverse temporal distributions by jointly augmenting the representation of text queries and the location labels. This strategy encourages making predictions based on more diverse data samples with balanced query-moment-label distributions. Furthermore, we design a content-location disentanglement module to disentangle the representations of the temporal information and content information in videos, which is able to remove the spurious effect of temporal biases on video representation. Given that our proposed DFM framework conducts feature-level augmentation and disentanglement, it is model-agnostic and can be applied to most baselines simply yet effectively. Extensive experiments show that our proposed DFM framework significantly outperforms baseline models on various metrics under both independent and identically distributed (i.i.d.) and out-of-distribution (o.o.d.) scenes, especially in scenarios with annotation distribution changes.
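
The idea of jointly mixing query features and location labels can be sketched with standard mixup, i.e. a convex combination with a coefficient `lam`. This is a generic illustration under that assumption: the function name, the normalized (start, end) label format, and the example values are hypothetical, and the DFM framework's actual mixing rule may differ.

```python
from typing import List, Tuple

Location = Tuple[float, float]  # assumed normalized (start, end) timestamps


def mixup_pair(
    feat_a: List[float],
    loc_a: Location,
    feat_b: List[float],
    loc_b: Location,
    lam: float,
) -> Tuple[List[float], Location]:
    """Blend two (text feature, location) pairs with the same coefficient `lam`."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    feat = [lam * x + (1.0 - lam) * y for x, y in zip(feat_a, feat_b)]
    start = lam * loc_a[0] + (1.0 - lam) * loc_b[0]
    end = lam * loc_a[1] + (1.0 - lam) * loc_b[1]
    return feat, (start, end)


# Two queries whose moments sit at opposite ends of the video yield a new
# training pair whose label falls in between.
new_feat, new_loc = mixup_pair([1.0, 0.0], (0.0, 0.2), [0.0, 1.0], (0.6, 1.0), lam=0.5)
print(new_feat, new_loc)
```

Using the same coefficient for features and labels keeps each synthetic pair internally consistent while spreading the location labels over a wider range of temporal positions.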

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval

  • Yaya Shi
  • Haowei Liu
  • Haiyang Xu
  • Zongyang Ma
  • Qinghao Ye
  • Anwen Hu
  • Ming Yan
  • Ji Zhang
  • Fei Huang
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu
  • Zheng-Jun Zha

Previous dual-encoder pre-training methods for video-text retrieval employ contrastive learning for cross-modal alignment in a latent space. However, such learned latent spaces often suffer from the modality gap problem [26]. In this paper, we introduce a novel SemVTR framework designed to learn semantics-grounded video-text representations in a vocabulary space, in which each dimension corresponds to a semantic concept represented by a word. The representation is obtained by grounding video and text into semantically related dimensions with high activation values. As video-text pairs share grounded dimensions, their vocabulary representations are expected to cluster together and thus alleviate the modality gap problem. The crux of our method therefore lies in grounding video and text into the vocabulary space. Specifically, we propose a Multi-Granularity Video Semantics Grounding approach and a Textual Semantics Preserving training strategy. Visualization illustrates that SemVTR obtains semantics-grounded vocabulary representations and also alleviates the modality gap problem. SemVTR significantly outperforms existing methods on four video-text retrieval benchmarks.

Learning a Graph Neural Network with Cross Modality Interaction for Image Fusion

  • Jiawei Li
  • Jiansheng Chen
  • Jinyuan Liu
  • Huimin Ma

Infrared and visible image fusion has gradually proved to be a vital branch of multi-modality imaging technologies. In recent developments, researchers focus not only on the quality of fused images but also on their performance in downstream tasks. Nevertheless, most methods seldom consider mutual learning between different modalities, resulting in fused images that lack significant details and textures. To overcome this issue, we propose an interactive graph neural network (GNN)-based cross-modality architecture for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to obtain shallow features, which serve as the input for building graph structures. Then, the graph interaction module constructs graph structures from the extracted intermediate features of the infrared and visible branches. Meanwhile, the graph structures of the two branches interact for cross-modality and semantic learning, so that fused images can maintain the important feature expressions and enhance the performance of downstream tasks. In addition, the proposed leader nodes improve information propagation within the same modality. Finally, we merge all graph features to obtain the fusion result. Extensive experiments on different datasets (i.e., TNO, MFNet, and M3FD) demonstrate that our IGNet can generate visually appealing fused images while scoring on average 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods. The source code of the proposed IGNet is available at

COPA: Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment

  • Chaoya Jiang
  • Haiyang Xu
  • Wei Ye
  • Qinghao Ye
  • Chenliang Li
  • Ming Yan
  • Bin Bi
  • Shikun Zhang
  • Fei Huang
  • Ji Zhang

Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches circumvent this issue but struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf fine-grained object annotations in 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an effective patch detector that accurately detects text-relevant patches, thus considerably shortening patch sequences and accelerating computation within the ViT backbone. Our experiments on a variety of widely used benchmarks reveal that our method achieves a speedup of nearly 88% compared to prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.

Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

  • Shuyu Yang
  • Yinan Zhou
  • Zhedong Zheng
  • Yaxiong Wang
  • Li Zhu
  • Yujiao Wu

In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of pre-training on both attribute recognition and image-text matching tasks at the same time. In particular, MALS contains 1,510,330 image-text pairs, which is about 37.5× larger than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework, considering the shared knowledge between attribute and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) The attribute prompt learning leverages the attribute prompts for image-attribute alignment, which enhances the text matching learning. (2) The text matching learning facilitates representation learning on fine-grained details and, in turn, boosts the attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS, achieving state-of-the-art retrieval performance via APTM on three challenging real-world benchmarks. In particular, APTM achieves consistent improvements of +6.96%, +7.68%, and +16.95% in Recall@1 accuracy on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively. The dataset, model, and code are available at

Towards Real-Time Sign Language Recognition and Translation on Edge Devices

  • Shiwei Gan
  • Yafeng Yin
  • Zhiwei Jiang
  • Lei Xie
  • Sanglu Lu

To provide instant communication for hearing-impaired people, it is essential to achieve real-time sign language processing anytime, anywhere. Therefore, in this paper, we propose a Region-aware Temporal Graph based neural Network (RTG-Net), aiming to achieve real-time Sign Language Recognition (SLR) and Translation (SLT) on edge devices. To reduce the computation overhead, we first construct a shallow graph convolution network that reduces model size by decreasing model depth. In addition, we apply structural re-parameterization to fuse the convolutional layer, the batch normalization layer, and all branches, simplifying the model by reducing its width. To also achieve high performance in sign language processing, we extract key regions from each frame based on skeleton keypoints, and design a region-aware temporal graph that combines key regions and the full frame for feature representation. In RTG-Net, we design a multi-stage training strategy to optimize keypoint selection, SLR, and SLT step by step. Experimental results demonstrate that RTG-Net achieves performance comparable to existing methods in SLR and SLT, while greatly reducing the computation overhead and achieving real-time sign language processing on edge devices. Our code is available at
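The convolution and batch-normalization fusion mentioned above rests on a well-known identity: at inference time, batch normalization is an affine map, so it can be folded into the preceding layer's weight and bias. The sketch below shows this for a single channel with a scalar weight. It is a toy illustration, not RTG-Net code, and the function names are hypothetical.

```python
import math

def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: a scalar 'convolution' followed by inference-time batch norm."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm statistics into a single weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta

# Toy parameters for one channel (arbitrary values chosen for illustration).
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
fused_w, fused_b = fuse_conv_bn(**params)
for x in (-1.0, 0.0, 2.5):
    # The two columns should agree up to floating-point rounding.
    print(conv_then_bn(x, **params), fused_w * x + fused_b)
```

After fusion, the batch-normalization layer disappears from the inference graph, which is why this kind of re-parameterization reduces cost without changing the outputs.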

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

  • Qiwei Li
  • Zuchao Li
  • Xiantao Cai
  • Bo Du
  • Hai Zhao

In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models a layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model captures the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND, and CORD, and it achieves state-of-the-art results on these datasets. Our experimental results demonstrate that the proposed method provides a significant improvement over existing approaches and showcase the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.

Non-Exemplar Class-Incremental Learning via Adaptive Old Class Reconstruction

  • Shaokun Wang
  • Weiwei Shi
  • Yuhang He
  • Yifan Yu
  • Yihong Gong

In the Class-Incremental Learning (CIL) task, rehearsal-based approaches have received a lot of attention recently. However, storing old class samples is often infeasible in application scenarios where device memory is insufficient or data privacy is important. Therefore, it is necessary to rethink Non-Exemplar Class-Incremental Learning (NECIL). In this paper, we propose a novel NECIL method named POLO with an adaPtive Old cLass recOnstruction mechanism, in which a density-based prototype reinforcement method (DBR), a topology-correction prototype adaptation method (TPA), and an adaptive prototype augmentation method (APA) are designed to reconstruct pseudo features of old classes in new incremental sessions. Specifically, the DBR focuses on the low-density features to maintain the model's discriminative ability for old classes. Afterward, the TPA is designed to adapt old class prototypes to new feature spaces in the incremental learning process. Finally, the APA is developed to further adapt pseudo feature spaces of old classes to new feature spaces. Experimental evaluations on four benchmark datasets demonstrate the effectiveness of our proposed method over the state-of-the-art NECIL methods.

CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

  • Ruixiang Jiang
  • Lingbo Liu
  • Changwen Chen

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available:

Self-Supervised Cross-Language Scene Text Editing

  • Fuxiang Yang
  • Tonghua Su
  • Xiang Zhou
  • Donglin Di
  • Zhongjie Wang
  • Songze Li

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task lie in the difficulty of distinguishing text from background, large distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which is capable of separating the foreground text from the background, as well as further decomposing the content and style of the foreground text. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on the Chinese-English cross-language dataset show that our proposed model can generate realistic text images, specifically when modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, Bengali, and so on.

Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER

  • Feng Chen
  • Jiajia Liu
  • Kaixiang Ji
  • Wang Ren
  • Jian Wang
  • Jingdong Chen

The challenge posed by multimodal named entity recognition (MNER) is mainly two-fold: (1) bridging the semantic gap between text and image and (2) matching the entity with its associated object in the image. Existing methods fail to capture the implicit entity-object relations due to the lack of corresponding annotation. In this paper, we propose a bidirectional generative alignment method named BGA-MNER to tackle these issues. Our BGA-MNER consists of image2text and text2image generation with respect to entity-salient content in the two modalities. It jointly optimizes the bidirectional reconstruction objectives, aligning the implicit entity-object relations under such direct and powerful constraints. Furthermore, image-text pairs usually contain unmatched components that act as noise for generation. A stage-refined context sampler is proposed to extract the matched cross-modal content for generation. Extensive experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance without image input during inference.

MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation

  • Liang He
  • Hongke Wang
  • Yongchang Cao
  • Zhen Wu
  • Jianbing Zhang
  • Xinyu Dai

Extracting relational facts from multimodal data is a crucial task in the field of multimedia and knowledge graphs that feeds into widespread real-world applications. The emphasis of recent studies centers on recognizing relational facts in which both entities are present in one modality and supplementary information is used from other modalities. However, such works disregard a substantial amount of multimodal relational facts that arise across different modalities, such as one entity seen in a text and another in an image. In this paper, we propose a new task, namely Multimodal Object-Entity Relation Extraction, which aims to extract "object-entity" relational facts from image and text data. To facilitate research on this task, we introduce MORE, a new dataset comprising 21 relation types and 20,136 multimodal relational facts annotated on 3,522 pairs of textual news titles and corresponding images. To show the challenges of Multimodal Object-Entity Relation Extraction, we evaluated recent state-of-the-art methods for multimodal relation extraction and conducted a comprehensive experimentation analysis on MORE. Our results demonstrate significant challenges for existing methods, underlining the need for further research on this task. Based on our experiments, we identify several promising directions for future research. The MORE dataset and code are available at

Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning

  • Ziyue Wu
  • Junyu Gao
  • Changsheng Xu

Video Scene Graph Generation (VidSGG), which aims to detect the relations between objects in a continuous spatio-temporal environment, has shown great potential in video understanding. Almost all prevailing VidSGG approaches are fully supervised and require expensive manual annotations. Therefore, we introduce a novel and challenging task named Weakly-supervised Video Scene Graph Generation (WS-VidSGG), in which a model is trained with only unlocalized scene graphs as supervisory information. Due to the imbalanced data distribution and the lack of fine-grained annotations, models learned in this setting are prone to bias. We therefore propose an Unbiased Cross-Modal Learning (UCML) framework to address the WS-VidSGG task. Specifically, a cross-modal alignment module is first designed to allocate pseudo labels to unlabeled visual objects. We then extract unbiased knowledge from dataset statistics and utilize prompts to help our model finely comprehend semantic concepts. The features learned from the prompts and the unbiased knowledge reinforce each other, resulting in discriminative textual representations. To better explore the relations between visual entities, we design a knowledge-guided attention graph to capture the cross-modal relations. Finally, the learned textual and visual features are integrated into a unified framework for relation prediction. Extensive ablation studies verify the effectiveness of our framework. Moreover, a comparison with state-of-the-art fully-supervised methods shows that our proposed framework achieves comparable performance. Code is available.

Reducing Intrinsic and Extrinsic Data Biases for Moment Localization with Natural Language

  • Jiong Yin
  • Liang Li
  • Jiehua Zhang
  • Chenggang Yan
  • Lei Zhang
  • Zunjie Zhu

Moment Localization with Natural Language (MLNL) aims to locate the target moment in an untrimmed video given a linguistic query. Recent works reveal the severe data bias problem in MLNL and point out that the multi-modal content may not be understood when models simply fit the timestamp distribution. In this paper, we study data biases in their intrinsic and extrinsic aspects: the former is mainly caused by the ambiguity of the moment boundary and the information imbalance between input and output; the latter results from the long-tail distribution of moments in MLNL datasets. To alleviate these biases, we propose a hybrid multi-modal debiasing network with a temporal consistency constraint for MLNL. Specifically, we first design the multi-temporal Transformer to mitigate boundary ambiguity by integrating frame-wise features into segment-wise ones and dynamically matching them with moment boundaries. Then, we introduce the temporal consistency constraint, which highlights the action information in complex moment content to overcome the intrinsic bias from information imbalance. Furthermore, we design the hybrid linguistic activating module with external knowledge to relieve the extrinsic bias; it introduces prior guidance to focus on the discriminative information from the tail samples. Extensive experiments on three public datasets demonstrate that our model outperforms the existing methods.

VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients

  • Yaoming Wang
  • Yuchen Liu
  • Xiaopeng Zhang
  • Jin Li
  • Bowen Shi
  • Chenglin Li
  • Wenrui Dai
  • Hongkai Xiong
  • Qi Tian

Parameter-Efficient Tuning (PET) has emerged as a leading advancement in both Natural Language Processing and Computer Vision, enabling efficient adaptation to downstream tasks without costly fine-tuning. However, most existing PET approaches are limited to uni-modal tuning, even for vision-language models like CLIP. We investigate this limitation and demonstrate that simultaneous tuning of the two modalities in such models leads to multi-modal forgetting and catastrophic performance degradation, particularly when generalizing to new classes. To address this issue, we propose a novel PET approach called VioLET (Vision Language Efficient Tuning) that utilizes collaborative multi-modal gradients to unlock the full potential of both modalities. Specifically, we incorporate an additional visual encoder without learnable parameters and use these two visual encoders to compute the gradients of the context parameters separately. When conflicts arise, we replace the original gradient with an orthogonal gradient. Extensive experiments are conducted on few-shot recognition and unseen class generalization tasks using ResNet-50 or ViT-B/16 as the backbone. VioLET consistently outperforms several state-of-the-art methods on 11 datasets, showcasing its superiority over existing PET approaches. The code is available at
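One common way to realize "replacing a conflicting gradient with an orthogonal one" is gradient projection: when two gradients have a negative inner product, the component of the first along the second is removed. The sketch below illustrates that generic idea. It is an assumption about how such a rule could look, not the VioLET algorithm, and `resolve_conflict` is a hypothetical name.

```python
def resolve_conflict(grad, ref_grad):
    """Return grad unchanged, or its projection orthogonal to ref_grad on conflict.

    Assumption: a 'conflict' means a negative inner product between the two
    gradients; the abstract does not specify the exact criterion.
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    ref_sq = sum(r * r for r in ref_grad)
    if dot >= 0.0 or ref_sq == 0.0:
        return list(grad)  # no conflict (or no reference direction): keep as-is
    coef = dot / ref_sq
    # Subtract the component of grad that points against ref_grad.
    return [g - coef * r for g, r in zip(grad, ref_grad)]

print(resolve_conflict([1.0, -2.0], [0.0, 1.0]))  # conflicting pair
print(resolve_conflict([1.0, 1.0], [1.0, 0.0]))   # non-conflicting pair
```

The projected gradient no longer opposes the reference direction, so an update along it cannot, to first order, undo what the reference gradient is trying to achieve.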

Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing

  • Junyi Zeng
  • Chong Bao
  • Rui Chen
  • Zilong Dong
  • Guofeng Zhang
  • Hujun Bao
  • Zhaopeng Cui

Recently, Neural Radiance Fields (NeRF) has exhibited significant success in novel view synthesis, surface reconstruction, etc. However, since no physical reflection is considered in its rendering pipeline, NeRF mistakes the reflection in the mirror as a separate virtual scene, leading to the inaccurate reconstruction of the mirror and multi-view inconsistent reflections in the mirror. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, which is able to learn accurate geometry and reflection of the mirror and support various scene manipulation applications with mirrors, such as adding new objects or mirrors into the scene and synthesizing the reflections of these new objects in mirrors, controlling mirror roughness, etc. To achieve this goal, we propose a unified radiance field by introducing the reflection probability and tracing rays following the light transport model of Whitted Ray Tracing, and also develop several techniques to facilitate the learning process. Experiments and comparisons on both synthetic and real datasets demonstrate the superiority of our method. The code and supplementary material are available on the project webpage:
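In Whitted-style ray tracing, a ray that hits a mirror-like surface spawns a reflected ray in the direction r = d - 2(d·n)n, where d is the incoming direction and n the unit surface normal. The sketch below computes that direction and then blends a directly rendered color with a reflected color using a reflection probability. The linear blend and the function names are illustrative assumptions, not the Mirror-NeRF formulation.

```python
def reflect(direction, normal):
    """Mirror an incoming ray direction about a unit surface normal."""
    d_dot_n = sum(d * n for d, n in zip(direction, normal))
    return [d - 2.0 * d_dot_n * n for d, n in zip(direction, normal)]

def blend_color(direct_rgb, reflected_rgb, reflection_prob):
    """Hypothetical blend: weight the reflected color by the reflection probability."""
    return [(1.0 - reflection_prob) * c + reflection_prob * r
            for c, r in zip(direct_rgb, reflected_rgb)]

# A ray travelling down and to the right hits a floor-like mirror facing up.
bounced = reflect([1.0, -1.0, 0.0], [0.0, 1.0, 0.0])
pixel = blend_color([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], reflection_prob=0.25)
print(bounced, pixel)
```

With a reflection probability of 0 the surface behaves as ordinary geometry, and with 1 it behaves as a perfect mirror, which is why a per-point probability can represent both in one field.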

Semi-supervised Deep Multi-view Stereo

  • Hongbin Xu
  • Weitao Chen
  • Yang Liu
  • Zhipeng Zhou
  • Haihong Xiao
  • Baigui Sun
  • Xuansong Xie
  • Wenxiong Kang

Significant progress has been witnessed in learning-based Multi-view Stereo (MVS) under supervised and unsupervised settings. To combine their respective merits in accuracy and completeness while reducing the demand for expensive labeled data, this paper explores the problem of learning-based MVS in a semi-supervised setting in which only a tiny part of the MVS data comes with dense depth ground truth. However, the huge variation of scenarios and flexible view settings may break the basic assumption in classic semi-supervised learning that unlabeled data and labeled data share the same label space and data distribution; we name this the semi-supervised distribution-gap ambiguity in the MVS problem. To handle these issues, we propose a novel semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the simple case in which the basic assumption holds in MVS data, consistency regularization encourages the model predictions to be consistent between the original sample and a randomly augmented sample. For the more troublesome case in which the basic assumption is violated in MVS data, we propose a novel style consistency loss to alleviate the negative effect caused by the distribution gap. The visual style of an unlabeled sample is transferred to a labeled sample to shrink the gap, and the model prediction on the generated sample is further supervised with the label of the original labeled sample. Experimental results in semi-supervised settings on multiple MVS datasets show the superior performance of the proposed method. With the same backbone network settings, our proposed SDA-MVS outperforms its fully-supervised and unsupervised baselines.

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

  • Chen Jiang
  • Hong Liu
  • Xuzheng Yu
  • Qing Wang
  • Yuan Cheng
  • Jia Xu
  • Zhongyi Liu
  • Qingpei Guo
  • Wei Chu
  • Ming Yang
  • Yuan Qi

In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
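To see why hard negatives matter for contrastive retrieval, it helps to look at the standard InfoNCE loss that such methods build on. The sketch below implements plain text-to-video InfoNCE over a similarity matrix. It is not the NegNCE loss from the abstract, whose exact form is not given here, and the temperature value is an arbitrary assumption.

```python
import math

def info_nce(sim, temperature=0.05):
    """Mean text-to-video InfoNCE loss; sim[i][j] scores text i against video j.

    The matched pair for text i is assumed to sit on the diagonal (video i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)

easy = [[1.0, 0.0], [0.0, 1.0]]  # negatives clearly dissimilar
hard = [[1.0, 0.9], [0.9, 1.0]]  # negatives almost as similar as positives
print(info_nce(easy), info_nce(hard))
```

The loss on the `hard` matrix is larger than on the `easy` one, since near-duplicate negatives take probability mass away from the matched pair; losses that emphasize hard negatives amplify exactly this signal.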

Temporal Sentence Grounding in Streaming Videos

  • Tian Gan
  • Xiao Wang
  • Yan Sun
  • Jianlong Wu
  • Qingpei Guo
  • Liqiang Nie

This paper aims to tackle a novel task, Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source and typically need to be processed on the fly in applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging because it requires the model to infer without future frames and to process long historical frames effectively, which earlier methods have not addressed. To address these challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments on the ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.

Modality-agnostic Augmented Multi-Collaboration Representation for Semi-supervised Heterogenous Face Recognition

  • Decheng Liu
  • Weizhao Yang
  • Chunlei Peng
  • Nannan Wang
  • Ruimin Hu
  • Xinbo Gao

Heterogeneous face recognition (HFR) aims to match input face identities across different image modalities. Due to the large modality gap and the limited amount of training data, HFR remains a challenging problem in biometrics and draws increasing attention. Existing methods typically extract modality-invariant features or generate homogeneous images to reduce the modality gap, but lack abundant labeled data to avoid overfitting. In this paper, we propose a novel Modality-Agnostic Augmented Multi-Collaboration representation for Heterogeneous Face Recognition (MAMCO-HFR) in a semi-supervised manner. The modality-agnostic augmentation strategy generates adversarial perturbations to map unlabeled faces into the modality-agnostic domain. The multi-collaboration feature constraint is designed to mine the inherent relationships between diverse layers for discriminative representation. Experiments on several large-scale heterogeneous face datasets (CASIA NIR-VIS 2.0, LAMP-HQ, and the Tufts Face dataset) show that the proposed algorithm achieves superior performance compared with state-of-the-art methods. The source code is available at

Swin-UNIT: Transformer-based GAN for High-resolution Unpaired Image Translation

  • Yifan Li
  • Yaochen Li
  • Wenneng Tang
  • Zhifeng Zhu
  • Jinhuo Yang
  • Yuehu Liu

The transformer model has achieved considerable success in various computer vision tasks owing to its capacity for modeling long-range dependencies. However, its application to high-resolution unpaired image translation using GANs has been limited because its complexity is quadratic in the spatial resolution of the input features. In this paper, we propose a novel transformer-based GAN for high-resolution unpaired image translation named Swin-UNIT. We design a two-stage generator consisting of a global style translation (GST) module and a recurrent detail supplement (RDS) module. The GST module focuses on translating low-resolution global features using self-attention. The RDS module offers quick information propagation from the global features to the detail features at a high resolution using cross-attention. Moreover, we customize a dual-branch discriminator to guide the generator. Extensive experiments demonstrate that our model achieves state-of-the-art results on unpaired image translation tasks.

PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks

  • Xiaoxiong Du
  • Jun Peng
  • Yiyi Zhou
  • Jinlu Zhang
  • Siting Chen
  • Guannan Jiang
  • Xiaoshuai Sun
  • Rongrong Ji

Synthesizing vivid human portraits is a research hot spot in image generation with a wide scope of applications. In addition to fidelity, generation controllability is another key factor that has long plagued its development. To address this issue, existing solutions usually adopt either textual or visual conditions for the target face synthesis, e.g., descriptions or segmentation masks, which still cannot fully control the generation due to the intrinsic shortcomings of each condition. In this paper, we propose to make use of both types of prior information to facilitate controllable face generation. In particular, we aim to produce coarse-grained information about faces, such as face shapes and poses, from the segmentation masks, while the text description is used to render detailed face attributes, e.g., face color, makeup, and gender. More importantly, we aim for the generation to be easily controlled by interactively editing both types of information, making face generation more applicable to real-world applications. To accomplish this, we propose a novel face generation model termed PixelFace+. In PixelFace+, both the text and the mask are encoded as pixel-wise priors, based on which the pixel synthesis process is conducted to produce the expected portraits. Meanwhile, the loss objectives are carefully designed to ensure that the generated faces are semantically aligned with both text and mask inputs. To validate the proposed PixelFace+, we conduct a comprehensive set of experiments on the widely recognized MMCelebA benchmark. We not only quantitatively compare PixelFace+ with several recently proposed Text-to-Face (T2F) generation methods, but also provide extensive qualitative analyses. The experimental results demonstrate that PixelFace+ not only outperforms existing generation methods in both image quality and conditional matching but also shows much better controllability of face generation. Moreover, PixelFace+ offers a convenient and interactive way of generating and manipulating faces by editing the text and mask inputs. Our source code and demo are given in our supplementary materials.

LiFT: Transfer Learning in Vision-Language Models for Downstream Adaptation and Generalization

  • Jingzheng Li
  • Hailong Sun

Pre-trained Vision-Language Models (VLMs) on large-scale image-text pairs, e.g., CLIP, have shown promising performance on zero-shot knowledge transfer. Recently, fine-tuning pre-trained VLMs to downstream few-shot classification with limited image annotation data yields significant gains. However, there are two limitations. First, most of the methods for fine-tuning VLMs only update newly added parameters while keeping the whole VLM frozen. Thus, it remains unclear how to directly update the VLM itself. Second, fine-tuning VLMs to a specific set of base classes would deteriorate the well-learned representation space such that the VLMs generalize poorly on novel classes. To address these issues, we first propose Layer-wise Fine-Tuning (LiFT) which achieves average gains of 3.9%, 4.3%, 4.2% and 4.5% on base classes under 2-, 4-, 8- and 16-shot respectively compared to the baseline CoOp over 11 datasets. Alternatively, we provide a parameter-efficient LiFT-Adapter exhibiting favorable performance while updating only 1.66% of total parameters. Further, we design scalable LiFT-NCD to identify both base classes and novel classes, which boosts the accuracy by an average of 5.01% over zero-shot generalization of CLIP, exploring the potential of VLMs in discovering novel classes.

VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts

  • Manman Zhang
  • Ge Luo
  • Yuchen Ma
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang

Live video commenting, or "bullet screen," is a popular social style on video platforms. Automatic live commenting has been explored as a promising approach to enhance the appeal of videos. However, existing methods neglect the diversity of generated sentences, limiting the potential to obtain human-like comments. In this paper, we introduce a novel framework called "VCMaster" for multimodal live video comments generation, which balances the diversity and quality of generated comments to create human-like sentences. We involve images, subtitles, and contextual comments as inputs to better understand complex video contexts. Then, we propose an effective Hierarchical Cross-Fusion Decoder to integrate high-quality trimodal feature representations by cross-fusing critical information from previous layers. Additionally, we develop a Sentence-Level Contrastive Loss to enlarge the distance between generated and contextual comments by contrastive learning. It helps the model to avoid the pitfall of simply imitating provided contextual comments and losing creativity, encouraging the model to achieve more diverse comments while maintaining high quality. We also construct a large-scale multimodal live video comments dataset with 292,507 comments and three sub-datasets that cover nine general categories. Extensive experiments demonstrate that our model achieves a level of human-like language expression and remarkably fluent, diverse, and engaging generated comments compared to baselines.

Whether you can locate or not? Interactive Referring Expression Generation

  • Fulong Ye
  • Yuxing Long
  • Fangxiang Feng
  • Xiaojie Wang

Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object. Existing methods construct REG models independently by using only the REs as ground truth for model training, without considering the potential interaction between REG and REC models. In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. Our experimental results on three RE benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics. Furthermore, a human evaluation shows that IREG generates better REs with the capability of interaction.

Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation

  • Yiming Li
  • Xiaoshan Yang
  • Changsheng Xu

Dynamic scene graphs have become a powerful tool for higher-level visual understanding tasks, and interest in dynamic scene graph generation (dynamic SGG) has grown over time. Recently, a number of methods have achieved significant progress in dynamic SGG by capturing temporal information with transformer or recurrent network structures. However, most existing methods focus only on predicting the head predicates and ignore the long-tail phenomenon, so the tail predicates are hard to recognize. In this paper, we propose a novel method named Iterative Learning with Extra and Inner Knowledge (I2LEK) to address the long-tail problem in dynamic SGG. The extra knowledge is obtained from commonsense, while inner knowledge is defined as the temporal evolution patterns of visual relationships. Specifically, we introduce extra knowledge to enrich the representations of predicates in the spatial dimension and adopt inner knowledge to implement knowledge sharing in the temporal dimension. With enriched representations and shared knowledge, I2LEK can accurately predict both the tail and head predicates. Moreover, an iterative learning strategy is proposed to fuse the extra knowledge, inner knowledge, and spatial-temporal context contained in videos, which further enhances the model's understanding of visual relationships. Our experimental results on the public Action Genome dataset demonstrate that our model achieves state-of-the-art performance.

Improving Image Captioning through Visual and Semantic Mutual Promotion

  • Jing Zhang
  • Yingshuai Xie
  • Xiaoqiang Liu

Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, which explores the mechanism of deep fusion and mutual promotion of multimodal information, realizing more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder to realize globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion for enhancing the decoding capability. Experimental evidence shows that our VST significantly surpasses the state-of-the-art approaches on the MSCOCO dataset and reaches a CIDEr score of 142% on the Karpathy test split.

Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

  • Minghao Zhu
  • Xiao Lin
  • Ronghao Dang
  • Chengju Liu
  • Qijun Chen

As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at
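
The abstract above names frame difference as the source of motion information. As a minimal sketch of that one ingredient only (not the FIMA framework itself), the snippet below computes per-pixel differences between consecutive frames. The function name `frame_differences` and the toy clip are assumptions made for illustration.

```python
def frame_differences(frames):
    """Return per-pixel differences between consecutive frames.

    `frames` is a list of T grayscale frames, each a list of rows of numbers.
    The result holds T - 1 difference maps with the same height and width.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


# Toy clip (hypothetical data): one bright pixel moving right by one step per frame.
clip = [
    [[1, 0, 0]],
    [[0, 1, 0]],
    [[0, 0, 1]],
]
motion = frame_differences(clip)
```

Each difference map is negative where the pixel left and positive where it arrived, which is why frame differences are a cheap proxy for motion compared with optical flow.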

Better Integrating Vision and Semantics for Improving Few-shot Classification

  • Zhuoling Li
  • Yong Wang

Some recent methods address few-shot classification by integrating visual and semantic prototypes. However, they usually ignore the difference in feature structure between the visual and semantic modalities, which leads to limited performance improvements. In this paper, we propose a novel method, called bimodal integrator (BMI), to better integrate visual and semantic prototypes. In BMI, we first construct a latent space for each modality via a variational autoencoder, and then align the semantic latent space to the visual latent space. Through this semantics-to-vision alignment, the semantic modality is mapped to the visual latent space and has the same feature structure as the visual modality. As a result, the visual and semantic prototypes can be better integrated. In addition, based on the multivariate Gaussian distribution and the prompt engineering, a data augmentation scheme is designed to ensure the accuracy of modality alignment during the training process. Experimental results demonstrate that BMI significantly improves few-shot classification, making simple baselines outperform the most advanced methods on miniImageNet and tieredImageNet datasets.

Multi-Domain Lifelong Visual Question Answering via Self-Critical Distillation

  • Mingrui Lao
  • Nan Pu
  • Yu Liu
  • Zhun Zhong
  • Erwin M. Bakker
  • Nicu Sebe
  • Michael S. Lew

Visual Question Answering (VQA) has achieved significant success over the last few years, but most studies focus on training a VQA model on a stationary domain (e.g., a given dataset). In real-world application scenarios, however, these methods are often inefficient because VQA systems are expected to extend their knowledge and meet the ever-changing demands of users. In this paper, we introduce a new and challenging multi-domain lifelong VQA task, dubbed MDL-VQA, which encourages the VQA model to continuously learn across multiple domains while mitigating forgetting on previously learned domains. Furthermore, we propose a novel replay-free Self-Critical Distillation (SCD) framework tailor-made for MDL-VQA, which alleviates the forgetting issue by transferring previous-domain knowledge from teacher to student models. First, we propose to introspect the teacher's understanding of original and counterfactual samples, thereby creating informative instance-relevant and domain-relevant knowledge for logits-based distillation. Second, on the side of feature-based distillation, we propose to introspect the reasoning behavior of the student model to establish the harmful domain-specific knowledge acquired in the current domain, and further leverage a metric learning strategy to encourage the student to learn useful knowledge in the new domain. Extensive experiments demonstrate that the SCD framework outperforms state-of-the-art competitors under different training orders.

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

  • Xue Song
  • Jingjing Chen
  • Yu-Gang Jiang

Cross-modal text-to-video retrieval aims to find semantically related videos for a text query. Since video and text are distinct modalities, the major challenge lies in building the correspondence between the two modalities so that relevant samples can be matched. Inherently, the text contains multiple relatively complete semantic units, each composed of three primary components, i.e., subject, predicate and object (SVO triplet). Therefore, correctly retrieving videos for texts requires similar modeling of video content, namely objects and their relations. To model fine-grained visual relations, this paper proposes a Multi-Granularity Matching (MGM) framework that considers both fine-grained relation triplet matching and coarse-grained global semantic matching for text-to-video retrieval. Specifically, in the proposed framework, we represent videos as SVO triplet tracklets by extracting frame-level relation triplets followed by temporal relation association across frames. Moreover, we design a transformer-based Bi-directional Fusion Block (BFB) to express each SVO triplet with a highly unified representation. The constructed SVO triplet tracklets provide a reasonable way to model fine-grained video contents, achieving a better alignment between videos and texts. Extensive experiments conducted on three benchmark datasets, i.e., MSR-VTT, LSMDC and MSVD, demonstrate the effectiveness of our proposed method.

HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification

  • Shuyi Ouyang
  • Hongyi Wang
  • Ziwei Niu
  • Zhenjia Bai
  • Shiao Xie
  • Yingying Xu
  • Ruofeng Tong
  • Yen-Wei Chen
  • Lanfen Lin

The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Additionally, they only extract visual features and perform cross-modal fusion at a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) A hierarchical multi-scale architecture that involves a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances in images. (2) Interactive Visual-Linguistic Attention, a novel attention mechanism module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic and multi-modal features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.

Depth-Aware Sparse Transformer for Video-Language Learning

  • Haonan Zhang
  • Lianli Gao
  • Pengpeng Zeng
  • Alan Hanjalic
  • Heng Tao Shen

In Video-Language (VL) learning tasks, a massive amount of text annotations describe geometrical relationships of instances (e.g., 19.6% to 45.0% in MSVD, MSR-VTT, MSVD-QA and MSRVTT-QA), and these often become the bottleneck of current VL tasks (e.g., 60.8% vs. 98.2% CIDEr in MSVD for geometrical and non-geometrical annotations). Considering the rich spatial information of depth maps, an intuitive way is to enrich the conventional 2D visual representations with depth information through current SOTA models, e.g., transformers. However, it is cumbersome to compute self-attention over long-range sequences and heterogeneous video-level representations with regard to computation cost and flexibility on various frame scales. To tackle this, we propose a hierarchical transformer, termed Depth-Aware Sparse Transformer (DAST). Specifically, to guarantee computational efficiency, a depth-aware sparse attention module with linear computational complexity is designed for each transformer layer to learn depth-aware 2D representations. Furthermore, we design a hierarchical structure to maintain multi-scale temporal coherence across long-range dependencies. These qualities of DAST make it compatible with a broad range of video-language tasks, including video captioning (achieving MSVD 107.8%, MSR-VTT 52.5% for CIDEr), video question answering (MSVD-QA 44.1%, MSRVTT-QA 39.4%), and video-text matching (MSR-VTT 215.7 for SumR). Our code is available at

Invariant Meets Specific: A Scalable Harmful Memes Detection Framework

  • Chuanpeng Yang
  • Fuqing Zhu
  • Jizhong Han
  • Songlin Hu

Harmful memes detection is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. Current research on this task mainly focuses on multimodal dual-stream models. However, the existing works ignore the misalignment of the memes caused by the modality gap. Moreover, the cross-modal interaction in the dual-stream models is insufficient to identify harmful memes. To this end, this paper proposes a scalable invariant and specific modality (ISM) representations framework via graph neural networks. The proposed ISM framework provides a comprehensive and disentangled view for memes and promotes inter-modal interaction. Specifically, ISM projects each modality to two distinct spaces. The first space is modality-invariant, learning the corresponding commonalities and reducing the modality gap. The second space is modality-specific, holding the distinctive characteristics of each modality and complementing the common latent features captured in invariant spaces. Then, we construct fully connected visual and textual graphs for each space. The unimodal graphs are fused to dynamically balance inter-modal and intra-modal relationships, which are complementary to the dual-stream models. Finally, an adaptive module is designed to weigh the proportion of each fusion graph for memes. Moreover, the mainstream multimodal dual-stream models could be employed as the backbone flexibly. Extensive experiments on five publicly available datasets show that the proposed ISM provides a stable improvement over baselines and produces a competitive performance compared with the existing harmful memes detection methods.

A Method of Micro-Geometric Details Preserving in Surface Reconstruction from Gradient

  • Wuyuan Xie
  • Miaohui Wang

Surface from gradient (SfG) is one of the fundamental methods to densely reconstruct 3D object surfaces in computer vision. However, the reconstruction of micro-geometric details has not been satisfactorily solved in existing SfG methods due to their non-integrability. In this paper, we present an effective discrete geometric approach to reconstruct fine-grained sharp surface features with non-integrability. Specifically, we investigate the fine-grained structure of surfaces in the micro-geometry domain based on an adaptive projection on vertices constrained by neighboring gradient vectors, and develop a gradient angle-guided energy optimization to generate a fine-grained surface. Experimental results on various challenging synthetic and real-world data show that the proposed method is able to effectively reconstruct challenging micro-geometric details for general SfG methods.

Progressive Positive Association Framework for Image and Text Retrieval

  • Wenhui Li
  • Yan Wang
  • Yuting Su
  • Lanjun Wang
  • Weizhi Nie
  • An-An Liu

With the increasing amount of multimedia data, the demand for fast and accurate access to information is growing. Image and text retrieval learns visual and textual semantic relationships for multimedia data management and content recognition. The main challenge of this task is how to derive image and text similarity based on local associations under a huge modality gap. However, existing methods compute semantic relevance using associations of all fragments (visual regions and textual words), which underestimates the uncertainty of associations and overlooks discriminative positive associations, leading to cross-modal correspondence ambiguity. To address these issues, we propose a novel Progressive Positive Association Framework (PPAF), which models association uncertainty as a normal distribution and progressively mines direct and potential positive associations according to the characteristics of the association distribution. We design positive association matching, which adaptively fuses multi-step associations for local matching depending on the relevance difference. In addition, we apply a KL loss constraint on the cross-modal association distribution in order to enhance local semantic alignment. Extensive experiments demonstrate the leading performance of PPAF.
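
The abstract above models association uncertainty as a normal distribution and constrains it with a KL loss. The exact loss used by PPAF is not given here, so the following is only a minimal sketch of the standard closed-form KL divergence between two univariate normal distributions, the kind of quantity such a constraint is built on. The function name `kl_normal` and its parameters are assumptions for illustration.

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2).

    Closed form: log(sigma_q / sigma_p)
                 + (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2) - 1/2
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


# Identical distributions have zero divergence; shifting the mean increases it.
same = kl_normal(0.0, 1.0, 0.0, 1.0)
shifted = kl_normal(1.0, 1.0, 0.0, 1.0)
```

The divergence is asymmetric in its two arguments, so which distribution plays the role of P and which plays Q is a design choice that the abstract does not specify.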

Globally-Robust Instance Identification and Locally-Accurate Keypoint Alignment for Multi-Person Pose Estimation

  • Fangzheng Tian
  • Sungchan Kim

Scenes with a large number of human instances are characterized by significant overlap of the instances with similar appearance, occlusion, and scale variation. We propose GRAPE, a novel method that leverages both Globally Robust human instance identification and locally Accurate keypoint alignment for 2D Pose Estimation. GRAPE predicts instance center and keypoint heatmaps, as global identifications of instance location and scale, and keypoint offset vectors from instance centers, as representations of accurate local keypoint positions. We use Transformer to jointly learn the global and local contexts, which allows us to robustly detect instance centers even in difficult cases such as crowded scenes, and align instance offset vectors with relevant keypoint heatmaps, resulting in refined final poses. GRAPE also predicts keypoint visibility, which is crucial for estimating centers of partially visible instances in crowded scenes. We demonstrate that GRAPE achieves state-of-the-art performance on the CrowdPose, OCHuman, and COCO datasets. The benefit of GRAPE is more apparent on crowded scenes (CrowdPose and OCHuman), where our model significantly outperforms previous methods, especially on hard examples.

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

  • Kun Zhang
  • Lei Zhang
  • Bo Hu
  • Mengxiao Zhu
  • Zhendong Mao

Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and textual features, existing paradigm typically first maps them into a d-dimensional shared representation space, then independently aggregates all dimensional correspondences of cross-modal features to reflect it, e.g., the inner product. However, in this paper, we are motivated by an insightful finding that dimensions are not mutually independent, but there are intrinsic dependencies among dimensions to jointly represent latent semantics. Ignoring this intrinsic information probably leads to suboptimal aggregation for semantic similarity, impairing cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn dimensions' semantic dependency degrees, and (2) devises the adaptive sparse probabilistic learning to autonomously make the model capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on Flickr30K and MS-COCO benchmarks.
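
The abstract above contrasts the inner product, which aggregates each dimension's correspondence independently, with an aggregation that accounts for dependencies among dimensions. The sketch below makes that contrast concrete with a generic bilinear form. It is not the X-Dim model: the dependency matrix `dep`, the function names, and the toy vectors are all hypothetical.

```python
def inner_product(v, t):
    """Similarity that sums per-dimension products independently."""
    return sum(a * b for a, b in zip(v, t))


def dependency_aware_similarity(v, t, dep):
    """Generic bilinear similarity: dep[i][j] weights the pairing of
    dimension i of `v` with dimension j of `t`.

    With `dep` equal to the identity matrix this reduces to the inner product.
    """
    return sum(
        dep[i][j] * v[i] * t[j]
        for i in range(len(v))
        for j in range(len(t))
    )


# Hypothetical visual and textual features in a shared 3-dimensional space.
visual = [1.0, 2.0, 3.0]
textual = [4.0, 5.0, 6.0]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed dependency: dimension 0 of the image also relates to dimension 1 of the text.
coupled = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

independent_score = inner_product(visual, textual)
identity_score = dependency_aware_similarity(visual, textual, identity)
coupled_score = dependency_aware_similarity(visual, textual, coupled)
```

The off-diagonal entry changes the score even though neither feature vector changed, which is the effect an independent per-dimension sum cannot express.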

Dark Knowledge Balance Learning for Unbiased Scene Graph Generation

  • Zhiqing Chen
  • Yawei Luo
  • Jian Shao
  • Yi Yang
  • Chunping Wang
  • Lei Chen
  • Jun Xiao

One of the major obstacles that hinders current scene graph generation (SGG) performance lies in the severe predicate annotation bias. Conventional solutions to this problem are mainly based on reweighting/resampling heuristics. Despite achieving some improvements on tail classes, these methods are prone to cause serious performance degradation of head predicates. In this paper, we propose to tackle this problem from a brand-new perspective of dark knowledge. Given the unique nature of SGG, which requires a large number of negative samples to be employed for predicate learning, we capitalize on the dark knowledge contained in negative samples to debias the predicate distribution. Along these lines, we propose a novel SGG method dubbed Dark Knowledge Balance Learning (DKBL). In DKBL, we first design a dark knowledge balancing loss, which helps the model learn to balance head and tail predicates while maintaining the overall performance. We further introduce a dark knowledge semantic enhancement module to better encode the semantics of predicates. DKBL is orthogonal to existing SGG methods and can be easily plugged into their training process for further improvement. Extensive experiments on the VG dataset show that the proposed DKBL consistently achieves a good trade-off between head and tail predicates, which is significantly better than previous state-of-the-art methods. The code is available in

Orthogonal Uncertainty Representation of Data Manifold for Robust Long-Tailed Learning

  • Yanbiao Ma
  • Licheng Jiao
  • Fang Liu
  • Shuyuan Yang
  • Xu Liu
  • Lingling Li

In scenarios with long-tailed distributions, the model's ability to identify tail classes is limited due to the under-representation of tail samples. Class rebalancing, information augmentation, and other techniques have been proposed to help models learn the potential distribution of tail classes. The disadvantage is that these methods generally pursue models with balanced class accuracy on the data manifold, while ignoring the ability of the model to resist interference. By constructing a noisy data manifold, we found that the robustness of models trained on unbalanced data has a long-tail phenomenon. That is, even if the class accuracy is balanced on the data domain, the model is still biased on the noisy data manifold. However, existing methods cannot effectively mitigate this phenomenon, which makes the model vulnerable in long-tailed scenarios. In this work, we propose an Orthogonal Uncertainty Representation (OUR) of feature embedding and an end-to-end training strategy to mitigate the long-tail phenomenon of model robustness. As a general enhancement tool, OUR has excellent compatibility with other methods and does not require additional data generation, ensuring fast and efficient training. Comprehensive evaluations on long-tailed datasets show that our method significantly alleviates the long-tail phenomenon of robustness, bringing consistent performance gains to other long-tailed learning methods.

Topological Structure Learning for Weakly-Supervised Out-of-Distribution Detection

  • Rundong He
  • Rongxue Li
  • Zhongyi Han
  • Xihong Yang
  • Yilong Yin

Out-of-distribution (OOD) detection is the key to deploying models safely in the open world. For OOD detection, collecting sufficient in-distribution (ID) labeled data is usually more time-consuming and costly than unlabeled data. When ID labeled data is limited, the previous OOD detection methods are no longer superior due to their high dependence on the amount of ID labeled data. Based on limited ID labeled data and sufficient unlabeled data, we define a new setting called Weakly-Supervised Out-of-Distribution Detection (WSOOD). To solve the new problem, we propose an effective method called Topological Structure Learning (TSL). Firstly, TSL uses a contrastive learning method to build the initial topological structure space for ID and OOD data. Secondly, TSL mines effective topological connections in the initial topological space. Finally, based on limited ID labeled data and mined topological connections, TSL reconstructs the topological structure in a new topological space to increase the separability of ID and OOD instances. Extensive studies on several representative datasets show that TSL remarkably outperforms the state-of-the-art, verifying the validity and robustness of our method in the new setting of WSOOD.

Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition

  • Weikang Wang
  • Jing Liu
  • Yuting Su
  • Weizhi Nie

Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal object tube in a video according to a given text query. Current approaches address the STVG task with end-to-end frameworks while suffering from heavy computational complexity and insufficient spatio-temporal interactions. To overcome these limitations, we propose a novel Semantic-Guided Feature Decomposition based Network (SGFDN). A semantic-guided mapping operation is proposed to decompose the 3D spatio-temporal feature into 2D motions and 1D object embedding without losing much object-related semantic information. Thus, the computational complexity in computationally expensive operations such as attention mechanisms can be effectively reduced by replacing the input spatio-temporal feature with the decomposed features. Furthermore, based on this decomposition strategy, a pyramid relevance filtering based attention is proposed to capture the cross-modal interactions at multiple spatio-temporal scales. In addition, a decomposition-based grounding head is proposed to locate the queried objects with less computational complexity. Extensive experiments on two widely-used STVG datasets (VidSTG and HC-STVG) demonstrate that our method enjoys state-of-the-art performance as well as less computational complexity. The code has been available at

Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference

  • Jiale Lu
  • Lianggangxu Chen
  • Youqi Song
  • Shaohui Lin
  • Changbo Wang
  • Gaoqi He

The task of dynamic scene graph generation (DSGG) aims at constructing a set of frame-level scene graphs for a given video. It suffers from two kinds of spurious correlation problems. First, the spurious correlation between the input object pair and the predicate label is caused by the biased predicate sample distribution in the dataset. Second, the spurious correlation between contextual information and the predicate label arises from interference caused by background content in both the current frame and adjacent frames of the video sequence. To alleviate spurious correlations, our work is formulated into two sub-tasks: video-specific commonsense graph generation (VsCG) and causal inference (CI). The VsCG module aims to alleviate the first correlation by integrating prior knowledge into prediction. Information from all frames in the current video is used to enhance the commonsense graph constructed from co-occurrence patterns of all training samples. Thus, the commonsense graph is augmented with video-specific temporal dependencies. Then, a CI strategy with both intervention and counterfactual is used. The intervention component further eliminates the first correlation by forcing the model to consider all possible predicate categories fairly, while the counterfactual component resolves the second correlation by removing the bad effect from context. Comprehensive experiments on the Action Genome dataset show that the proposed method achieves state-of-the-art performance.
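
The abstract above mentions a commonsense graph constructed from co-occurrence patterns of training samples. How the authors build and enhance that graph is not specified here, so the snippet below is only a minimal sketch of one plausible starting point: counting how often each predicate co-occurs with a (subject, object) pair and normalizing the counts into a prior. The function name `build_predicate_prior` and the toy triplets are assumptions.

```python
from collections import Counter, defaultdict


def build_predicate_prior(triplets):
    """Map each (subject, object) pair to a predicate frequency distribution.

    `triplets` is an iterable of (subject, predicate, object) annotations.
    """
    counts = defaultdict(Counter)
    for subject, predicate, obj in triplets:
        counts[(subject, obj)][predicate] += 1

    prior = {}
    for pair, predicate_counts in counts.items():
        total = sum(predicate_counts.values())
        prior[pair] = {p: n / total for p, n in predicate_counts.items()}
    return prior


# Hypothetical training annotations.
annotations = [
    ("person", "holding", "cup"),
    ("person", "holding", "cup"),
    ("person", "drinking from", "cup"),
    ("person", "sitting on", "chair"),
]
prior = build_predicate_prior(annotations)
```

A prior built this way inherits the dataset's predicate imbalance, which is the bias the abstract's intervention component is described as counteracting.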

ATM: Action Temporality Modeling for Video Question Answering

  • Junwen Chen
  • Jie Zhu
  • Yu Kong

Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning, with three distinguishing features: (1) rethinking optical flow and showing that it is effective in capturing long-horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms existing approaches in terms of accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.

CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting

  • Shaoxiang Guo
  • Qing Cai
  • Lin Qi
  • Junyu Dong

Contrastive Language-Image Pre-training (CLIP) has begun to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial distribution (along the x, y, and z axes) is encoded to form pose-aware features. Subsequently, we maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods utilizing a backbone of similar scale. Code is available at:
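To make the idea of deriving text prompts from "the distribution order of hand joints" more concrete, here is a minimal sketch that sorts joints along one axis and formats the order as a sentence. The joint names, the three-joint toy pose, and the prompt template are all assumptions for illustration; the abstract does not give the actual prompt wording, and a real pose would have 21 joints.

```python
def order_prompt(joints, axis_index, axis_name):
    """Build a text prompt describing joint order along one axis.

    `joints` maps a joint name to its (x, y, z) coordinates.
    The prompt template is hypothetical, not taken from the paper.
    """
    ordered = sorted(joints, key=lambda name: joints[name][axis_index])
    return f"Along the {axis_name} axis, the joints are ordered: " + ", ".join(ordered) + "."


# Toy pose with three joints (invented values).
pose = {
    "wrist": (0.0, 0.0, 0.5),
    "thumb_tip": (-0.3, 0.4, 0.2),
    "index_tip": (0.1, 0.9, 0.1),
}

prompts = [order_prompt(pose, i, axis) for i, axis in enumerate(("x", "y", "z"))]
```

Each prompt would then be passed to a text encoder and paired with the corresponding pose-aware feature in the contrastive objective.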

A Multitask Framework for Graffiti-to-Image Translation

  • Ying Yang
  • Mulin Chen
  • Xuelong Li

Recently, image-to-image translation models have achieved great success in terms of content consistency and visual fidelity. However, in most of these tasks, the inaccuracy of sketches and the high cost of acquiring fine semantic masks limit the large-scale use of image translation models. Therefore, we propose to use graffiti, which combines the advantages of sketches and semantic masks, as model input. Graffiti reflects the general content of an image using lines and color distinctions, with some unlabeled regions. However, due to the large number of unknown areas in the graffiti, the generated results may be blurred, resulting in poor visual effects. To address these challenges, this paper proposes a multi-task framework that can predict unknown regions by learning semantic masks from graffiti, thereby improving the quality of generated real scene images. Furthermore, by introducing an edge activation module, which utilizes semantic and edge information to optimize the object boundaries of the generated images, the details of the generated images can be improved. Experiments on the Cityscapes dataset demonstrate that our multi-task framework achieves competitive performance on the graffiti-based image generation task.

Adaptive Contrastive Learning for Learning Robust Representations under Label Noise

  • Zihao Wang
  • Weichen Zhang
  • Weihong Bao
  • Fei Long
  • Chun Yuan

Deep Neural Networks suffer significant performance degradation when noisy labels corrupt latent data representations. Previous work has attempted to alleviate this problem by exploiting contrastive learning, in which pair construction is critical. However, existing methods either conduct sample-level processes and then use the resultant subset to construct pairs or directly perform pair-level selection using a fixed threshold, both leading to sub-optimal pairing and subsequent representation learning. To address this issue, we propose a novel adaptive contrastive learning method (ACL) working at the pair level to select contrastive pairs adaptively. Specifically, we consider the model's learning status to adjust the confidence threshold in a self-adaptive manner instead of fixing it. Then, to address the ineffectiveness of the thresholding method on unconfident pairs, we automatically apply an instance-specific temperature to boost the confidence of accurately predicted samples and their pairs. We further introduce temporal cross-ensembling to handle the impact of noisy labels on model predictions. As a result, diverse pairs are correctly selected for contrastive learning to induce discriminative representations robust to various types of label noise. Extensive experimental results on several standard benchmarks and real-world datasets indicate the superiority of ACL, especially in extremely noisy scenarios.
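The contrast between a fixed and a self-adaptive confidence threshold can be sketched as follows. This is only one plausible reading: the exponential-moving-average update, the use of the minimum confidence of a pair, and all names and values are assumptions, since the abstract does not state the actual rule used by ACL.

```python
def update_threshold(threshold, batch_confidences, momentum=0.9):
    """Move the threshold toward the batch's mean confidence (assumed EMA rule)."""
    batch_mean = sum(batch_confidences) / len(batch_confidences)
    return momentum * threshold + (1.0 - momentum) * batch_mean


def select_positive_pairs(predictions, threshold):
    """Keep pairs with the same predicted label whose weaker member is confident enough.

    `predictions` is a list of (predicted_label, confidence) tuples.
    """
    pairs = []
    for i in range(len(predictions)):
        for j in range(i + 1, len(predictions)):
            label_i, conf_i = predictions[i]
            label_j, conf_j = predictions[j]
            if label_i == label_j and min(conf_i, conf_j) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy batch of model predictions (invented values).
batch = [(0, 0.9), (0, 0.8), (1, 0.95), (0, 0.4)]
threshold = update_threshold(0.5, [conf for _, conf in batch])
selected = select_positive_pairs(batch, threshold)
```

With this kind of rule the threshold rises as the model grows more confident, so early training admits more pairs and later training becomes stricter.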

Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

  • Yunyi Xuan
  • Weijie Chen
  • Shicai Yang
  • Di Xie
  • Luojun Lin
  • Yueting Zhuang

Data-Free Knowledge Distillation (DFKD) has shown great potential in creating a compact student model while alleviating the dependency on real training data by synthesizing surrogate data. However, prior work has seldom been studied under distribution shifts, which may leave it vulnerable in real-world applications. Recent Vision-Language Foundation Models, e.g., CLIP, have demonstrated remarkable performance in zero-shot out-of-distribution generalization, yet consume heavy computational resources. In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts, inheriting the out-of-distribution generalization capability from the pre-trained foundation models. To avoid generalization degradation, the primary challenge of this task lies in synthesizing diverse surrogate images driven by text prompts. Since not only category concepts but also style information are encoded in text prompts, we propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles, namely Mix-Prompt, Random-Prompt, and Contrastive-Prompt. Experiments on out-of-distribution generalization datasets demonstrate the effectiveness of the proposed methods, with Contrastive-Prompt performing the best.

Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

  • Yanzhe Chen
  • Huasong Zhong
  • Xiangteng He
  • Yuxin Peng
  • Lele Cheng

In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multi-modal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

  • Dongsheng Xu
  • Wenye Zhao
  • Yi Cai
  • Qingbao Huang

Text-based image captioning is a vital but under-explored task, which aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. Firstly, current models cannot capture and generate scene text in non-Latin script languages, which severely limits the objectivity and the information completeness of generated captions. Secondly, current models tend to describe images in a monotonous and templated style, which greatly limits the diversity of the generated captions. Although the above-mentioned issues can be alleviated through carefully designed annotations, this process is undoubtedly laborious and time-consuming. To address the above issues, we propose a Zero-shot Framework for Text-based Image Captioning (Zero-TextCap). Concretely, to generate candidate sentences starting from the prompt 'Image of' and iteratively refine them to improve the quality and diversity of captions, we introduce a Hybrid-sampling masked language model (H-MLM). To read multi-lingual scene text and model the relationships between them, we introduce a robust OCR system. To ensure that the captions generated by H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module to insert OCR tokens and filter candidate sentences. Our Zero-TextCap is capable of generating captions containing multi-lingual scene text and boosting the diversity of captions. Extensive experiments demonstrate the effectiveness of our proposed Zero-TextCap. Our code is available at

Adversarial Training of Deep Neural Networks Guided by Texture and Structural Information

  • Zhaoxin Wang
  • Handing Wang
  • Cong Tian
  • Yaochu Jin

Adversarial training (AT) is one of the most effective ways for deep neural network models to resist adversarial examples. However, there is still a significant gap between robust training accuracy and testing accuracy. Although recent studies have shown that data augmentation can effectively reduce this gap, most methods heavily rely on generating large amounts of training data without considering which features are beneficial for model robustness, making them inefficient. To address the above issue, we propose a two-stage AT algorithm for image data that adopts different data augmentation strategies during the training process to improve model robustness. In the first stage, we focus on the convergence of the algorithm, which uses structure and texture information to guide AT. In the second stage, we introduce a strategy that randomly fuses the data features to generate diverse adversarial examples for AT. We compare our proposed algorithm with five state-of-the-art algorithms on three models, and it achieves the best robust accuracy under all evaluation metrics on the CIFAR10 dataset, demonstrating the superiority of our method.

TeViS: Translating Text Synopses to Video Storyboards

  • Xu Gu
  • Yuchong Sun
  • Feiyue Ni
  • Shizhe Chen
  • Xihua Wang
  • Ruihua Song
  • Boyuan Li
  • Xiang Cao

A video storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards, however, remains challenging: it not only requires cross-modal association between high-level texts and images but also demands long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis. We construct a MovieNet-TeViS dataset based on the public MovieNet dataset [17]. It contains 10K text synopses, each paired with keyframes manually selected from corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel VQ-Trans model. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. Then, it auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work. The code and data are available at:
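For readers unfamiliar with vector quantization, the sketch below shows its textbook form: a feature vector is replaced by the nearest entry in a codebook. The codebook, the feature values, and the function name are invented for illustration; how VQ-Trans learns and applies its codebook is not described in the abstract.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`.

    Uses squared Euclidean distance; a hypothetical helper, not the paper's code.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


# Toy 2-D codebook and visual feature (invented values).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
code_index, code_vector = quantize((0.9, 1.2), codebook)
```

Mapping continuous features onto a finite set of codes in this way gives a discrete vocabulary that an auto-regressive model can generate over.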

Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

  • Nan Xi
  • Jingjing Meng
  • Junsong Yuan