MM '23: Proceedings of the 31st ACM International Conference on Multimedia

SESSION: Keynote Talks

Internet of Video Things: Technical Challenges and Emerging Applications

  • Chang-Wen Chen

The worldwide flourishing of the Internet of Things (IoT) in the past decade has enabled numerous new applications through the internetworking of a wide variety of devices and sensors. In recent years, visual sensors have seen a considerable boom in IoT systems because they are capable of providing richer and more versatile information. The internetworking of large-scale visual sensors has been named the Internet of Video Things (IoVT). IoVT has an array of unique characteristics in terms of sensing, transmission, storage, and analysis, all of which are fundamentally different from those of conventional IoT. These new characteristics of IoVT are expected to impose significant challenges on existing technical infrastructures. In this keynote talk, an overview of recent advances on various fronts of IoVT will be presented, and a broad range of technological and systematic challenges will be addressed. Several emerging IoVT applications will be discussed to illustrate the great potential of IoVT in a broad range of practical scenarios.

Multimodal AI & LLMs for Peacekeeping and Emergency Response

  • Alejandro Jaimes

When an emergency event or an incident relevant to peacekeeping first occurs, getting the right information as quickly as possible is critical to saving lives. When an event is ongoing, information on what is happening can be critical in making decisions to keep people safe and take control of the particular situation unfolding. In both cases, first responders and peacekeepers have to quickly make decisions that include what resources to deploy and where. Fortunately, in most emergencies, people use social media to publicly share information. At the same time, sensor data is increasingly becoming available. But a platform to detect emergency situations and deliver the right information has to deal with ingesting thousands of noisy data points per second: sifting through and identifying relevant information, from different sources, in different formats, with varying levels of detail, in real time, so that relevant individuals and teams can be alerted at the right level and at the right time. In this talk I will describe the technical challenges in processing vast amounts of heterogeneous, noisy data in real time, highlighting the importance of interdisciplinary research and a human-centered approach to address problems in peacekeeping and emergency response. I will give specific examples of how LLMs can be deployed at scale, and discuss relevant future research directions in Multimedia.

Transition and Adaptability: The Cornerstone of Resilience in Future Networked Multimedia Systems and Beyond

  • Ralf Steinmetz

Let us define a transition as the "exchange" between two mechanisms with comparable functionality but different algorithms and implementation concepts, each of which is optimal depending on the respective conditions of the respective context. It is much more than adaptability; it does not cover just the smooth automatic control of, e.g., a MAPE loop or a control loop in charge of maximizing the quality of service of streamed media data while errors occur.

Resilience describes the ability of a system either to absorb large changes (and crises) and recover from them in the short term, or to overcome them by acquiring comparable or new basic functionality through overall system adjustments. In doing so, the system's readiness increases continuously and sustainably by learning from past changes of the context (and crises).

Just one extreme example: in a situation of severe danger due to a natural disaster, a person located in the affected area transmits to the rescue team an on-the-fly generated 360-degree panoramic point cloud of the situation. The person still has a sufficient energy supply and, for whatever reason, the communication facilities are still available. Due to energy shortages, a lot of other traffic, and some damage to the infrastructure, multimedia communication must be adjusted continuously to the environment and requirements. In an extreme situation, data is sent over high-latency, low-bandwidth satellite channels. Media might become a short textual description of the actual surroundings. Assume this happens without any manual interaction by the person sending the data. Multimedia and communication mechanisms must be exchanged; media must be "customized". Transitions happen to support the user in, e.g., such an extreme stress situation.

In the collaborative research project MAKI, as well as in emergenCITY, our center researching resilient infrastructures of digital cities that can withstand crises and disasters, we address some of these issues. However, in the coming years, not only multimedia networks but also many multimedia systems, interfaces, applications, etc. will be affected.

SESSION: Oral Session I: Understanding Multimedia Content -- Media Interpretation

Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing

  • Hao Shen
  • Zhong-Qiu Zhao
  • Yulun Zhang
  • Zhao Zhang

Multi-stage architectures have exhibited efficacy in image dehazing; they usually decompose a challenging task into multiple more tractable sub-tasks and progressively estimate latent haze-free images. Despite the remarkable progress, existing methods still suffer from the following shortcomings: (1) limited exploration of frequency domain information; (2) insufficient information interaction; (3) severe feature redundancy. To remedy these issues, we propose a novel Mutual Information-driven Triple interaction Network (MITNet) based on spatial-frequency dual domain information and a two-stage architecture. To be specific, the first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal. The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum. To facilitate the information exchange between the two stages, an Adaptive Triple Interaction Module (ATIM) is developed to simultaneously aggregate cross-domain, cross-scale, and cross-stage features, where the fused features are further used to generate content-adaptive dynamic filters that are applied to enhance the global context representation. In addition, we impose a mutual information minimization constraint on paired scale encoder and decoder features from both stages. Such an operation can effectively reduce information redundancy and enhance cross-stage feature complementarity. Extensive experiments on multiple public datasets show that our MITNet achieves superior performance with lower model complexity. The code and models are available at https://github.com/it-hao/MITNet.
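The amplitude/phase decomposition at the heart of the two stages can be pictured with a plain 2D discrete Fourier transform. The snippet below is a minimal PyTorch sketch (not the authors' implementation) of splitting an image tensor into amplitude and phase spectra and recombining them; the tensor shapes and the placeholder restoration step are assumptions.

```python
import torch

def decompose_spectrum(x: torch.Tensor):
    """Split an image tensor (B, C, H, W) into amplitude and phase spectra."""
    freq = torch.fft.fft2(x, norm="ortho")
    amplitude = torch.abs(freq)   # the quantity an amplitude-guided stage would restore
    phase = torch.angle(freq)     # the quantity a phase-guided stage would refine
    return amplitude, phase

def compose_spectrum(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Rebuild the spatial image from (possibly restored) amplitude and phase."""
    freq = torch.polar(amplitude, phase)  # amplitude * exp(i * phase)
    return torch.fft.ifft2(freq, norm="ortho").real

# Toy example: keep the original phase, swap in a (placeholder) restored amplitude.
hazy = torch.rand(1, 3, 64, 64)
amp, pha = decompose_spectrum(hazy)
restored_amp = amp  # stand-in for the output of an amplitude-restoration network
reconstructed = compose_spectrum(restored_amp, pha)
```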

Suspected Objects Matter: Rethinking Model's Prediction for One-stage Visual Grounding

  • Yang Jiao
  • Zequn Jie
  • Jingjing Chen
  • Lin Ma
  • Yu-Gang Jiang

Recently, one-stage visual grounders have attracted considerable attention due to their comparable accuracy but significantly higher efficiency than two-stage grounders. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, does not necessarily need to be performed among all objects, as only a subset of them are related to the text query and may confuse the model. We call these objects "suspected objects". However, exploring their relationships in the one-stage paradigm is non-trivial because: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) suspected objects are more confusing than others, as they may share similar semantics, be entangled with certain relationships, etc., and thereby more easily mislead the model's prediction. To this end, we propose a Suspected Object Transformation mechanism (SOT), which can be seamlessly integrated into existing CNN- and Transformer-based one-stage visual grounders to encourage target object selection among the suspected ones. Suspected objects are dynamically discovered from a learned activation map adapted to the model's current discrimination ability during training. Afterward, on top of the suspected objects, a Keyword-Aware Discrimination module (KAD) and an Exploration by Random Connection strategy (ERC) are concurrently proposed to help the model rethink its initial prediction. On the one hand, KAD leverages keywords that contribute most to suspected object discrimination. On the other hand, ERC allows the model to seek the correct object instead of being trapped in a situation that always exploits the current false prediction. Extensive experiments demonstrate the effectiveness of our proposed method.

Self-Relational Graph Convolution Network for Skeleton-Based Action Recognition

  • Sophyani Banaamwini Yussif
  • Ning Xie
  • Yang Yang
  • Heng Tao Shen

Using a graph convolution network (GCN) for constructing and aggregating node features has been helpful for skeleton-based action recognition. The strength of the nodes' relations in an action sequence distinguishes it from other actions. This work proposes a novel spatial module called Multi-scale self-relational graph convolution (MS-SRGC) for dynamically modeling the joint relations of action instances. Modeling the joints' relations is crucial in determining the spatial distinctiveness between skeleton sequences; hence MS-SRGC shows effectiveness for activity recognition. We also propose a Hybrid multi-scale temporal convolution network (HMS-TCN) that captures different ranges of time steps along the temporal dimension of the skeleton sequence. In addition, we propose a Spatio-temporal blackout (STB) module that randomly zeroes out some continuous frames for selected strategic joint groups. We sequentially stack our spatial (MS-SRGC) and temporal (HMS-TCN) modules to form a Self-relational graph convolution network (SR-GCN) block, which we use to construct our SR-GCN model. We append the STB module on top of the SR-GCN model for the randomized operation. Given the effectiveness of ensemble networks, we perform extensive experiments on single and multiple ensembles. Our results surpass the state-of-the-art methods on the NTU RGB-D, NTU RGB-D 120, and Northwestern-UCLA datasets.
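For background on the spatial aggregation step, the sketch below shows a generic adjacency-normalized graph convolution over skeleton joints in PyTorch. It is an illustrative baseline layer under assumed tensor shapes, not the MS-SRGC module itself (which additionally models multi-scale self-relations).

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Plain spatial graph convolution X' = A_hat X W over skeleton joints."""

    def __init__(self, in_channels: int, out_channels: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalize the adjacency matrix (with self-loops added).
        a = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = a.sum(dim=1).pow(-0.5).diag()
        self.register_buffer("a_hat", d_inv_sqrt @ a @ d_inv_sqrt)
        self.linear = nn.Linear(in_channels, out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, channels); aggregate features over neighboring joints.
        return self.linear(torch.einsum("vu,btuc->btvc", self.a_hat, x))

# Example on a toy 3-joint skeleton sequence.
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
layer = GraphConv(in_channels=3, out_channels=16, adjacency=adj)
out = layer(torch.rand(2, 30, 3, 3))  # (batch=2, frames=30, joints=3, xyz)
```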

Exploring Correlations in Degraded Spatial Identity Features for Blind Face Restoration

  • Qian Ning
  • Fangfang Wu
  • Weisheng Dong
  • Xin Li
  • Guangming Shi

Blind face restoration aims to recover high-quality face images from low-quality ones with complex and unknown degradation. Existing approaches have achieved promising performance by leveraging pre-trained dictionaries or generative priors. However, these methods may fail to exploit the full potential of degraded inputs and facial identity features due to complex degradation. To address this issue, we propose a novel method that explores the correlation of degraded spatial identity features by learning a general representation using a memory network. Specifically, our approach enhances degraded features with richer identity information by leveraging similar facial features retrieved from the memory network. We also propose a fusion approach that fuses memorized spatial features with GAN prior features via affine transformation and blending fusion to improve fidelity and realism. Additionally, the memory network is updated online in an unsupervised manner along with the other modules, which obviates the requirement for pre-training. Experimental results on synthetic and popular real-world datasets demonstrate the effectiveness of our proposed method, which achieves at least comparable and often better performance than other state-of-the-art approaches.

Video-based Visible-Infrared Person Re-Identification via Style Disturbance Defense and Dual Interaction

  • Chuhao Zhou
  • Jinxing Li
  • Huafeng Li
  • Guangming Lu
  • Yong Xu
  • Min Zhang

Video-based visible-infrared person re-identification (VVI-ReID) aims to retrieve video sequences of the same pedestrian from different modalities. The key to VVI-ReID is to learn discriminative sequence-level representations that are invariant to both intra- and inter-modal discrepancies. However, most works only focus on eliminating the modality gap while ignoring the distractors within each modality. Moreover, existing sequence-level representation learning approaches are limited to a single video, failing to mine the correlations among multiple videos of the same pedestrian. In this paper, we propose a Style Augmentation, Attack and Defense network with Graph-based dual interaction (SAADG) to guarantee semantic consistency against both intra-modal discrepancies and the inter-modal gap. Specifically, we first generate diverse styles for video frames by random style variation in image space. Through the subsequent style attack and defense, the intra- and inter-modal discrepancies are modeled as different types of style disturbance (attack), and our model learns to keep the ID-related content invariant under such attacks. Besides, a graph-based dual interaction module is further introduced to fully explore the cross-view and cross-modal correlations among various videos of the same identity, which are then transferred to the sequence-level representations. Extensive experiments on the public SYSU-MM01 and HITSZ-VCM datasets show that our approach achieves remarkable performance compared with the state of the art. The code is available at https://github.com/ChuhaoZhou99/SAADG_VVIReID.

PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search

  • Wenmiao Hu
  • Yichen Zhang
  • Yuxuan Liang
  • Xianjing Han
  • Yifang Yin
  • Hannes Kruppa
  • See-Kiong Ng
  • Roger Zimmermann

Satellite-based street-view information extraction by cross-view matching refers to a task that extracts the location and orientation information of a given street-view image query by using one or multiple geo-referenced satellite images. Recent work has initiated a new research direction to find accurate information within a local area covered by one satellite image centered at a location prior (e.g., from GPS). It can be used as a standalone solution or as a complementary step following a large-scale search with multiple satellite candidates. However, these existing works require an accurate initial orientation (angle) prior (e.g., from IMU) and/or do not efficiently search through all possible poses. To allow efficient search and to give accurate predictions regardless of the existence or the accuracy of the angle prior, we present PetalView extractors with multi-scale search. The PetalView extractors give semantically meaningful features that are equivalent across two drastically different views, and the multi-scale search strategy efficiently inspects the satellite image from coarse to fine granularity to provide sub-meter and sub-degree precision extraction. Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on the KITTI dataset test 1 set, with the recall within 1 meter (r@1m) for location estimation reaching 68.88% and the recall within 1 degree (r@1d) reaching 21.10% when no angle prior is available; with an angle prior, it achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level.

Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos

  • Haorui Wang
  • Yibo Hu
  • Yangfu Zhu
  • Jinsheng Qi
  • Bin Wu

Social Relation Recognition is an important part of Video Understanding, providing insights into the information that videos convey. Most previous works have mainly focused on graph generation for characters rather than on edges, which are more suitable for relation modelling. Furthermore, previous methods tend to recognize social relations for single frames or short video clips within their receptive fields, neglecting the importance of continuous reasoning throughout the entire video. To tackle these challenges, we propose a novel Shifted GCN-GAT and Cumulative-Transformer framework, named SGCAT-CT. The overall architecture consists of an SGCAT module for shifted graph operations on novel relation graphs and a CT module for temporal processing with memory. SGCAT-CT conducts continuous recognition of social relations and memorizes information from as early as the beginning of a long video. Experiments conducted on several video datasets demonstrate encouraging performance on long videos. Our code will be released at https://github.com/HarryWgCN/SGCAT-CT.

Causal Intervention for Sparse-View Gait Recognition

  • Jilong Wang
  • Saihui Hou
  • Yan Huang
  • Chunshui Cao
  • Xu Liu
  • Yongzhen Huang
  • Liang Wang

Gait recognition aims at identifying individuals by their unique walking patterns at a long distance. However, prevailing methods suffer from a large performance degradation when applied to large-scale surveillance systems. We find that a significant cause of this issue is that previous methods heavily rely on full-view person annotations to reduce view differences by pulling the anchor closer to positive samples from different viewpoints. However, subjects in in-the-wild scenarios usually have only a limited number of sequences from different viewpoints. As a result, the available viewpoints of each subject are sparse compared to the whole dataset, and simply minimizing intra-identity differences cannot effectively reduce the view differences across the whole dataset. In this work, we formulate this overlooked problem as Sparse-View Gait Recognition and provide a comprehensive analysis of it via a Structural Causal Model of the causalities among latent features, view distribution, and labels. Based on our analysis, we propose a simple yet effective method that enables networks to learn a more robust representation across different views. Specifically, our method consists of two parts: 1) an effective metric learning algorithmic implementation based on the backdoor adjustment, which improves the consistency of representations among different views; 2) an unsupervised view clustering algorithm to discover and identify the most influential view contexts. We evaluate the effectiveness of our method on the popular GREW, Gait3D, CASIA-B, and OU-MVLP datasets, showing that our method consistently outperforms baselines and achieves state-of-the-art performance. The code will be available at https://github.com/wj1tr0y/GaitCSV.
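For reference, the backdoor adjustment underlying the metric learning component has the standard form below, written under the assumption that the view context v acts as the confounder being adjusted for (a generic statement of the adjustment, not the paper's exact notation):

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{v} P\bigl(Y \mid X, v\bigr)\, P(v)
```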

MM-AU: Towards Multimodal Understanding of Advertisement Videos

  • Digbalay Bose
  • Rajat Hebbar
  • Tiantian Feng
  • Krishna Somandepalli
  • Anfeng Xu
  • Shrikanth Narayanan

Advertisement videos (ads) play an integral part in the domain of Internet e-commerce, as they amplify the reach of particular products to a broad audience or can serve as a medium to raise awareness about specific issues through concise narrative structures. The narrative structures of advertisements involve several elements, such as reasoning about the broad content (topic and the underlying message) and examining fine-grained details involving the transition of perceived tone due to the sequence of events and interaction among characters. In this work, to facilitate the understanding of advertisements along the three dimensions of topic categorization, perceived tone transition, and social message detection, we introduce a multimodal multilingual benchmark called MM-AU, comprising 8.4K videos (147 hours) curated from multiple web-based sources. We explore multiple zero-shot reasoning baselines through the application of large language models to the ad transcripts. Further, we demonstrate that leveraging signals from multiple modalities, including audio, video, and text, in multimodal transformer-based supervised models leads to improved performance compared to unimodal approaches.

UER: A Heuristic Bias Addressing Approach for Online Continual Learning

  • Huiwei Lin
  • Shanshan Feng
  • Baoquan Zhang
  • Hongliang Qiao
  • Xutao Li
  • Yunming Ye

Online continual learning aims to continuously train neural networks from a continuous data stream with a single pass through the data. As the most effective approach, rehearsal-based methods replay part of the previous data. The predictors commonly used in existing methods tend to generate dot-product logits that are biased toward the classes of the current data, which is known as the bias issue and manifests as forgetting. Many approaches have been proposed to overcome the forgetting problem by correcting this bias; however, they still need to be improved in the online setting. In this paper, we address the bias issue with a more straightforward and more efficient method. By decomposing the dot-product logits into an angle factor and a norm factor, we empirically find that the bias problem mainly occurs in the angle factor, which can be used to learn novel knowledge as cosine logits. On the contrary, the norm factor, abandoned by existing methods, helps remember historical knowledge. Based on this observation, we propose to leverage the norm factor to balance the new and old knowledge for addressing the bias. To this end, we develop a heuristic approach called unbias experience replay (UER). UER learns current samples only by the angle factor and further replays previous samples by both the norm and angle factors. Extensive experiments on three datasets show that UER achieves superior performance over various state-of-the-art methods. The code is available at https://github.com/FelixHuiweiLin/UER.
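The decomposition referred to above is just the identity that a dot-product logit factorizes into a norm term and a cosine (angle) term. The PyTorch sketch below illustrates this identity under assumed shapes; it is not the UER training code.

```python
import torch
import torch.nn.functional as F

def decompose_logits(features: torch.Tensor, weights: torch.Tensor):
    """Split dot-product logits W·x into a norm factor and an angle (cosine) factor.

    features: (batch, dim) sample embeddings
    weights:  (classes, dim) classifier weight vectors
    """
    dot_logits = features @ weights.t()                               # standard logits
    norm_factor = features.norm(dim=1, keepdim=True) * weights.norm(dim=1)
    cos_logits = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()
    # dot product = ||x|| * ||w|| * cos(theta)
    assert torch.allclose(dot_logits, norm_factor * cos_logits, atol=1e-5)
    return norm_factor, cos_logits

norms, cosines = decompose_logits(torch.randn(8, 128), torch.randn(10, 128))
```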

Clip Fusion with Bi-level Optimization for Human Mesh Reconstruction from Monocular Videos

  • Peng Wu
  • Xiankai Lu
  • Jianbing Shen
  • Yilong Yin

Human mesh reconstruction (HMR) from monocular video is a key step for many mixed reality and robotic applications. Although existing methods show promising results by capturing frames' temporal information, they predict the human mesh with implicit temporal learning modules in a sequence-to-frame manner. To mine more temporal information from the video, we present a bi-level clip inference network for HMR, which leverages both local motion and global context explicitly for dense 3D reconstruction. Specifically, we propose a novel bi-level temporal fusion strategy that takes both neighboring and long-range relations into consideration. In addition, different from the traditional frame-wise operation, we investigate an alternative perspective by treating video-based HMR as clip-wise inference. We evaluate the proposed method on multiple datasets (3DPW, Human3.6M, and MPI-INF-3DHP) quantitatively and qualitatively, demonstrating a significant improvement over existing methods (in terms of PA-MPJPE, ACC-Error, etc.). Furthermore, we extend the proposed method to the more challenging multiple-shot HMR task to demonstrate its generalizability. Some visual demos can be seen at https://github.com/bicf0/bicf_demo.

Parsing is All You Need for Accurate Gait Recognition in the Wild

  • Jinkai Zheng
  • Xinchen Liu
  • Shuai Wang
  • Lihao Wang
  • Chenggang Yan
  • Wu Liu

Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades since they are easy to extract from video frames. Despite their success in gait recognition for in-the-lab environments, they usually fail in real-world scenarios due to their low information entropy for gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the GPS representation, we propose a novel human parsing-based gait recognition framework, named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two lightweight heads. The first head extracts global semantic features from GPSs, while the other learns the mutual information of part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. Specifically, ParsingGait achieves a 17.5% Rank-1 increase compared with the state-of-the-art silhouette-based method. In addition, by replacing silhouettes with GPSs, current gait recognition methods achieve about 12.5%~19.2% improvements in Rank-1 accuracy. The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait.

Multi-Scale Similarity Aggregation for Dynamic Metric Learning

  • Dingyi Zhang
  • Yingming Li
  • Zhongfei Zhang

In this paper, we propose a new multi-scale similarity aggregation method (MSA) for dynamic metric learning (DyML), which adopts a pretraining-finetuning scheme and efficiently learns the similarity relationship for each semantic level. In particular, building upon a self-supervised pretraining framework, the output embedding layer is divided into three learners to learn the similarity relations of each level individually. For training these learners, the hierarchical prior information is fully considered. Specifically, in light of the class hierarchy in which each class at a coarse level corresponds to a set of subclasses at a finer level, multi-proxy learning is employed to facilitate the single-level similarity learning of each learner. On the other hand, following the hierarchical consistency property, a cross-level similarity constraint is further presented to encourage the estimated similarities of the three learners to be hierarchically consistent. Extensive experiments on three DyML datasets show that MSA significantly outperforms the existing state-of-the-art methods and allows for better generalization across different semantic scales.

RefineTAD: Learning Proposal-free Refinement for Temporal Action Detection

  • Yue Feng
  • Zhengye Zhang
  • Rong Quan
  • Limin Wang
  • Jie Qin

Temporal action detection (TAD) aims to localize the start and end frames of actions in untrimmed videos, which is a challenging task due to the similarity of adjacent frames and the ambiguity of action boundaries. Previous methods often generate coarse proposals first and then perform proposal-based refinement, which is coupled with prior action detectors and leads to proposal-oriented offsets. However, this paradigm increases the training difficulty of the TAD model and is heavily influenced by the quantity and quality of the proposals. To address the above issues, we decouple the refinement process from conventional TAD methods and propose a learnable, proposal-free refinement method for fine boundary localization, named RefineTAD. We first propose a multi-level refinement module to generate multi-scale boundary offsets, score offsets and boundary-aware probability at each time point based on the feature pyramid. Then, we propose an offset focusing strategy to progressively refine the predicted results of TAD models in a coarse-to-fine manner with our multi-scale offsets. We perform extensive experiments on three challenging datasets and demonstrate that our RefineTAD significantly improves the state-of-the-art TAD methods with minimal computational overhead.

Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

  • Zhenguang Liu
  • Xinyang Yu
  • Ruili Wang
  • Shuai Ye
  • Zhe Ma
  • Jianfeng Dong
  • Sifeng He
  • Feng Qian
  • Xiaobo Zhang
  • Roger Zimmermann
  • Lei Yang

The self-media era provides us with a tremendous amount of high-quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features.

In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle the original high-dimensional feature into multiple exclusive lower-dimensional sub-features. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature.

Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state of the art on the VCSL benchmark dataset. Our code and model have been released at https://github.com/yyyooooo/DMI/, in the hope of contributing to the community.

Pseudo Object Replay and Mining for Incremental Object Detection

  • Dongbao Yang
  • Yu Zhou
  • Xiaopeng Hong
  • Aoting Zhang
  • Xin Wei
  • Linchengxi Zeng
  • Zhi Qiao
  • Weiping Wang

Incremental object detection (IOD) aims to mitigate catastrophic forgetting for object detectors when incrementally learning to detect newly emerging object classes without using the original training data. Most existing IOD methods benefit from the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the new training data. However, in practical scenarios, old-class objects may be absent, which is called non-co-occurrence IOD. In this paper, we propose a pseudo object replay and mining method (PseudoRM) to handle the co-occurrence dependency problem, reducing the performance degradation caused by the absence of old-class objects. In the pseudo object replay stage, the new training data can be augmented with co-occurring fake (old-class) and real (new-class) objects through a patch-level data-free generation method. To fully use the existing training data, we propose pseudo object mining to explore false positives for transferring useful instance-level knowledge. In the incremental learning procedure, a generative distillation is introduced to distill image-level knowledge for balancing stability and plasticity. Experimental results on PASCAL VOC and COCO demonstrate that PseudoRM can effectively boost the performance in both co-occurrence and non-co-occurrence scenarios without using old samples or extra wild data.

Informative Classes Matter: Towards Unsupervised Domain Adaptive Nighttime Semantic Segmentation

  • Shiqin Wang
  • Xin Xu
  • Xianzheng Ma
  • Kui Jiang
  • Zheng Wang

Unsupervised Domain Adaptive Nighttime Semantic Segmentation (UDA-NSS) aims to adapt a robust model from a labeled daytime domain to an unlabeled nighttime domain. However, current advanced segmentation methods ignore the illumination effect and class discrepancies of different semantic classes during domain adaptation, showing an uneven prediction phenomenon. The issue of ''hard-to-adapt'' classes, where some classes exhibit a large performance gap between existing UDA-NSS methods and their supervised learning counterparts while others show only a small gap, has been completely ignored and underexplored. To enable more sufficient learning of ''hard-to-adapt'' classes and facilitate the UDA-NSS task, we present an Online Informative Class Sampling (OICS) strategy to adaptively mine informative classes from the target nighttime domain according to the corresponding spectrogram mean and the class frequency via our Informative Mixture of Experts. Furthermore, an Informativeness-based cross-domain Mixed Sampling (InforMS) framework is designed to focus on informative classes from the target nighttime domain by assigning them higher sampling probabilities during cross-domain mixed sampling, achieving better performance on UDA-NSS tasks. Consequently, our method outperforms state-of-the-art UDA-NSS methods by large margins on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving). Notably, our method achieves state-of-the-art performance with 65.1% mIoU on ACDC-night-test and 55.4% mIoU on ACDC-night-val.

View while Moving: Efficient Video Recognition in Long-untrimmed Videos

  • Ye Tian
  • Mengyu Yang
  • Lanshan Zhang
  • Zhizhen Zhang
  • Yang Liu
  • Xiaohui Xie
  • Xirong Que
  • Wendong Wang

Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm involves two visits to the raw frames, from coarse-grained to fine-grained, during inference (which cannot be parallelized), and the captured spatiotemporal features cannot be reused in the second stage (due to varying granularity), which is unfriendly to efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm only needs to access the raw frames once. The two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, showing great performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about the unit-level and video-level temporal semantics in long-untrimmed videos, respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

  • Yimin Deng
  • Huaizhen Tang
  • Xulong Zhang
  • Jianzong Wang
  • Ning Cheng
  • Jing Xiao

Voice conversion, the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, a lot of research has been devoted to better implementations of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker but also expressive information such as prosody, pace, pauses, etc. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. The experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

  • Gege Shi
  • Xueyang Fu
  • Chengzhi Cao
  • Zheng-Jun Zha

Recognizing activities with Unmanned Aerial Vehicles (UAVs) is essential for many applications, yet existing video recognition methods are mainly designed for ground cameras and do not account for UAVs' changing attitudes and fast motion. This creates spatial misalignment of small objects between frames, leading to inaccurate visual movement in drone videos. Additionally, camera motion relative to objects in the video causes relative movements that visually affect object motion and can result in misunderstandings of video content. To address these issues, we present a novel framework named Attentional Spatial and Adaptive Temporal Relations Modeling. First, to mitigate the spatial misalignment of small objects between frames, we design an Attentional Patch-level Spatial Enrichment (APSE) module that models dependencies among patches and enhances patch-level features. Then, we propose a Multi-scale Temporal and Spatial Mixer (MTSM) module that is capable of adapting to disturbances caused by UAV flight and modeling various temporal clues. By integrating APSE and MTSM into a single model, our network can effectively and accurately capture spatiotemporal relations in UAV videos. Extensive experiments on several benchmarks demonstrate the superiority of our method over state-of-the-art approaches. For instance, our network achieves a classification accuracy of 68.1%, an absolute gain of 1.3% over FuTH-Net, on the ERA dataset.

Learning Causality-inspired Representation Consistency for Video Anomaly Detection

  • Yang Liu
  • Zhaoyang Xia
  • Mengyang Zhao
  • Donglai Wei
  • Yuzheng Wang
  • Siao Liu
  • Bobo Ju
  • Gaoyun Fang
  • Jing Liu
  • Liang Song

Video anomaly detection is an essential yet challenging task in the multimedia community, with promising applications in smart cities and secure communities. Existing methods attempt to learn abstract representations of regular events with statistical dependence to model the endogenous normality, and discriminate anomalies by measuring deviations from the learned distribution. However, conventional representation learning yields only a crude description of video normality and lacks an exploration of its underlying causality. The learned statistical dependence is unreliable for diverse regular events in the real world and may cause high false alarm rates due to overgeneralization. Inspired by causal representation learning, we hypothesize that there exists a causal variable capable of adequately representing the general patterns of regular events, with respect to which anomalies will present significant variations. Therefore, we design a causality-inspired representation consistency (CRC) framework to implicitly learn the unobservable causal variables of normality directly from available normal videos and detect abnormal events with the learned representation consistency. Extensive experiments show that the causality-inspired normality is robust to regular events with label-independent shifts, and the proposed CRC framework can quickly and accurately detect various complicated anomalies from real-world surveillance videos.

M2ATS: A Real-world Multimodal Air Traffic Situation Benchmark Dataset and Beyond

  • Dongyue Guo
  • Yi Lin
  • Xuehang You
  • Zhongping Yang
  • Jizhe Zhou
  • Bo Yang
  • Jianwei Zhang
  • Han Shi
  • Shasha Hu
  • Zheng Zhang

Air Traffic Control (ATC) is a complicated, time-evolving, and real-time procedure for directing flight operations in a safe and ordered manner. Although enormous amounts of data have been stored from air traffic operations over the past 40 years, data-driven intelligent applications in aviation are still emerging due to safety-critical concerns. With the prevalence of the next-generation ATC system, artificial intelligence (AI)-empowered research topics are attracting increasing attention from both industrial and academic domains, and a high-quality dataset naturally becomes the prerequisite for such practices. However, almost all ATC-related datasets are unimodal and built for specific tasks, which fails to comprehensively illustrate the traffic situation and support real-world studies. To address this gap, a multimodal air traffic situation (M2ATS) dataset is constructed to advance AI-related research in the ATC domain, including airspace information, flight plans, trajectories, and speech. M2ATS covers ATC situation data for 10,362 flights, involving 110,000+ utterances (104 hours) with diverse gold text annotations, 16 intents, and 51 slots. Considering real-world ATC requirements, a total of 10 multimedia-related tasks (24 baselines) are designed to validate the proposed dataset, covering automatic speech recognition, natural language processing, and spatial-temporal data processing. New ATC-related metrics corresponding to ATC applications are proposed in addition to the common metrics to evaluate task performance. Extensive experimental results demonstrate that the selected baselines can accomplish the designed tasks on this new dataset, and further investigation is required to address task and data specificities. We believe that the proposed dataset represents a new practice for advancing AI applications in an industrial scene, which not only promotes ATC-related applications but also provides diverse research topics for the broader multimedia community.

Federated Learning with Label-Masking Distillation

  • Jianghu Lu
  • Shikun Li
  • Kexin Bao
  • Pengju Wang
  • Zhenxing Qian
  • Shiming Ge

Federated learning provides a privacy-preserving manner to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where, due to the different user behaviors of the clients, label distributions between different clients are significantly different. When faced with such cases, most existing methods lead to suboptimal optimization due to the inadequate utilization of label distribution information in clients. Inspired by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning via perceiving the various label distributions of each client. We classify the labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The distillation process masks out the predictions of majority labels from the global model, so that it can focus more on preserving the minority label knowledge of the client. A series of experiments show that the proposed approach can achieve state-of-the-art performance in various cases. Moreover, considering the limited resources of the clients, we propose a variant, FedLMD-Tf, that does not require an additional teacher, which outperforms previous lightweight approaches without increasing computational costs. Our code is available at https://github.com/wnma3mz/FedLMD.
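A minimal way to picture the masked distillation step is a KL-based distillation in which the global model's predictions for the locally majority classes are masked out. The sketch below makes that concrete under assumed shapes; the helper name and the way majority labels are supplied are illustrative, not taken from the FedLMD code.

```python
import torch
import torch.nn.functional as F

def label_masking_distillation(student_logits, teacher_logits, majority_labels, temperature=2.0):
    """KL distillation where the global (teacher) predictions for the locally
    majority classes are masked out, so the client (student) focuses on
    preserving minority-class knowledge from the global model."""
    mask = torch.ones(student_logits.size(1), dtype=torch.bool)
    mask[majority_labels] = False                        # drop majority classes
    s = F.log_softmax(student_logits[:, mask] / temperature, dim=1)
    t = F.softmax(teacher_logits[:, mask] / temperature, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

loss = label_masking_distillation(
    torch.randn(4, 10), torch.randn(4, 10), majority_labels=torch.tensor([0, 1, 2])
)
```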

Painterly Image Harmonization using Diffusion Model

  • Lingxiao Lu
  • Jiangtong Li
  • Junyan Cao
  • Li Niu
  • Liqing Zhang

Painterly image harmonization aims to insert photographic objects into paintings and obtain artistically coherent composite images. Previous methods for this task mainly rely on inference optimization or generative adversarial networks, but they are either very time-consuming or struggle with fine control of the foreground objects (e.g., texture and content details). To address these issues, we propose a novel Painterly Harmonization stable Diffusion model (PHDiffusion), which includes a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module. Specifically, the adaptive encoder and the DEF module first stylize foreground features within each encoder. Then, the stylized foreground features from both encoders are combined to guide the harmonization process. During training, besides the noise loss in the diffusion model, we additionally employ a content loss and two style losses, i.e., an AdaIN style loss and a contrastive style loss, aiming to balance the trade-off between style migration and content preservation. Compared with state-of-the-art models from related fields, our PHDiffusion can stylize the foreground more sufficiently while retaining finer content. Our code and model are available at https://github.com/bcmi/PHDiffusion-Painterly-Image-Harmonization
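For context, the AdaIN operation underlying the AdaIN style loss mentioned above aligns channel-wise statistics of content features with those of style features. The snippet below is a generic PyTorch sketch of AdaIN and a statistics-matching style loss, not the PHDiffusion implementation.

```python
import torch

def channel_stats(feat: torch.Tensor, eps: float = 1e-5):
    """Per-channel mean and std of a (B, C, H, W) feature map."""
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
    return mean, std

def adain(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """Re-normalize content features to carry the style features' statistics."""
    c_mean, c_std = channel_stats(content)
    s_mean, s_std = channel_stats(style)
    return (content - c_mean) / c_std * s_std + s_mean

def adain_style_loss(output: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """Match channel-wise mean/std of the output to those of the style reference."""
    o_mean, o_std = channel_stats(output)
    s_mean, s_std = channel_stats(style)
    return (o_mean - s_mean).pow(2).mean() + (o_std - s_std).pow(2).mean()

stylized = adain(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```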

Exploring Hyperspectral Histopathology Image Segmentation from a Deformable Perspective

  • Xingran Xie
  • Ting Jin
  • Boxiang Yun
  • Qingli Li
  • Yan Wang

Hyperspectral images (HSIs) offer great potential for computational pathology. However, limited by spectral redundancy and the lack of a spectral prior in popular 2D networks, previous HSI-based techniques do not perform well. To address these problems, we propose to segment HSIs from a deformable perspective, which processes different spectral bands independently and fuses spatio-spectral features of interest via deformable attention mechanisms. In addition, we propose Deformable Self-Supervised Spectral Regression (DF-S3R), which introduces two self-supervised pre-text tasks based on the low-rank prior of HSIs, enabling the network to learn spectrum-related features. During pre-training, DF-S3R learns both spectral structures and spatial morphology, and the jointly pre-trained architectures help alleviate the transfer risk in downstream fine-tuning. Compared to previous works, experiments show that our deformable architecture and pre-training method perform much better than other competitive methods on pathological semantic segmentation tasks, and the visualizations indicate that our method can trace critical spectral characteristics from subtle spectral disparities. Code will be released at https://github.com/Ayakax/DFS3R.

Uncertainty-Aware Variate Decomposition for Self-supervised Blind Image Deblurring

  • Runhua Jiang
  • Yahong Han

Blind image deblurring remains challenging due to the ill-posed nature of the traditional blurring function. Although previous supervised methods have achieved great breakthroughs with synthetic blurry-sharp image pairs, their generalization to real-world blurs is limited by the discrepancy between synthetic and real blurs. To overcome this limitation, unsupervised deblurring methods have been proposed that use natural priors or generative adversarial networks. However, natural priors are vulnerable to random blur artifacts, while the generators of generative adversarial networks often produce inaccurate details and unrealistic colors. Consequently, previous methods easily suffer from slow convergence and poor performance. In this work, we propose to formulate the traditional blurring function as the composition of multiple variates, which allows us to explicitly define the characteristics of residual images between blurry and sharp images. We also propose a multi-step self-supervised deblurring framework to address the slow convergence issue. Our framework continuously decomposes and composes input images, utilizing the uncertainty of blur artifacts to obtain diverse pseudo blurry-sharp image pairs for self-supervised learning. This framework is more efficient than previous methods, as it does not rely on natural priors or GANs. Extensive comparisons demonstrate that the proposed framework outperforms state-of-the-art unsupervised methods on dynamic-scene, human-centric motion, real-world, and out-of-focus deblurring datasets. The codes are available at https://github.com/ddghjikle/MM-2023-USDF.

SESSION: Oral Session II: Understanding Multimedia Content -- Multimodal Fusion and Embedding

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

  • Chao Sun
  • Min Chen
  • Jialiang Cheng
  • Han Liang
  • Chuanbo Zhu
  • Jincai Chen

Audio and vision are important senses for high-level cognition, and their especially strong correlation makes audio-visual coding a crucial factor in many multimodal tasks. However, there are two challenges in audio-visual coding. First, the heterogeneity of multimodal data often leads to misalignment of cross-modal features of the same sample, which reduces their representation quality. Second, most self-supervised learning frameworks are constructed based on instance semantics, and the generated pseudo labels introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which leverage multimodal complementary and hidden information for better representations. Additionally, we introduce a supervised cross-modal contrastive loss to minimize the distance between audio and vision features of the same instance, and use weak labels of the multimodal data to eliminate feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms state-of-the-art methods, even with limited computational resources.
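The supervised cross-modal contrastive objective can be sketched roughly as follows: audio and visual embeddings sharing a (weak) label are treated as positive pairs across modalities and pulled together. The temperature, normalization, and label handling below are assumptions for illustration, not the SCLAV code.

```python
import torch
import torch.nn.functional as F

def supervised_cross_modal_contrastive(audio, visual, labels, temperature=0.1):
    """Pull audio/visual embeddings of the same label together across modalities.

    audio, visual: (batch, dim) embeddings; labels: (batch,) weak class labels.
    """
    a = F.normalize(audio, dim=1)
    v = F.normalize(visual, dim=1)
    sim = a @ v.t() / temperature                        # (batch, batch) cross-modal similarities
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-likelihood of all positive cross-modal pairs for each audio anchor.
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

loss = supervised_cross_modal_contrastive(
    torch.randn(8, 256), torch.randn(8, 256), labels=torch.randint(0, 4, (8,))
)
```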

Cross-Modal and Multi-Attribute Face Recognition: A Benchmark

  • Feng Lin
  • Kaiqiang Fu
  • Hao Luo
  • Ziyue Zhan
  • Zhibo Wang
  • Zhenguang Liu
  • Lorenzo Cavallaro
  • Kui Ren

Face recognition has made significant advances with the development of deep learning and has begun to be deployed in some unrestricted scenarios. Many smartphones, for example, have infrared sensors that allow them to capture clear images even in low-light conditions. Face authentication under complex environmental conditions can thus be accomplished by matching NIR-VIS face images across modalities. However, existing NIR-VIS datasets lack sufficient variation in face attributes and are insufficient for real-world scenarios. To address the aforementioned issues, we first propose a 300-person NIR-VIS cross-modality face dataset with a variety of attributes. Based on modal information removal, we propose a NIR-VIS cross-modal face recognition model. We effectively extract modal information by constraining the similarity distribution of the modalities and then use an orthogonal loss to remove modal information from identity features. The method achieves excellent results on our dataset and the CASIA NIR-VIS 2.0 dataset.
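One common way to realize the "remove modal information from identity features" idea is an orthogonality penalty between the two feature branches. The sketch below shows such a generic penalty, driving the absolute cosine similarity between identity and modality features toward zero; it is an assumed illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(identity_feat: torch.Tensor, modal_feat: torch.Tensor) -> torch.Tensor:
    """Penalize correlation between identity and modality features.

    Both inputs are (batch, dim); driving their cosine similarity to zero
    encourages identity features to carry no modality information.
    """
    cos = F.cosine_similarity(identity_feat, modal_feat, dim=1)
    return cos.abs().mean()

penalty = orthogonal_loss(torch.randn(16, 512), torch.randn(16, 512))
```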

A Closer Look at Classifier in Adversarial Domain Generalization

  • Ye Wang
  • Junyang Chen
  • Mengzhu Wang
  • Hao Li
  • Wei Wang
  • Houcheng Su
  • Zhihui Lai
  • Wei Wang
  • Zhenghan Chen

The task of domain generalization is to learn a classification model from multiple source domains and generalize it to unknown target domains. The key to domain generalization is learning discriminative domain-invariant features. Adversarial domain generalization is one of the primary techniques for achieving invariant representations. For example, generative adversarial networks have been widely used, but they suffer from the problem of low intra-class diversity, which can lead to poor generalization ability. To address this issue, we propose a new method called auxiliary classifier in adversarial domain generalization (CloCls). CloCls improves the diversity of the source domains by introducing an auxiliary classifier. Combining typical task-related losses, e.g., cross-entropy loss for classification and adversarial loss for domain discrimination, our overall goal is to guarantee the learning of condition-invariant features for all source domains while increasing the diversity of the source domains. Further, we are inspired by the observation that smooth optima improve generalization for supervised learning tasks such as classification. We leverage the fact that converging to a smooth minimum with respect to the task loss stabilizes adversarial training, leading to better performance on unseen target domains and effectively enhancing domain adversarial methods. We have conducted extensive image classification experiments on benchmark datasets in domain generalization, and our model exhibits sufficient generalization ability and outperforms state-of-the-art DG methods.

Mixture-of-Experts Learner for Single Long-Tailed Domain Generalization

  • Mengzhu Wang
  • Jianlong Yuan
  • Zhibin Wang

Domain generalization (DG) refers to the task of training a model on multiple source domains and testing it on a different target domain with a different distribution. In this paper, we address a more challenging and realistic scenario known as Single Long-Tailed Domain Generalization, where only one source domain is available and the minority class in this domain has an abundance of instances in other domains. To tackle this task, we propose a novel approach called Mixture-of-Experts Learner for Single Long-Tailed Domain Generalization (MoEL), which comprises two key strategies. The first strategy is a simple yet effective data augmentation technique that leverages saliency maps to identify important regions on the original images and preserves these regions during augmentation. The second strategy is a new skill-diverse expert learning approach that trains multiple experts from a single long-tailed source domain and leverages mutual learning to aggregate their learned knowledge for the unknown target domain. We evaluate our method on various benchmark datasets, including Digits-DG, CIFAR-10-C, PACS, and DomainNet, and demonstrate its superior performance compared to previous single domain generalization methods. Additionally, an ablation study is conducted to illustrate the inner workings of our approach.

Robust Spectral Embedding Completion Based Incomplete Multi-view Clustering

  • Chao Zhang
  • Jingwen Wei
  • Bo Wang
  • Zechao Li
  • Chunlin Chen
  • Huaxiong Li

Graph-based methods have been widely used in incomplete multi-view clustering (IMVC). Most recent methods try to fill in the original missing samples or incomplete affinity matrices to obtain a complete similarity graph for the subsequent spectral clustering. However, recovering the original high-dimensional data or the complete n × n similarity matrix is usually time-consuming and noise-sensitive. Besides, they generally separate cluster indicator learning into an individual step, which may result in sub-optimal graphs or spectral embeddings for clustering. To address these problems, this paper proposes a robust Spectral Embedding Completion based IMVC (SEC-IMVC) method, which incorporates spectral embedding completion and discrete cluster indicator learning into a unified framework. SEC-IMVC performs completion on spectral embeddings, and the embedding noise is eliminated to reduce the negative influence of original data noise. The discrete cluster indicator matrix is seamlessly learned by using spectral rotation, and it can explore the first-order feature consistency among different views. To further improve the completion robustness, the second-order correlation consistency is also captured by pairwise relation alignment. We compare our method with state-of-the-art approaches on several datasets, and the experimental results show the effectiveness and advantages of our method.

SA-GDA: Spectral Augmentation for Graph Domain Adaptation

  • Jinhui Pang
  • Zixuan Wang
  • Jiliang Tang
  • Mingyan Xiao
  • Nan Yin

Graph neural networks (GNNs) have achieved impressive performance on graph-related tasks. However, most GNNs are primarily studied in the single-domain, supervised setting, which requires abundant task-specific labels and is difficult to transfer to other domains. Few works have focused on domain adaptation for graph node classification, and they mainly focus on aligning the feature spaces of the source and target domains without considering feature alignment between different categories, which may lead to confusion of classification in the target domain. Moreover, due to the scarcity of labels in the target domain, we cannot directly perform effective alignment of categories from different domains, which makes the problem more challenging. In this paper, we present Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. First, we observe that nodes with the same category in different domains exhibit similar characteristics in the spectral domain, while different classes are quite different. Following this observation, we align the category feature spaces of different domains in the spectral domain instead of aligning the whole feature space, and we theoretically prove the stability of the proposed SA-GDA. Then, we develop a dual graph convolutional network to jointly exploit local and global consistency for feature aggregation. Finally, we utilize a domain classifier with an adversarial learning submodule to facilitate knowledge transfer between different domain graphs. Experimental results on a variety of publicly available datasets reveal the effectiveness of our SA-GDA.

CONVERT: Contrastive Graph Clustering with Reliable Augmentation

  • Xihong Yang
  • Cheng Tan
  • Yue Liu
  • Ke Liang
  • Siwei Wang
  • Sihang Zhou
  • Jun Xia
  • Stan Z. Li
  • Xinwang Liu
  • En Zhu

Contrastive graph node clustering via learnable data augmentation is a hot research topic in the field of unsupervised graph learning. Existing methods learn the sampling distribution of a pre-defined augmentation to generate data-driven augmentations automatically. Although promising clustering performance has been achieved, we observe that because these strategies still rely on pre-defined augmentations, the semantics of the augmented graph can easily drift. The reliability of the augmented view semantics for contrastive learning cannot be guaranteed, thus limiting model performance. To address these problems, we propose a novel CONtrastiVe Graph ClustEring network with Reliable AugmenTation (CONVERT). Specifically, in our method, the data augmentations are processed by the proposed reversible perturb-recover network, which distills reliable semantic information by recovering the perturbed latent embeddings. Moreover, to further guarantee the reliability of semantics, a novel semantic loss is presented to constrain the network by quantifying the perturbation and recovery. Lastly, a label-matching mechanism is designed to guide the model with clustering information by aligning the semantic labels with the selected high-confidence clustering pseudo labels. Extensive experimental results on seven datasets demonstrate the effectiveness of the proposed method. We release the code and appendix of CONVERT at https://github.com/xihongyang1999/CONVERT on GitHub.

High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization

  • Jintian Ji
  • Songhe Feng

Recently, tensor-based multi-view clustering methods have achieved promising results, primarily benefiting from their superior ability to explore high-order consistent information among views. Despite significant progress, these methods inevitably suffer from several drawbacks: 1) Extremely high computational complexity restricts their feasibility for large-scale data sets. 2) Prevalently adopted tensor rank approximations (e.g., the Tensor Nuclear Norm (TNN)) tend to under-penalize small singular values, resulting in noise residuals. 3) The tensor structure is rarely utilized for investigating high-order complementarity. In light of this, we propose High-order Complementarity Induced Fast Multi-View Clustering with Enhanced Tensor Rank Minimization (CFMVC-ETR). Specifically, two sets of representation matrices are learned from the original multi-view data via a matrix factorization mechanism with a group of base matrices, and they are further reconstructed into a consistent tensor and a complementary tensor, respectively. Subsequently, a novel Enhanced Tensor Rank is imposed on the consistent tensor; it is a tighter approximation of the tensor rank and is more robust to noise when exploring high-order consistency. Meanwhile, a tensor-level constraint termed Tensorial Exclusive Regularization is imposed on the complementary tensor to enhance the view-specific features and capture high-order complementarity well. Moreover, we adopt a concatenation-fusion approach to integrate these two parts, deriving a discriminative unified embedding for the clustering task. We solve CFMVC-ETR by an efficient algorithm with good convergence. Extensive experiments on nine challenging data sets demonstrate the superiority of the proposed method.
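
For reference, the Tensor Nuclear Norm (TNN) that the paper argues under-penalizes small singular values is commonly defined via the t-SVD: an FFT along the third mode followed by the sum of singular values of every frontal slice. A minimal NumPy sketch of this standard definition (not the proposed Enhanced Tensor Rank) is:

    # Minimal sketch of the t-SVD based Tensor Nuclear Norm.
    import numpy as np

    def tensor_nuclear_norm(t):
        """t: (n1, n2, n3) real tensor; returns one common TNN definition."""
        t_hat = np.fft.fft(t, axis=2)              # frontal slices in the Fourier domain
        total = 0.0
        for k in range(t.shape[2]):
            total += np.linalg.svd(t_hat[:, :, k], compute_uv=False).sum()
        return total / t.shape[2]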

DealMVC: Dual Contrastive Calibration for Multi-view Clustering

  • Xihong Yang
  • Jin Jiaqi
  • Siwei Wang
  • Ke Liang
  • Yue Liu
  • Yi Wen
  • Suyuan Liu
  • Sihang Zhou
  • Xinwang Liu
  • En Zhu

Benefiting from their strong capacity for mining view-consistent information, multi-view contrastive clustering methods have attracted plenty of attention in recent years. However, we observe the following drawback, which prevents further improvement of clustering performance: existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing that similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.
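
One way to read the global contrastive calibration idea is as supervising a pairwise feature-similarity graph with a graph built from high-confidence pseudo labels. The sketch below is an illustrative guess at such a loss; the names, the confidence threshold, and the binary-cross-entropy choice are assumptions, not the authors' formulation.

    # Minimal sketch: align cosine similarities of fused features with a
    # "same pseudo cluster" graph restricted to high-confidence samples.
    import torch
    import torch.nn.functional as F

    def global_calibration_loss(features, pseudo_labels, confidence, conf_thresh=0.9):
        """features: (n, d); pseudo_labels: (n,) long; confidence: (n,) in [0, 1]."""
        z = F.normalize(features, dim=1)
        sim = z @ z.t()                                    # feature similarity graph
        label_graph = (pseudo_labels[:, None] == pseudo_labels[None, :]).float()
        keep = confidence >= conf_thresh
        mask = keep[:, None] & keep[None, :]               # only confident pairs contribute
        return F.binary_cross_entropy_with_logits(sim[mask], label_graph[mask])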

Bidomain Modeling Paradigm for Pansharpening

  • Junming Hou
  • Qi Cao
  • Ran Ran
  • Che Liu
  • Junling Li
  • Liang-jian Deng

Pansharpening is a challenging low-level vision task whose aim is to learn the complementary representation between spectral information and spatial detail. Despite the remarkable progress, existing deep neural network (DNN) based pansharpening algorithms are still confronted with common limitations: 1) these methods rarely consider the local specificity of different spectral bands; 2) they often extract the global detail in the spatial domain, which ignores task-related degradation, e.g., the down-sampling process of the MS image, and also suffers from a limited receptive field. In this work, we propose a novel bidomain modeling paradigm for the pansharpening problem (dubbed BiMPan), which takes into account both local spectral specificity and global spatial detail. More specifically, we first customize a specialized source-discriminative adaptive convolution (SDAConv) for every spectral band instead of sharing identical kernels across all bands as in prior works. Then, we devise a novel Fourier global modeling module (FGMM), which is capable of embracing global information while benefiting the disentanglement of image degradation. By integrating the band-aware local features and the Fourier global detail from these two functional designs, we can fuse a texture-rich and visually pleasing high-resolution MS image. Extensive experiments demonstrate that the proposed framework achieves favorable performance against current state-of-the-art pansharpening methods. The code is available at https://github.com/coder-qicao/BiMPan.
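
Fourier-domain modules obtain a global receptive field because every frequency coefficient depends on all spatial positions. The sketch below shows a generic FFC/GFNet-style Fourier block in PyTorch that filters features point-wise in the frequency domain; it is illustrative only and not the proposed FGMM, and the shapes are assumptions.

    # Minimal sketch: FFT the feature map, multiply by learnable complex
    # weights (a global, per-frequency filter), then invert the FFT.
    import torch
    import torch.nn as nn

    class FourierGlobalBlock(nn.Module):
        def __init__(self, channels, height, width):
            super().__init__()
            w_freq = width // 2 + 1                       # rfft2 keeps half the width
            self.weight = nn.Parameter(torch.randn(channels, height, w_freq, 2) * 0.02)

        def forward(self, x):                             # x: (B, C, H, W)
            freq = torch.fft.rfft2(x, norm="ortho")       # complex (B, C, H, W//2+1)
            freq = freq * torch.view_as_complex(self.weight)
            return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")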

Learning High-frequency Feature Enhancement and Alignment for Pan-sharpening

  • Yingying Wang
  • Yunlong Lin
  • Ge Meng
  • Zhenqi Fu
  • Yuhang Dong
  • Linyu Fan
  • Hedeng Yu
  • Xinghao Ding
  • Yue Huang

Pan-sharpening aims to utilize the high-resolution panchromatic (PAN) image as guidance to super-resolve the spatial resolution of the low-resolution multispectral (MS) image. The key challenge in pan-sharpening is how to effectively and precisely inject high-frequency edges and textures from the PAN image into the low-resolution MS image. To address this issue, we propose a High-frequency Feature Enhancement and Alignment Network (HFEAN) to effectively encourage high-frequency learning. To implement it, three core designs are customized: a Fourier convolution based efficient feature enhancement module (FEM), an implicit neural alignment module (INA), and a preliminary alignment module (Pre-align). To be specific, FEM employs fast Fourier convolution with an attention mechanism to achieve a mixed global-local receptive field on each scale of the high-frequency domain, thus yielding informative latent codes. INA leverages an implicit neural function to precisely align the latent codes from different scales in the continuous domain. In this way, the high-frequency signals at different scales are represented as functions of continuous coordinates, enabling precise feature alignment in a resolution-free manner. Pre-align is developed to further address the inherent misalignment between PAN and MS pairs. Extensive experiments over multiple satellite datasets validate the effectiveness of the proposed network and demonstrate its favorable performance against the existing state-of-the-art methods both visually and quantitatively. Code is available at: https://github.com/Gracewangyy/HFEAN.

Distribution Consistency based Fast Anchor Imputation for Incomplete Multi-view Clustering

  • Xingfeng Li
  • Yinghui Sun
  • Quansen Sun
  • Jia Dai
  • Zhenwen Ren

In practical scenarios, partially missing multi-view data are very common, e.g., registration information missing in social network analysis, which gives rise to incomplete multi-view clustering (IMVC). Filling in missing data quickly and effectively plays a vital role in improving IMVC but remains a significant challenge. Existing IMVC methods always use all observed data to fill in missing data, resulting in high complexity and poor imputation quality due to a lack of guidance from a consistent distribution. To break these limitations, we propose a novel Distribution Consistency based Fast Anchor Imputation for Incomplete Multi-view Clustering (DCFAI-IMVC) method. Specifically, to eliminate the interference of redundant and fraudulent features in the original space, incomplete data are first projected into a consensus latent space, where we dynamically learn a small number of anchors to achieve fast and accurate imputation. Then, we employ the global distribution information of the observed embedding representations to further ensure a consistent distribution between the learned anchors and the observed embedding representations. Ultimately, a tensor low-rank constraint is imposed on the bipartite graphs to investigate the high-order correlations hidden in the data. DCFAI-IMVC enjoys linear complexity in terms of the sample number, which gives it great potential to handle large-scale IMVC tasks. Extensive experiments on multiple public datasets validate the effectiveness, superiority, and efficiency of our method compared with recent advances.

Visual Causal Scene Refinement for Video Question Answering

  • Yushen Wei
  • Yang Liu
  • Hong Yan
  • Guanbin Li
  • Liang Lin

Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover the critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.

Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks

  • Hongye Liu
  • Xianhai Xie
  • Yang Gao
  • Zhou Yu

The pretrain-then-finetune paradigm has been widely used in various unimodal and multimodal tasks. However, finetuning all the parameters of a pre-trained model becomes prohibitive as the model size grows exponentially. To address this issue, the adapter mechanism, which freezes the pre-trained model and only finetunes a few extra parameters, has been introduced and delivers promising results. Most studies on adapter architectures are dedicated to unimodal or bimodal tasks, while adapter architectures for trimodal tasks have not been investigated yet. This paper introduces a novel Long Short-Term Trimodal Adapter (LSTTA) approach for video understanding tasks involving audio, visual, and language modalities. Based on pre-trained models from the three modalities, the designed adapter module is inserted between the sequential blocks to model the dense interactions across the three modalities. Specifically, LSTTA consists of two types of complementary adapter modules, namely the long-term semantic filtering module and the short-term semantic interaction module. The long-term semantic filtering module aims to characterize the temporal importance of the video frames, while the short-term semantic interaction module models local interactions within short periods. Compared to previous state-of-the-art trimodal learning methods pre-trained on a large-scale trimodal corpus, LSTTA is more flexible and can inherit any powerful unimodal or bimodal models. Experimental results on four typical trimodal learning tasks show the effectiveness of LSTTA over existing state-of-the-art methods.
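
The underlying adapter mechanism, i.e., freezing a pre-trained block and training only a small residual module inserted after it, can be sketched as follows. This is a generic bottleneck adapter with illustrative dimensions, not the LSTTA modules themselves.

    # Minimal sketch of a bottleneck adapter: only the adapter's parameters train,
    # the residual connection preserves the frozen pretrained path.
    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        def __init__(self, dim, reduction=8):
            super().__init__()
            self.down = nn.Linear(dim, dim // reduction)
            self.up = nn.Linear(dim // reduction, dim)
            self.act = nn.GELU()

        def forward(self, x):                         # x: (B, T, dim) token sequence
            return x + self.up(self.act(self.down(x)))

    # Usage sketch: freeze the backbone block, train only the adapter.
    # for p in pretrained_block.parameters():
    #     p.requires_grad = False
    # block_with_adapter = nn.Sequential(pretrained_block, BottleneckAdapter(dim=768))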

ReCo: A Dataset for Residential Community Layout Planning

  • Xi Chen
  • Yun Xiong
  • Siqi Wang
  • Haofen Wang
  • Tao Sheng
  • Yao Zhang
  • Yu Ye

Layout planning is centrally important in the field of architecture and urban design. Among the various basic units carrying urban functions, the residential community plays a vital part in supporting human life. Therefore, the layout planning of residential communities has always been of concern, and has attracted particular attention since the advent of deep learning, which facilitates automated layout generation and spatial pattern recognition. However, the research community generally suffers from a lack of residential community layout benchmarks and high-quality datasets, which hampers future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulties of large-scale real-world residential data acquisition and long-term expert screening. In order to address these issues and advance a benchmark dataset for various intelligent spatial design and analysis applications in the development of smart cities, we introduce the Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset related to real-world communities to date. The ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for residential community layout related urban design tasks, e.g., generative layout design, morphological pattern recognition, and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, two Generative Adversarial Network (GAN) based generative models are further applied to the dataset. We expect the ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset and related code can be found at: https://github.com/FDUDSDE/ReCo-Dataset.

Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection

  • Runmin Cong
  • Hongyu Liu
  • Chen Zhang
  • Wei Zhang
  • Feng Zheng
  • Ran Song
  • Sam Kwong

By integrating complementary information from the RGB image and the depth map, the ability of salient object detection (SOD) to handle complex and challenging scenes can be improved. In recent years, the important role of Convolutional Neural Networks (CNNs) in feature extraction and cross-modality interaction has been fully explored, but it is still insufficient for modeling global long-range dependencies within and across modalities. To this end, we introduce a CNN-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement (PICR-Net). On the one hand, considering the prior correlation between the RGB modality and the depth modality, an attention-triggered cross-modality point-aware interaction (CmPI) module is designed to explore the feature interaction of different modalities with positional constraints. On the other hand, to alleviate the block effect and detail destruction problems naturally brought by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation. Extensive experiments on five RGB-D SOD datasets show that the proposed network achieves competitive results in both quantitative and qualitative comparisons. Our code is publicly available at: https://github.com/rmcong/PICR-Net_ACMMM23.

Multi-view Self-Expressive Subspace Clustering Network

  • Jinrong Cui
  • Yuting Li
  • Yulu Fu
  • Jie Wen

Advanced deep multi-view subspace clustering methods are based on the self-expressive model and have achieved impressive performance. However, most existing works have several limitations: 1) they endure high computational complexity when learning a consistent affinity matrix, impeding their capacity to handle large-scale multi-view data; 2) the global and local structure information of multi-view data remains under-explored. To tackle these challenges, we propose a simple but comprehensive framework called the Multi-view Self-Expressive Subspace Clustering (MSESC) network. Specifically, we design a deep metric network to replace the conventional self-expressive model, which can directly and efficiently produce the intrinsic similarity values of any instance pairs across all views. Moreover, our method explores global and local structure information from the connectivity of instance pairs across views and the nearest neighbors of instance pairs within each view, respectively. By integrating global and local structure information within a unified framework, MSESC can learn a high-quality shared affinity matrix for better clustering performance. Extensive experimental results indicate the superiority of MSESC compared to several state-of-the-art methods.

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

  • Jian Huang
  • Yanli Ji
  • Yang Yang
  • Heng Tao Shen

Effective alignment and fusion of multimodal features remain a significant challenge for multimodal sentiment analysis. In various multimodal applications, the text modal exhibits a significant advantage of compact yet expressive representation ability. In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide other modalities for learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. We propose a semantic representation interactive learning module to learn concise semantic representation tokens for audio and video modalities under the guidance of the text modality, ensuring semantic alignment of representations among multiple modalities. Furthermore, we design a semantic relationship interactive learning module, which calculates a self-attention matrix for each modality and controls their consistency to enable the semantic relationship alignment for multiple modalities. Finally, we present a two-stage interactive fusion solution to bridge the modality gap for multimodal fusion and sentiment analysis. Extensive experiments are performed on the CMU-MOSEI, CMU-MOSI, and UR-FUNNY datasets, and experiment results demonstrate the effectiveness of our proposed approach.

Entropy Neural Estimation for Graph Contrastive Learning

  • Yixuan Ma
  • Xiaolin Zhang
  • Peng Zhang
  • Kun Zhan

Contrastive learning on graphs aims at extracting distinguishable high-level representations of nodes. We theoretically illustrate that the entropy of a dataset is approximated by maximizing the lower bound of the mutual information across different views of a graph, i.e., entropy is estimated by a neural network. Based on this finding, we propose a simple yet effective subset sampling strategy to contrast pairwise representations between views of a dataset. In particular, we randomly sample nodes and edges from a given graph to build the input subset for a view. Two views are fed into a parameter-shared Siamese network to extract the high-dimensional embeddings and estimate the information entropy of the entire graph. For the learning process, we propose to optimize the network using two objectives, simultaneously. Concretely, the input of the contrastive loss consists of positive and negative pairs. Our selection strategy of pairs is different from previous works and we present a novel strategy to enhance the representation ability by selecting nodes based on cross-view similarities. We enrich the diversity of the positive and negative pairs by selecting highly similar samples and totally different data with the guidance of cross-view similarity scores, respectively. We also introduce a cross-view consistency constraint on the representations generated from the different views. We conduct experiments on seven graph benchmarks, and the proposed approach achieves competitive performance compared to the current state-of-the-art methods. The source code is available at https://github.com/kunzhan/M-ILBO.
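
The mutual-information lower bound mentioned above is typically estimated with an InfoNCE-style objective over cross-view embeddings. The sketch below shows such an estimator in PyTorch under assumed notation; it is not the released M-ILBO code.

    # Minimal sketch: minimizing this cross-entropy over cross-view similarities
    # maximizes the InfoNCE lower bound on mutual information between two views.
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.5):
        """z1, z2: (n, d) embeddings of the same nodes under two sampled views."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature             # (n, n) cross-view similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)        # diagonal pairs are positives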

Cross-modal and Cross-medium Adversarial Attack for Audio

  • Liguo Zhang
  • Zilin Tian
  • Yunfei Long
  • Sizhao Li
  • Guisheng Yin

Acoustic waves are forms of energy that propagate through various mediums. They can be represented by different modalities, such as auditory signals and visual patterns. The two modalities are often described as a one-dimensional waveform in the time domain and a two-dimensional spectrogram in the frequency domain. Most acoustic signal processing methods use single-modal data as input for training models. This poses a challenge for black-box adversarial attacks on audio signals because the input modality is also unknown to the attacker. In fact, no existing methods explore the cross-modal transferability of adversarial perturbations. This paper investigates the cross-modal transferability from waveform to spectrogram. We argue that the data distributions in the sample spaces of the different modalities have mapping relations, and we propose a novel decision-based cross-modal and cross-medium adversarial attack method. Specifically, it generates an initial example with cross-modal attack capability by combining random natural noise, then iteratively reduces the perturbation to enhance its invisibility. It incorporates the constraints of the spectrogram sample space while iteratively optimizing adversarial perturbations for black-box audio classification models. The perturbation is imperceptible to humans, both visually and aurally. Extensive experiments demonstrate that our approach can launch attacks on classification models for sound waves and spectrograms that share the same audio signal. Furthermore, we explore the cross-medium capability of our proposed adversarial attack strategy, which can target processing models for acoustic signals propagating in air and seawater. The proposed method has preeminent invisibility and generalization compared to other methods.
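
The waveform-to-spectrogram mapping that a cross-modal perturbation must survive is simply a short-time Fourier transform of the same audio signal. A minimal PyTorch sketch of that mapping follows; the FFT size and hop length are illustrative values, not the paper's configuration.

    # Minimal sketch: the same (possibly perturbed) waveform can be consumed
    # either directly or as its log-magnitude STFT spectrogram.
    import torch

    def to_spectrogram(waveform, n_fft=512, hop_length=128):
        """waveform: (num_samples,) mono audio tensor."""
        window = torch.hann_window(n_fft)
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return torch.log1p(spec.abs())                 # (freq_bins, frames)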

Unsupervised Multiplex Graph learning with Complementary and Consistent Information

  • Liang Peng
  • Xin Wang
  • Xiaofeng Zhu

Unsupervised multiplex graph learning (UMGL) has been shown to achieve significant effectiveness for different downstream tasks by exploring both complementary information and consistent information among multiple graphs. However, previous methods usually overlook the issues in practical applications, i.e., the out-of-sample issue and the noise issue. To address the above issues, in this paper, we propose an effective and efficient UMGL method to explore both complementary and consistent information. To do this, our method employs multiple MLP encoders rather than graph convolutional network (GCN) to conduct representation learning with two constraints, i.e., preserving the local graph structure among nodes to handle the out-of-sample issue, and maximizing the correlation of multiple node representations to handle the noise issue. Comprehensive experiments demonstrate that our proposed method achieves superior effectiveness and efficiency over the comparison methods and effectively tackles those two issues. Code is available at https://github.com/LarryUESTC/CoCoMG.

GCL: Gradient-Guided Contrastive Learning for Medical Image Segmentation with Multi-Perspective Meta Labels

  • Yixuan Wu
  • Jintai Chen
  • Jiahuan Yan
  • Yiheng Zhu
  • Danny Z. Chen
  • Jian Wu

Since annotating medical images for segmentation tasks commonly incurs expensive costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited a great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable "semantic contradiction" when combining different meta labels. In this paper, we tackle the issue of "semantic contradiction" in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain a better high-level semantic recognition ability. Moreover, we emphasize that the fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.

Multi-Spectral Image Stitching via Spatial Graph Reasoning

  • Zhiying Jiang
  • Zengxi Zhang
  • Jinyuan Liu
  • Xin Fan
  • Risheng Liu

Multi-spectral image stitching leverages the complementarity between infrared and visible images to generate a robust and reliable wide field-of-view (FOV) scene. The primary challenge of this task is to explore the relations between multi-spectral images for aligning and integrating multi-view scenes. Capitalizing on the strengths of Graph Convolutional Networks (GCNs) in modeling feature relationships, we propose a spatial graph reasoning based multi-spectral image stitching method that effectively distills the deformation and integration of multi-spectral images across different viewpoints. To accomplish this, we embed multi-scale complementary features from the same view position into a set of nodes. The correspondence across different views is learned through powerful dense feature embeddings, where both inter- and intra-correlations are developed to exploit cross-view matching and enhance inner feature disparity. By introducing long-range coherence along the spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features, generating informative and reliable wide FOV scenes. Moreover, we release a challenging dataset named ChaMS, comprising both real-world and synthetic sets with significant parallax, providing a new option for comprehensive evaluation. Extensive experiments demonstrate that our method surpasses the state-of-the-art methods.

Propagation is All You Need: A New Framework for Representation Learning and Classifier Training on Graphs

  • Jiaming Zhuo
  • Can Cui
  • Kun Fu
  • Bingxin Niu
  • Dongxiao He
  • Yuanfang Guo
  • Zhen Wang
  • Chuan Wang
  • Xiaochun Cao
  • Liang Yang

Graph Neural Networks (GNNs) have been the standard toolkit for processing non-Euclidean spatial data owing to their powerful capability in graph representation learning. Unfortunately, their training strategy for network parameters is inefficient, since it is directly inherited from classic Neural Networks (NNs) and ignores the characteristics of GNNs. To alleviate this issue, experimental analyses are performed to investigate the knowledge captured in the classifier parameters during network training. We conclude that the parameter features, i.e., the column vectors of the classifier parameter matrix, are cluster representations with high discriminability. After a theoretical analysis, we further conclude that the discriminability of these features is obtained from the feature propagation from nodes to parameters. Furthermore, an experiment verifies that, compared with cluster centroids, the parameter features have more potential for augmenting the feature propagation between nodes. Accordingly, a novel GNN-specific training framework is proposed that simultaneously updates node representations and classifier parameters via a unified feature propagation scheme. Moreover, two augmentation schemes are implemented for the framework, named Full Propagation Augmentation (FPA) and Simplified Full Propagation Augmentation (SFPA). Specifically, FPA augments the feature propagation of each node with the updated classifier parameters, while SFPA only augments nodes with the classifier parameters corresponding to their clusters. Theoretically, FPA is equivalent to optimizing a novel graph learning objective, which demonstrates the universality of the proposed framework with respect to existing GNNs. Extensive experiments demonstrate the superior performance and universality of the proposed framework.
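
The node-to-node feature propagation that the framework builds on can be written as repeated multiplication by the symmetrically normalized adjacency matrix with self-loops. The NumPy sketch below shows only this plain propagation step; the paper's additional propagation to and from classifier parameters is not reproduced here, and the dense-matrix form is an illustrative simplification.

    # Minimal sketch of k-step feature propagation A_hat^k X with the
    # symmetrically normalized, self-loop-augmented adjacency.
    import numpy as np

    def propagate(adj, x, k=2):
        """adj: (n, n) adjacency, x: (n, d) node features."""
        a_hat = adj + np.eye(adj.shape[0])             # add self-loops
        d_inv_sqrt = a_hat.sum(axis=1) ** -0.5
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        for _ in range(k):
            x = a_norm @ x                             # one propagation step
        return x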

Cross-modal Unsupervised Domain Adaptation for 3D Semantic Segmentation via Bidirectional Fusion-then-Distillation

  • Yao Wu
  • Mingwei Xing
  • Yachao Zhang
  • Yuan Xie
  • Jianping Fan
  • Zhongchao Shi
  • Yanyun Qu

Cross-modal Unsupervised Domain Adaptation (UDA) has become a research hotspot because it reduces the laborious annotation of target domain samples. Existing methods only mutually mimic the outputs of cross-modality in each domain, which enforces agreement of the class probability distributions across domains. However, these methods ignore the complementarity brought by the modality-fusion representation in cross-modal learning. In this paper, we propose a cross-modal UDA method for 3D semantic segmentation via Bidirectional Fusion-then-Distillation, named BFtD-xMUDA, which explores cross-modal fusion in UDA and realizes distribution consistency between the outputs of the two domains not only for the 2D image and the 3D point cloud but also for 2D/3D and fusion. Our method contains three significant components: a Model-agnostic Feature Fusion Module (MFFM), Bidirectional Distillation (B-Distill), and Cross-modal Debiased Pseudo-Labeling (xDPL). MFFM is employed to generate cross-modal fusion features for establishing a latent space, which enforces maximum correlation and complementarity between the two heterogeneous modalities. B-Distill is introduced to exploit bidirectional knowledge distillation, which includes cross-modality and cross-domain fusion distillation, achieving domain-modality alignment. xDPL is designed to model the uncertainty of pseudo-labels through a self-training scheme. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several adaptation scenarios.

SESSION: Oral Session III: Understanding Multimedia Content -- Vision and Language

Distortion-aware Transformer in 360° Salient Object Detection

  • Yinjie Zhao
  • Lichen Zhao
  • Qian Yu
  • Lu Sheng
  • Jing Zhang
  • Dong Xu

With the emergence of VR and AR, 360° data attracts increasing attention from the computer vision and multimedia communities. Typically, 360° data is projected into 2D ERP (equirectangular projection) images for feature extraction. However, existing methods cannot handle the distortions that result from the projection, hindering the development of 360-data-based tasks. Therefore, in this paper, we propose a Transformer-based model called DATFormer to address the distortion problem. We tackle this issue from two perspectives. Firstly, we introduce two distortion-adaptive modules. The first is a Distortion Mapping Module, which guides the model to pre-adapt to distorted features globally. The second module is a Distortion-Adaptive Attention Block that reduces local distortions on multi-scale features. Secondly, to exploit the unique characteristics of 360° data, we present a learnable relation matrix and use it as part of the positional embedding to further improve performance. Extensive experiments are conducted on three public datasets, and the results show that our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods. The source code is available at https://github.com/yjzhao19981027/DATFormer/.

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

  • Zixiao Wang
  • Hongtao Xie
  • Yuxin Wang
  • Jianjun Xu
  • Boqiang Zhang
  • Yongdong Zhang

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at https://github.com/wzx99/CLIPOCR.

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

  • Bo Zou
  • Chao Yang
  • Chengbin Quan
  • Youjian Zhao

The tremendous progress of vision-to-language retrieval over recent years is fueled by contrastive vision-language pretraining (VLP), such as CLIP. However, contrastive methods do not exhibit the same level of performance on other downstream tasks (e.g., video question answering and natural language grounding). One possible reason is that they ignore the misalignment between vision and language, especially the absence of spatial information in language. To mitigate this issue, we start from a new perspective and propose a contrastive VLP framework with spatial reconstruction on text (SpaceCLIP). Specifically, we introduce a unique reconstruction method that assigns text representations the same spatial structure as images or videos, and a pretraining objective, SpatialNCE, to reduce the computational overhead and ensure performance on downstream tasks. Empirically, we show SpaceCLIP outperforms other methods with performance gains ranging from 2.1% up to 9.0% on MSRVTT and EgoCLIP multiple-choice question answering, 2.5% up to 11.0% on EPIC-KITCHENS-100 and MSRVTT multi-instance retrieval, and 0.31% up to 7.2% on the Ego4D natural language query benchmark.

Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding

  • Xu Huang
  • Jin Liu
  • Zhizhong Zhang
  • Yuan Xie

Cross-modal recipe retrieval is an emerging visual-textual retrieval task, which aims at matching food images with the corresponding recipes. Although large-scale Vision-Language Pre-training (VLP) models have achieved impressive performance on a wide range of downstream tasks, they still perform unsatisfactorily on this cross-modal retrieval task due to the following two problems: (1) Features from food images and recipes need to be aligned, simply fine-tuning the pre-trained VLP model's image encoder does not explicitly help with this goal. (2) The text content in the recipe is more structured than the text caption in the VLP model's pre-training corpus, which prevents the VLP model from adapting to the recipe retrieval task. In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the ability of large-scale VLP models. CIP enables us to learn the structured recipe information and therefore allows for aligning visual-textual representations without fine-tuning. Furthermore, we construct a recipe encoder termed Adaptive Recipe Merger (ARM) based on hierarchical Transformers, encouraging the model to learn more effective recipe representations. Extensive experiments on the public Recipe1M dataset demonstrate the superiority of our proposed method by outperforming the state-of-the-art methods on cross-modal recipe retrieval task.

Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD

  • Shuhan Kong
  • Liang Li
  • Beichen Zhang
  • Wenyu Wang
  • Bin Jiang
  • Chenggang Yan
  • Changhao Xu

Joint video moment retrieval (MR) and highlight detection (HD) aim to find relevant video moments according to the query text. Existing methods are fully supervised based on manual annotation, and their coarse multi-modal information interactions easily lose details about the video and text. In addition, some tasks introduce weakly supervised learning with random masks, but a single masking scheme forces the model to focus on masked words and ignore multi-modal contextual information. In view of this, we attempt the weakly supervised joint tasks (MR+HD) and propose Dynamic Contrastive Learning with Pseudo-Sample Intervention (CPI) for better multi-modal video comprehension. First, we design pseudo-samples over random masks for more efficient contrastive learning. We introduce a proportional sampling strategy for pseudo-samples to ensure the semantic difference between the pseudo-samples and the query text. This balances the over-reliance of a single random mask on global text semantics and makes the model learn multimodal context from each word fairly. Second, we design a dynamic intervention contrastive loss to dynamically enhance the core feature-matching ability of the model. We add pseudo-sample intervention when negative proposals are close to positive proposals. This helps the model overcome the vision confusion phenomenon and achieve semantic similarity instead of word similarity. Extensive experiments demonstrate the effectiveness of CPI and the potential of weakly supervised joint tasks.

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

  • Zheng Yuan
  • Qiao Jin
  • Chuanqi Tan
  • Zhengyun Zhao
  • Hongyi Yuan
  • Fei Huang
  • Songfang Huang

Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. The pre-trained models and codes are published at https://github.com/GanjinZero/RAMM.
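
The retrieval step can be pictured as a nearest-neighbour search over pre-computed ITC embeddings of the pretraining corpus. The sketch below uses cosine similarity and assumed tensor shapes for illustration; it is not the released RAMM code.

    # Minimal sketch: retrieve the k most similar pretraining image-text pairs
    # by cosine similarity of ITC embeddings; the retrieved items would then be
    # fused with the query via an attention module.
    import torch
    import torch.nn.functional as F

    def retrieve_topk(query_emb, corpus_embs, k=4):
        """query_emb: (d,); corpus_embs: (N, d) pre-computed ITC embeddings."""
        q = F.normalize(query_emb, dim=0)
        c = F.normalize(corpus_embs, dim=1)
        scores = c @ q                                  # (N,) cosine similarities
        topk = torch.topk(scores, k)
        return topk.indices, topk.values                # indices into the pretraining set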

RTQ: Rethinking Video-language Understanding Based on Image-text Model

  • Xiao Wang
  • Yaoyu Li
  • Tian Gan
  • Zheng Zhang
  • Jingjing Lv
  • Liqiang Nie

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

  • Shanshan Zhong
  • Zhongzhan Huang
  • Weushao Wen
  • Jinghui Qin
  • Liang Lin

Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, existing models have limitations in semantic understanding and commonsense reasoning when the input prompts are concise narratives, resulting in low-quality image generation. To improve the capacity for narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to the complex prompts and transfer the knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation so that it can acquire powerful semantic understanding and reasoning capabilities to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason over concise natural language without image quality degradation. Our approach can make text-to-image diffusion models easier to use with a better user experience, which demonstrates its potential for further advancing the development of user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts. The code is released at https://github.com/Qrange-group/SUR-adapter.

Face Encryption via Frequency-Restricted Identity-Agnostic Attacks

  • Xin Dong
  • Rui Wang
  • Siyuan Liang
  • Aishan Liu
  • Lihua Jing

Billions of people share their daily life images on social media every day. However, malicious collectors use deep face recognition systems to easily steal biometric information (e.g., faces) from these images. Some studies have been conducted to generate encrypted face photos using adversarial attacks by introducing imperceptible perturbations to reduce face information leakage. However, existing studies still lack strong black-box feasibility and natural visual appearance, which limits their practicality for privacy protection. To address these problems, we propose a frequency-restricted identity-agnostic (FRIA) framework to encrypt face images against unauthorized face recognition without access to personal information. Regarding the weak black-box feasibility, we observe that the representations of the average feature are similar across multiple face recognition models; thus we propose to utilize the average feature computed on a dataset crawled from the Internet as the target to guide the generation, which is also agnostic to the identities of unknown face recognition systems. In addition, low-frequency perturbations are naturally more perceptible to the human visual system; inspired by this, we restrict the perturbation in the low-frequency facial regions via the discrete cosine transform to guarantee visual naturalness. Extensive experiments on several face recognition models demonstrate that our FRIA outperforms other state-of-the-art methods in generating more natural encrypted faces while attaining high black-box attack success rates of 96%. In addition, we validate the efficacy of FRIA using a real-world black-box commercial API, which reveals the potential of FRIA in practice. Our code can be found at https://github.com/XinDong10/FRIA.
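
Restricting a perturbation in a chosen frequency band can be done by zeroing DCT coefficients. The SciPy sketch below zeroes an illustrative low-frequency square of a single-channel perturbation and transforms back; the band size and the per-channel handling are assumptions, not the authors' exact configuration.

    # Minimal sketch: suppress low-frequency energy of a perturbation in the
    # 2D DCT domain, then return to the spatial domain.
    import numpy as np
    from scipy.fft import dctn, idctn

    def restrict_low_frequency(perturbation, cutoff=8):
        """perturbation: (H, W) array; cutoff: side of the low-frequency square.
        For a color image, apply this per channel."""
        coeffs = dctn(perturbation, norm="ortho")
        coeffs[:cutoff, :cutoff] = 0.0                 # zero the low-frequency block
        return idctn(coeffs, norm="ortho")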

Emotion-Prior Awareness Network for Emotional Video Captioning

  • Peipei Song
  • Dan Guo
  • Xun Yang
  • Shengeng Tang
  • Erkun Yang
  • Meng Wang

Emotional video captioning (EVC) is an emerging task that describes the factual content together with the inherent emotion expressed in a video. It is crucial for the EVC task to effectively perceive subtle and ambiguous visual emotion cues during caption generation. However, existing captioning methods usually overlook the learning of emotions in user-generated videos, making the generated sentences bland and soulless.

To address this issue, this paper proposes a new emotional captioning perspective in a human-like perception-priority manner, i.e., first perceiving the inherent emotion and then leveraging the perceived emotion cue to support caption generation. Specifically, we devise an Emotion-Prior Awareness Network (EPAN). It mainly benefits from a novel tree-structured emotion learning module involving both catalog-level psychological categories and lexical-level usual words to achieve the goal of explicit and fine-grained emotion perception. Besides, we develop a novel subordinate emotion masking mechanism between the catalog level and lexical level that facilitates coarse-to-fine emotion learning. Afterward, with the emotion prior, we can effectively decode the emotional caption by exploiting the complementation of visual, textual, and emotional semantics. In addition, we also introduce three simple yet effective optimization objectives, which can significantly boost the emotion learning from the perspectives of emotional captioning, hierarchical emotion classification, and emotional contrastive learning. Sufficient experimental results on three benchmark datasets clearly demonstrate the advantages of our proposed EPAN over existing SOTA methods in both semantic and emotional metrics. The extensive ablation study and visualization analysis further reveal the good interpretability of our emotional video captioning method. Code will be made available at https://github.com/songpipi/EPAN.

TE-KWS: Text-Informed Speech Enhancement for Noise-Robust Keyword Spotting

  • Dong Liu
  • Qirong Mao
  • Lijian Gao
  • Qinghua Ren
  • Zhenghan Chen
  • Ming Dong

Keyword spotting (KWS) presents a formidable challenge, particularly in high-noise environments. Traditional denoising algorithms that rely solely on speech have difficulty recovering speech that has been severely corrupted by noise. In this investigation, we develop an adaptive text-informed denoising model to bolster reliable keyword identification in the presence of considerable noise degradation. The proposed TE-KWS incorporates a tripartite branch structure: the speech branch (SB) takes noisy speech as input and provides the raw speech information, the alignment branch (AB) accommodates aligned text input, which facilitates accurate restoration of the corresponding speech when text with alignment is available, and the text branch (TB) handles unaligned text, which prompts the model to autonomously learn the alignment between speech and text. To make the proposed denoising model more beneficial for KWS, following the training of the whole model, the alignment branch (AB) is frozen and the model is fine-tuned by leveraging its speech restoration and forced alignment capabilities. Subsequently, the input of the text branch (TB) is replaced with designated keywords, and a heavier denoising penalty is applied to the keyword periods, thereby explicitly intensifying the speech restoration ability of the model for keywords. Finally, Combined Adversarial Domain Adaptation (CADA) is implemented to enhance the robustness of KWS with regard to data pre- and post-speech enhancement (SE). Experimental results indicate that our approach not only markedly ameliorates highly corrupted speech, achieving SOTA performance for marginally corrupted speech, but also bolsters the efficacy and generalizability of prevailing mainstream KWS models.

A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval

  • Jiancheng Pan
  • Qing Ma
  • Cong Bai

This paper presents a Prior Instruction Representation framework (PIR) for remote sensing image-text retrieval, aimed at remote sensing vision-language understanding tasks, to solve the semantic noise problem. Our highlight is the proposal of a paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Concretely, two progressive attention encoder (PAE) structures, Spatial-PAE and Temporal-PAE, are proposed to perform long-range dependency modeling to enhance key feature representation. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, exploits the prior-guided knowledge of remote sensing scene recognition by building a belief matrix to select key features and reduce the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step to enhance text representation capability. A cluster-wise affiliation loss is proposed to constrain inter-class relations and to reduce the semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that using prior knowledge instruction can enhance vision and text representations and outperform the state-of-the-art methods on two benchmark datasets, RSICD and RSITMD. Code is available at https://github.com/Zjut-MultimediaPlus/PIR-pytorch.

PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models

  • Nirmalendu Prakash
  • Han Wang
  • Nguyen Khoi Hoang
  • Ming Shan Hee
  • Roy Ka-Wei Lee

The proliferation of social media has given rise to a new form of communication: memes. Memes are multimodal and often contain a combination of text and visual elements that convey meaning, humor, and cultural significance. While meme analysis has been an active area of research, little work has been done on unsupervised multimodal topic modeling of memes, which is important for content moderation, social media analysis, and cultural studies. We propose PromptMTopic, a novel multimodal prompt-based model designed to learn topics from both text and visual modalities by leveraging the language modeling capabilities of large language models. Our model effectively extracts and clusters topics learned from memes, considering the semantic interaction between the text and visual modalities. We evaluate our proposed model through extensive experiments on three real-world meme datasets, which demonstrate its superiority over state-of-the-art topic modeling baselines in learning descriptive topics in memes. Additionally, our qualitative analysis shows that PromptMTopic can identify meaningful and culturally relevant topics from memes. Our work contributes to the understanding of the topics and themes of memes, a crucial form of communication in today's society. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression

  • Yue Lv
  • Jinxi Xiang
  • Jun Zhang
  • Wenming Yang
  • Xiao Han
  • Wei Yang

The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is non-trivial, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately 19% across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly 5% BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures. Our project is available at https://github.com/llvy21/DUIC.
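
The low-rank update itself follows the familiar LoRA pattern: the frozen weight is kept, and only two small matrices are learned (here, the part that would be written into the bitstream). The sketch below illustrates this with a linear layer; the rank and initialization are placeholders, and real compression decoders adapt convolutional layers rather than this toy module.

    # Minimal sketch: adapt a frozen layer as base(x) + x A^T B^T, where A and B
    # are small rank-r matrices that are cheap to transmit per image.
    import torch
    import torch.nn as nn

    class LowRankAdaptedLinear(nn.Module):
        def __init__(self, frozen_linear, rank=4):
            super().__init__()
            self.base = frozen_linear                   # pretrained, kept frozen
            d_out, d_in = frozen_linear.weight.shape
            self.A = nn.Parameter(torch.zeros(rank, d_in))
            self.B = nn.Parameter(torch.zeros(d_out, rank))
            nn.init.normal_(self.A, std=0.01)           # B starts at zero: no initial change

        def forward(self, x):                           # x: (..., d_in)
            return self.base(x) + x @ self.A.t() @ self.B.t()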

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

  • Leigang Qu
  • Shengqiong Wu
  • Hao Fei
  • Liqiang Nie
  • Tat-Seng Chua

In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes the high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at https://layoutllm-t2i.github.io/.

POAR: Towards Open Vocabulary Pedestrian Attribute Recognition

  • Yue Zhang
  • Suchen Wang
  • Shichao Kan
  • Zhenyu Weng
  • Yigang Cen
  • Yap-peng Tan

Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian. Recent methods often address the PAR problem by training a multi-label classifier with predefined attribute classes, but they can hardly exhaust all possible pedestrian attributes in the real world. To tackle this problem, we propose a novel Pedestrian Open-Attribute Recognition (POAR) approach by formulating the problem as a task of image-text search. Our approach employs a Transformer-based Encoder with a Masking Strategy (TEMS) to focus on the attributes of specific pedestrian parts (e.g., head, upper body, lower body, feet, etc.), and introduces a set of attribute tokens to encode the corresponding attributes into visual embeddings. Each attribute category is described as a natural language sentence and encoded by the text encoder. Then, we compute the similarity between the visual and text embeddings to find the best attribute descriptions for the input images. To handle multiple attributes of a single pedestrian, we propose a Many-To-Many Contrastive (MTMC) loss with masked tokens. In addition, we propose a Grouped Knowledge Distillation (GKD) method to minimize the disparity between visual embeddings and unseen attribute text embeddings. We evaluate our proposed method on three PAR datasets with an open-attribute setting. The results demonstrate the effectiveness of our method as a strong baseline for the POAR task. Our code is available at https://github.com/IvyYZ/POAR.
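
As a rough illustration of the image-text search formulation, the snippet below scores part-level visual attribute tokens against text embeddings of attribute sentences with cosine similarity. The function name rank_attributes, the embedding dimensions, and the random stand-in tensors are assumptions; the TEMS encoder, the MTMC loss, and GKD are not shown.

```python
import torch
import torch.nn.functional as F

def rank_attributes(visual_tokens: torch.Tensor, text_embs: torch.Tensor, top_k: int = 3):
    """visual_tokens: (num_parts, dim) attribute-token embeddings for one pedestrian.
    text_embs: (num_attributes, dim) embeddings of sentences such as 'a person wearing a hat'.
    Returns the top-k attribute indices per body part."""
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    sim = v @ t.t()                        # (num_parts, num_attributes) cosine similarities
    return sim.topk(top_k, dim=-1).indices

# Example with random stand-in embeddings (4 body parts, 40 candidate attribute sentences).
best = rank_attributes(torch.randn(4, 512), torch.randn(40, 512))
```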

PointCRT: Detecting Backdoor in 3D Point Cloud via Corruption Robustness

  • Shengshan Hu
  • Wei Liu
  • Minghui Li
  • Yechao Zhang
  • Xiaogeng Liu
  • Xianlong Wang
  • Leo Yu Zhang
  • Junhui Hou

Backdoor attacks for point clouds have elicited mounting interest with the proliferation of deep learning. The point cloud classifiers can be vulnerable to malicious actors who seek to manipulate or fool the model with specific backdoor triggers. Detecting and rejecting backdoor samples during the inference stage can effectively alleviate backdoor attacks. Recently, some black-box test-time backdoor sample detection methods have been proposed in the 2D image domain, without any underlying assumptions about the backdoor triggers. However, upon examination, we have found that these detection techniques are not effective for 3D point clouds. As a result, there is a pressing need to bridge the gap for the development of a universal approach that is specifically designed for 3D point clouds.

In this paper, we propose the first test-time backdoor sample detection method for 3D point clouds that makes no assumptions about the backdoor triggers, called Point Clouds Corruption Robustness Test (PointCRT). Based on the fact that the corruption robustness of clean samples remains relatively stable across various backdoor models, we propose the corruption robustness score to map the features into a high-dimensional space. The corruption robustness score is a vector evaluated by label consistency, whose elements are the minimum severity levels of corruption that change the label prediction of the victim model. The trigger is then identified by detecting abnormal corruption robustness scores through nonlinear classification. Comprehensive experiments demonstrate that PointCRT handles all cases with an average AUC above 0.934 and an F1 score above 0.864, an improvement of 18%-28% on ModelNet40. Our codes are available at: https://github.com/CGCL-codes/PointCRT.
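
The corruption robustness score described above can be illustrated with a short sketch: for each corruption type, find the minimum severity level that flips the victim model's prediction. The classify callable and the corruptions list are hypothetical placeholders, and the nonlinear detector trained on these scores is omitted.

```python
import numpy as np

def corruption_robustness_score(point_cloud, classify, corruptions, max_severity=5):
    """Returns a vector with one entry per corruption type: the minimum severity
    level that changes the victim model's predicted label (max_severity + 1 if none does)."""
    clean_label = classify(point_cloud)
    score = []
    for corrupt in corruptions:                      # e.g. jitter, rotation, point dropout, ...
        flipped_at = max_severity + 1                # stays maximal if the label never flips
        for severity in range(1, max_severity + 1):
            if classify(corrupt(point_cloud, severity)) != clean_label:
                flipped_at = severity
                break
        score.append(flipped_at)
    return np.array(score)   # this vector is then fed to a nonlinear classifier to flag backdoor samples
```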

Blind Image Super-resolution with Rich Texture-Aware Codebook

  • Rui Qin
  • Ming Sun
  • Fangyuan Zhang
  • Xing Wen
  • Bin Wang

Blind super-resolution (BSR) methods based on high-resolution (HR) reconstruction codebooks have achieved promising results in recent years. However, we find that a codebook based on HR reconstruction may not effectively capture the complex correlations between low-resolution (LR) and HR images. In detail, multiple HR images may produce similar LR versions due to complex blind degradations, so codebooks that depend only on HR images have limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution correspondence of textures between LR and HR images. PTPM uses patch-wise semantic pre-training to correct the misperception of texture similarity in the high-level semantic regularization. By taking advantage of this, RTCNet effectively avoids the misalignment of confusing textures between HR and LR in BSR scenarios. Experiments show that RTCNet outperforms state-of-the-art methods on various benchmarks by 0.16 to 0.46 dB.

V2Depth: Monocular Depth Estimation via Feature-Level Virtual-View Simulation and Refinement

  • Zizhang Wu
  • Zhuozheng Li
  • Zhi-Gang Fan
  • Yunzhe Wu
  • Jian Pu
  • Xianzhi Li

Because a single image alone provides few spatial cues, many monocular depth estimation methods leverage stereo or multi-view images to learn the spatial information of a scene in a self-supervised manner. However, these methods have limited performance gains since they cannot exploit sufficient 3D geometry cues during inference, where only monocular images are available. In this work, we present V2Depth, a novel coarse-to-fine framework with Virtual-View feature simulation for supervised monocular Depth estimation. Specifically, we first design a virtual-view feature simulator that leverages novel view synthesis and contrastive learning to generate virtual-view feature maps. In this way, we explicitly provide representative spatial geometry for subsequent depth estimation in both the training and inference stages. We then introduce a 3DVA-Refiner to iteratively optimize the predicted depth map. During the optimization process, 3D-aware virtual attention is developed to capture global spatial-context correlations, maintaining feature consistency across different views and the estimation integrity of the 3D scene, such as objects with occlusion relationships. Decisive improvements over state-of-the-art approaches on three benchmark datasets across all metrics demonstrate the superiority of our method.

GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos

  • Kai Chen
  • Zhipeng Wei
  • Jingjing Chen
  • Zuxuan Wu
  • Yu-Gang Jiang

Existing cross-domain transferable attacks mostly focus on exploring the adversarial transferability across homomodal domains, while the adversarial transferability across heteromodal domains, e.g., image domains to video domains, has received less attention. This paper investigates cross-modal transferable attacks from image domains to video domains with the generator-oriented approach, i.e., crafting adversarial perturbations for each frame of video clips with the perturbation generator trained in the ImageNet domain to attack target video models. To this end, we propose an effective Generative Cross-Modal Attacks (GCMA) framework to enhance adversarial transferability from image domains to video domains. To narrow the domain gap between image and video data, we first propose a random motion module that warps images with synthetic random optical flows. We then integrate the random motion module into the feature disruption loss to incorporate additional temporal cues in the training phase. Specifically, feature disruption loss minimizes the cosine similarity between intermediate features of warped benign and adversarial images. Furthermore, motivated by the positive correlation between transferability and temporal consistency of adversarial video clips, we also introduce a temporal consistency loss that maximizes the cosine similarity between intermediate features of warped adversarial images and adversarial counterparts of warped benign images. Finally, GCMA trains the perturbation generator by simultaneously optimizing feature disruption loss and temporal consistency loss. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance on Kinetics-400 and UCF-101. Our code is available at https://github.com/kay-ck/GCMA.
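
The two training objectives named above can be written down compactly. The sketch below assumes intermediate features have already been extracted from warped benign images, their adversarial counterparts, and adversarial versions of warped benign images; the weighting factor lam is an assumption rather than the paper's setting, and the perturbation generator and random motion module are not shown.

```python
import torch
import torch.nn.functional as F

def feature_disruption_loss(f_warped_benign: torch.Tensor, f_warped_adv: torch.Tensor):
    # Minimize cosine similarity between features of warped benign and warped adversarial images.
    return F.cosine_similarity(f_warped_benign, f_warped_adv, dim=-1).mean()

def temporal_consistency_loss(f_warped_adv: torch.Tensor, f_adv_of_warped_benign: torch.Tensor):
    # Maximize cosine similarity, i.e. minimize its negation.
    return -F.cosine_similarity(f_warped_adv, f_adv_of_warped_benign, dim=-1).mean()

def gcma_objective(f_wb, f_wa, f_awb, lam: float = 1.0):
    # Combined objective used to train the perturbation generator (lam is an assumed weight).
    return feature_disruption_loss(f_wb, f_wa) + lam * temporal_consistency_loss(f_wa, f_awb)
```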

AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition

  • Lianyu Hu
  • Liqing Gao
  • Zekang Liu
  • Chi-Man Pun
  • Wei Feng

Raw videos contain considerable feature redundancy; in many cases only a portion of the frames is sufficient for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). We propose a novel adaptive model (AdaBrowse) that dynamically selects the most informative subsequence from input video sequences by modelling this problem as a sequential decision task. Specifically, we first utilize a lightweight network to quickly scan input videos and extract coarse features. These features are then fed into a policy network to select a subsequence to process. The corresponding subsequence is finally inferred by a normal CSLR model for sentence prediction. As only a portion of the frames is processed in this procedure, the total computation can be considerably reduced. Besides temporal redundancy, we are also interested in whether the inherent spatial redundancy can be seamlessly integrated to achieve further efficiency, i.e., dynamically selecting the lowest feasible input resolution for each sample; this model is referred to as AdaBrowse+. Extensive experimental results on four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+, which achieve accuracy comparable to state-of-the-art methods with 1.44X higher throughput and 2.12X fewer FLOPs. Comparisons with other commonly-used 2D CNNs and adaptive efficient methods further verify the effectiveness of AdaBrowse. Code is available at https://github.com/hulianyuyy/AdaBrowse.
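
A hedged inference-time sketch of the adaptive browsing idea follows: a lightweight scanner yields coarse features, a policy picks a subsequence, and only that clip is passed to the full CSLR model. All three sub-modules and the (start, length) policy output format are stand-ins; the sequential-decision training of the policy is not shown.

```python
import torch
import torch.nn as nn

class AdaptiveBrowser(nn.Module):
    """scanner: cheap per-frame feature extractor; policy: returns an integer (start, length)
    pair; recognizer: a normal CSLR model. All three are assumed stand-ins."""
    def __init__(self, scanner: nn.Module, policy: nn.Module, recognizer: nn.Module):
        super().__init__()
        self.scanner, self.policy, self.recognizer = scanner, policy, recognizer

    @torch.no_grad()
    def forward(self, video: torch.Tensor):
        # video: (T, C, H, W) frames of one sign language sentence
        coarse = self.scanner(video)               # quick scan -> coarse features
        start, length = self.policy(coarse)        # pick the most informative subsequence
        clip = video[start:start + length]         # only these frames are fully processed
        return self.recognizer(clip)
```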

Dynamic Triple Reweighting Network for Automatic Femoral Head Necrosis Diagnosis from Computed Tomography

  • Lingfeng Li
  • Gangming Zhao
  • Yizhou Yu
  • Jinpeng Li

Avascular necrosis of the femoral head (AVNFH) is a common orthopedic disease that seriously affects the quality of life of middle-aged and elderly people. Early AVNFH is difficult to diagnose due to its complex symptoms. In recent years, some works have applied deep learning algorithms to find traces of early AVNFH in X-rays or magnetic resonance imaging (MRI). However, X-rays struggle to reveal hidden features due to tissue overlap, while MRI is sensitive but requires more imaging time and is expensive. This study aims to develop a computer-aided diagnosis system for early AVNFH based on computed tomography (CT), which provides layer-wise features and is less costly. To achieve this, a large-scale dataset for AVNFH was collected and annotated by experienced doctors. We propose the Dynamic Triple Reweighting Network (DTRNet), which integrates AVNFH classification and weakly-supervised localization. DTRNet incorporates nested multi-instance learning as the first and second reweighting, and structure regularization as the third reweighting, to identify the disease and localize the lesion region. Since nested multi-instance learning is inapplicable when there are few positive samples in the patch set, we propose a dynamic pseudo-package module to compensate for this limitation. Experimental results show that DTRNet is superior to the baselines in AVNFH classification. In addition, it can locate lesions to provide more information for assisting clinical decisions. The desensitized data and codes have been made available at: https://github.com/tomas-lilingfeng/DTRNet.

Category-Level Articulated Object 9D Pose Estimation via Reinforcement Learning

  • Liu Liu
  • Jianming Du
  • Hao Wu
  • Xun Yang
  • Zhenguang Liu
  • Richang Hong
  • Meng Wang

Human life is populated with articulated objects. Current category-level articulated object 9D pose estimation (ArtOPE) methods usually face the challenges of requiring a shared object representation, kinematics-agnostic pose modeling, and self-occlusions. In this paper, we propose a novel framework called Articulated object 9D Pose Estimation via Reinforcement Learning (ArtPERL), which formulates category-level ArtOPE as a reinforcement learning problem. Given a point cloud or RGB-D image input, ArtPERL first retrieves the part-sensitive articulated object as a reference point cloud, and then introduces a joint-centric pose modeling strategy that estimates the 9D pose by fitting joint states via reinforced agent training. Finally, we further propose a pose optimization step that refines the predicted 9D pose subject to kinematic constraints. We evaluate ArtPERL on various datasets ranging from synthetic point clouds to real-world multi-hinged objects. Experiments demonstrate the superior performance and robustness of our ArtPERL. Our work provides a new perspective on category-level articulated object 9D pose estimation and has the potential to be applied in many fields, including robotics, augmented reality, and autonomous driving.

RetouchingFFHQ: A Large-scale Dataset for Fine-grained Face Retouching Detection

  • Qichao Ying
  • Jiaxin Liu
  • Sheng Li
  • Haisheng Xu
  • Zhenxing Qian
  • Xinpeng Zhang

The widespread use of face retouching filters on short-video platforms has raised concerns about the authenticity of digital appearances and the impact of deceptive advertising. To address these issues, there is a pressing need for advanced face retouching detection techniques. However, the lack of large-scale and fine-grained face retouching datasets has been a major obstacle to progress in this field. In this paper, we introduce RetouchingFFHQ, a large-scale and fine-grained face retouching dataset that contains over half a million conditionally-retouched images. RetouchingFFHQ stands out from previous datasets due to its large scale, high quality, fine granularity, and customization. By including four typical types of face retouching operations and different retouching levels, we extend binary face retouching detection into a fine-grained, multi-retouching-type, and multi-retouching-level estimation problem. Additionally, we propose a Multi-granularity Attention Module (MAM) as a plugin for CNN backbones for enhanced cross-scale representation learning. Extensive experiments using different baselines as well as our proposed method on RetouchingFFHQ show decent performance on face retouching detection.

Slow-Fast Time Parameter Aggregation Network for Class-Incremental Lip Reading

  • Xueyi Zhang
  • Chengwei Zhang
  • Tao Wang
  • Jun Tang
  • Songyang Lao
  • Haizhou Li

Class-incremental learning, which can circumvent data privacy issues and avoid the high training costs associated with joint training, has yet to be explored in the field of lip-reading. In this paper, we introduce a benchmark for Class-Incremental Lip-Reading (CILR). To simultaneously improve plasticity for new classes and stability for old classes in incremental learning, we propose a Slow-Fast Time Parameter Aggregation Network (TPAN) that decouples the representation learning of new and old knowledge, taking into account the task characteristics of lip-reading. The TPAN comprises two dynamically evolving branches: one uses fast gradient descent and the other employs slow momentum updates to retain old knowledge while adapting to new knowledge. Additionally, to achieve efficient knowledge transfer of the incremental model, we design a Hybrid Sequence-Distribution Distillation (HSDD) strategy to transfer knowledge from both a temporal-feature view and a classification-probability view. We present a comprehensive comparison of the proposed method with previous state-of-the-art class-incremental learning methods on the most commonly used lip-reading datasets, LRW and LRW1000. The experimental results show that the proposed method reduces the effect of catastrophic forgetting and improves incremental accuracy.
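
The slow/fast branch update pattern can be sketched as follows, assuming the slow branch is an exponential-moving-average copy of the fast branch; the momentum value of 0.999 and the deep-copy initialization are illustrative assumptions, not the paper's exact configuration.

```python
import copy
import torch

def make_slow_branch(fast_branch: torch.nn.Module) -> torch.nn.Module:
    """Initialize the slow branch as a frozen copy of the fast branch."""
    slow = copy.deepcopy(fast_branch)
    for p in slow.parameters():
        p.requires_grad = False
    return slow

@torch.no_grad()
def momentum_update(slow: torch.nn.Module, fast: torch.nn.Module, m: float = 0.999):
    """Fast branch is trained by ordinary gradient descent elsewhere; the slow branch
    tracks it with momentum to retain old knowledge while adapting to new classes."""
    for ps, pf in zip(slow.parameters(), fast.parameters()):
        ps.mul_(m).add_(pf, alpha=1.0 - m)   # ps = m * ps + (1 - m) * pf
```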

Text-based Person Search without Parallel Image-Text Data

  • Yang Bai
  • Jingyao Wang
  • Min Cao
  • Chen Chen
  • Ziqiang Cao
  • Liqiang Nie
  • Min Zhang

Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description. Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data (μ-TBPS), in which only non-parallel images and texts, or even image-only data, can be adopted. Towards this end, we propose a two-stage framework, generation-then-retrieval (GTR), to first generate the corresponding pseudo text for each image and then perform the retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of the person image, which first utilizes a set of instruction prompts to activate an off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, and then converts the extracted attributes into a textual description via a finetuned large language model or a hand-crafted template. In the retrieval stage, considering the noise that the generated texts introduce into model training, we develop a confidence score-based training scheme that enables more reliable texts to contribute more during training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR achieves promising performance without relying on parallel image-text data.

Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation

  • Jiawei Liang
  • Siyuan Liang
  • Aishan Liu
  • Ke Ma
  • Jingzhi Li
  • Xiaochun Cao

Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model. Since the teacher model perceives data in a way different from humans, existing KD methods only distill knowledge that is consistent with labels annotated by human experts while neglecting knowledge that is not consistent with human perception, which results in insufficient distillation and sub-optimal performance. In this paper, we propose inconsistent knowledge distillation (IKD), which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions. We start by considering the teacher model's counter-intuitive perceptions of frequency and non-robust features. Unlike previous works that exploit fine-grained features or introduce additional regularizations, we extract inconsistent knowledge by providing diverse input using data augmentation. Specifically, we propose a sample-specific data augmentation to transfer the teacher model's ability to capture distinct frequency components and suggest an adversarial feature augmentation to extract the teacher model's perceptions of non-robust features in the data. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors (at most +1.0 mAP). Our codes will be made available at https://github.com/JWLiang007/IKD.git.

CARIS: Context-Aware Referring Image Segmentation

  • Sun-Ao Liu
  • Yiheng Zhang
  • Zhaofan Qiu
  • Hongtao Xie
  • Yongdong Zhang
  • Ting Yao

Referring image segmentation aims to segment the target object described by a natural-language utterance. Recent approaches typically distinguish pixels by aligning pixel-wise visual features with linguistic features extracted from the referring description. Nevertheless, such a free-form description only specifies certain discriminative attributes of the target object or its relations to a limited number of objects, which fails to represent the rich visual context adequately. The stand-alone linguistic features are therefore unable to align with all visual concepts, resulting in inaccurate segmentation. In this paper, we propose to address this issue by incorporating rich visual context into linguistic features for sufficient vision-language alignment. Specifically, we present Context-Aware Referring Image Segmentation (CARIS), a novel architecture that enhances the contextual awareness of linguistic features via sequential vision-language attention and learnable prompts. Technically, CARIS develops a context-aware mask decoder with sequential bidirectional cross-modal attention to integrate the linguistic features with visual context, which are then aligned with pixel-wise visual features. Furthermore, two groups of learnable prompts are employed to delve into additional contextual information from the input image and facilitate the alignment with non-target pixels, respectively. Extensive experiments demonstrate that CARIS achieves new state-of-the-art performances on three public benchmarks. Code is available at https://github.com/lsa1997/CARIS.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

  • Shizhou Zhang
  • Qingchun Yang
  • De Cheng
  • Yinghui Xing
  • Guoqiang Liang
  • Peng Wang
  • Yanning Zhang

In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images with 260,559 annotated bounding boxes for 2,644 identities appearing in both UAV and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where UAVs could serve as a powerful complement to ground surveillance cameras. To more realistically simulate actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different locations, with a variety of view angles, flight attitudes and flight modes. The dataset therefore has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images across 9 rich real-world scenarios. On the basis of the G2APS benchmark dataset, we provide a detailed analysis of current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performance on G2APS as well as the two previous public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code are available at https://github.com/yqc123456/HKD_for_person_search.

Sparse Sharing Relation Network for Panoptic Driving Perception

  • Fan Jiang
  • Zilei Wang

An efficient and accurate perception system is critical for autonomous driving, covering traffic object detection, drivable area segmentation, and lane detection. Most previous works do not consider the spatial and semantic cues in traffic scenes. In this paper, we propose a novel multi-task learning network to exploit these priors. Specifically, to model the co-occurrence and spatial relationships of traffic objects, we propose a Graph Convolutional Network (GCN) block operating on the patches of feature maps. It enables adaptive discovery and incorporation of semantic and spatial relationships in the feature space. Furthermore, we propose a sub-feature sharing method to mitigate negative transfer in multi-task learning. On the basis of a fully shared base network, we split the feature space of different tasks along the channel dimension, resulting in shared and private features for each task. This allows the network parameters to be selectively updated by different tasks during training. Experimental results on the challenging BDD100K dataset demonstrate that our proposed approach achieves consistent improvements with fewer parameters, and sets a new state of the art in terms of accuracy and speed.
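
A minimal sketch of the sub-feature sharing idea is shown below: the channel dimension of the shared backbone features is split into a shared slice and one private slice per task. The split ratio, the slicing scheme, and the concatenation into a task head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SubFeatureSharing(nn.Module):
    """Splits backbone features along channels into a shared part and per-task private parts."""
    def __init__(self, channels: int, num_tasks: int, shared_ratio: float = 0.5):
        super().__init__()
        self.shared = int(channels * shared_ratio)
        self.private = (channels - self.shared) // num_tasks
        self.num_tasks = num_tasks

    def forward(self, feat: torch.Tensor, task_id: int) -> torch.Tensor:
        # feat: (B, C, H, W) from the fully shared base network.
        shared = feat[:, : self.shared]
        start = self.shared + task_id * self.private
        private = feat[:, start : start + self.private]
        # Only the shared slice and this task's private slice reach the task-specific head,
        # so gradients from each task touch different private channels.
        return torch.cat([shared, private], dim=1)
```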

SESSION: Oral Session IV: Engaging Users with Multimedia -- Emotional and Social Signals

AcFormer: An Aligned and Compact Transformer for Multimodal Sentiment Analysis

  • Daoming Zong
  • Chaoyue Ding
  • Baoxiang Li
  • Jiakui Li
  • Ken Zheng
  • Qunyan Zhou

Multimodal Sentiment Analysis (MSA) is a popular research topic aimed at utilizing multimodal signals for understanding human emotions. The primary approach to solving this task is to develop complex fusion techniques. However, the heterogeneity and unaligned nature between modalities pose significant challenges to fusion. Additionally, existing methods lack consideration for the efficiency of modal fusion. To tackle these issues, we propose AcFormer, which contains two core ingredients: i) contrastive learning within and across modalities to explicitly align different modality streams before fusion; and ii) pivot attention for multimodal interaction/fusion. The former encourages positive triplets of image-audio-text to have similar representations in contrast to negative ones. The latter introduces attention pivots that can serve as cross-modal information bridges and limit cross-modal attention to a certain number of fusion pivot tokens. We evaluate AcFormer on multiple MSA tasks, including multimodal emotion recognition, humor detection, and sarcasm detection. Empirical evidence shows that AcFormer achieves the optimal performance with minimal computation cost compared to previous state-of-the-art methods. Our code is publicly available at https://github.com/dingchaoyue/AcFormer.

Freq-HD: An Interpretable Frequency-based High-Dynamics Affective Clip Selection Method for in-the-Wild Facial Expression Recognition in Videos

  • Zeng Tao
  • Yan Wang
  • Zhaoyu Chen
  • Boyang Wang
  • Shaoqi Yan
  • Kaixun Jiang
  • Shuyong Gao
  • Wenqiang Zhang

In-the-wild dynamic facial expression recognition (DFER) is challenging due to several high-dynamics factors, such as limited dynamic expression-related frames and variable non-expression noise in facial expression sequences. To provide more expression-related clips for DFER models, we propose a novel and interpretable frequency-based method (Freq-HD) for high-dynamics affective clip selection. It can select clips containing pure expression changes from sequences and aid different DFER network structures in recognizing in-the-wild dynamic facial expressions more accurately and efficiently. We first design a novel spatial-temporal frequency analysis (STFA) module to compute the dynamics value of each clip using sliding windows and spatial-temporal frequency analysis. Moreover, we propose a multi-band complementary selection (MBC) module to correct the inappropriate response of the dynamics values in different spatial frequency bands of STFA when expression-irrelevant noise occurs. Specifically, the MBC uses an ingenious mapping to generate inhibitory factors that complement and separate the dynamics of expressions and non-expressions in different frequency bands. Freq-HD can select the most expression-correlated clips and their constituent frames, and can be incorporated into any existing DFER model. We extensively evaluate Freq-HD on two in-the-wild datasets and four DFER baselines, showing that our method significantly improves the subsequent network performance while using fewer input frames and reducing computation cost. Further ablation studies and visualization analysis provide additional empirical evidence of the effectiveness of our method.

StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning

  • Peiguang Jing
  • Xianyi Liu
  • Ji Wang
  • Yinwei Wei
  • Liqiang Nie
  • Yuting Su

Emotion distribution learning has gained increasing attention with the growing tendency to express emotions through images. To cope with the emotion ambiguity arising from human subjectivity, many previous methods focused on learning appropriate representations from the holistic or significant parts of images. However, they rarely consider establishing connections with stylistic information, even though it can lead to a better understanding of images. In this paper, we propose a style-guided high-order attention network for image emotion distribution learning, termed StyleEDL, which interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents. Specifically, we explore the intra- and inter-layer correlations among GRAM-based stylistic representations, and meanwhile exploit an adversary-constrained high-order attention mechanism to capture potential interactions between subtle visual parts. In addition, we introduce a stylistic graph convolutional network to dynamically generate content-dependent emotion representations to benefit the final emotion distribution learning. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our proposed StyleEDL compared to state-of-the-art methods. The implementation is released at: https://github.com/liuxianyi/StyleEDL.

Variance-Aware Bi-Attention Expression Transformer for Open-Set Facial Expression Recognition in the Wild

  • Junjie Zhu
  • Bingjun Luo
  • Ao Sun
  • Jinghang Tan
  • Xibin Zhao
  • Yue Gao

Despite the great accomplishments of facial expression recognition (FER) models in closed-set scenarios, they still lack open-world robustness when it comes to handling unknown samples. To address the demands of operating in an open environment, open-set FER models should improve their performance in rejecting unknown samples while maintaining their efficiency in recognizing known expressions. With this goal in mind, we propose an open-set FER framework named Variance-Aware Bi-Attention Expression Transformer (VBExT), which enhances conventional closed-set FER models with open-world robustness for unknown samples. Specifically, to make full use of the expression representation capabilities of learned features, we introduce a bi-attention feature augmentation mechanism that learns the important regions and integrates the hierarchical features extracted by the emotional CNN backbone. We also propose a variance-aware distribution modeling method that adapts to the diverse distribution of different expression classes in the open environment, thereby enhancing the detection ability of unknown expressions. Additionally, we have constructed a Fine-Grained Light Facial Expression dataset that includes 30 different light brightnesses to better validate the efficiency of VBExT. Extensive experiments and ablation studies show that VBExT significantly improves the performance of open-set FER and achieves state-of-the-art results on CFEE (lab, basic), RAF-DB (wild, basic+compound), and FGL-FE (multiple light brightnesses, basic).

AffectFAL: Federated Active Affective Computing with Non-IID Data

  • Zixin Zhang
  • Fan Qi
  • Shuai Li
  • Changsheng Xu

Federated affective computing, which deploys traditional affective computing in a distributed framework, achieves a trade-off between privacy and utility, and offers a wide variety of applications in business and society. However, the expensive annotation cost of obtaining reliable emotion labels at the local client remains a barrier to the effective use of local emotional data. Therefore, we propose a federated active affective paradigm to improve the performance of federated affective computing with a limited annotation budget on the client. A major challenge in federated active learning is the inconsistency between the active sampling goals of global and local models, particularly in scenarios with Non-IID data across clients, which exacerbates the problem. To address the above challenge, we propose AffectFAL, a federated active affective computing framework. It incorporates a Preference-aware Group Aggregation module, which obtains global models representing the different emotional preferences among clients. We also devise a tailored De-biased Federated Active Sampling strategy with an improved vote entropy, facilitating class balancing of labeled samples and alleviating the problem of sampling goals inconsistency between the global and local models. We evaluate AffectFAL on diverse benchmarks (image, video and physiological signal) and experimental settings for affective computing. Thorough comparisons with other active sampling strategies demonstrate our method's advantages in affective computing for Non-IID federated learning.
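
As background for the sampling strategy mentioned above, the snippet below computes plain vote entropy over a committee of model predictions; the paper's improved, class-balancing variant is not reproduced, and committee_preds is a hypothetical input (e.g., predictions from the global and local model versions).

```python
import numpy as np

def vote_entropy(committee_preds: np.ndarray, num_classes: int) -> np.ndarray:
    """committee_preds: (num_members, num_samples) integer labels from several models.
    Returns one entropy score per unlabeled sample; higher means more disagreement,
    so the sample is a stronger candidate for annotation."""
    members, n = committee_preds.shape
    entropies = np.zeros(n)
    for c in range(num_classes):
        votes = (committee_preds == c).sum(axis=0) / members   # vote fraction for class c
        nonzero = votes > 0
        entropies[nonzero] -= votes[nonzero] * np.log(votes[nonzero])
    return entropies
```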

ASTDF-Net: Attention-Based Spatial-Temporal Dual-Stream Fusion Network for EEG-Based Emotion Recognition

  • Peiliang Gong
  • Ziyu Jia
  • Pengpai Wang
  • Yueying Zhou
  • Daoqiang Zhang

Emotion recognition based on electroencephalography (EEG) has attracted significant attention and achieved considerable advances in the fields of affective computing and human-computer interaction. However, most existing studies ignore the coupling and complementarity of complex spatiotemporal patterns in EEG signals. Moreover, how to exploit and fuse crucial discriminative aspects in high redundancy and low signal-to-noise ratio EEG signals remains a great challenge for emotion recognition. In this paper, we propose a novel attention-based spatial-temporal dual-stream fusion network, named ASTDF-Net, for EEG-based emotion recognition. Specifically, ASTDF-Net comprises three main stages: first, the collaborative embedding module is designed to learn a joint latent subspace to capture the coupling of complicated spatiotemporal information in EEG signals. Second, stacked parallel spatial and temporal attention streams are employed to extract the most essential discriminative features and filter out redundant task-irrelevant factors. Finally, the hybrid attention-based feature fusion module is proposed to integrate significant features discovered from the dual-stream structure to take full advantage of the complementarity of the diverse characteristics. Extensive experiments on two publicly available emotion recognition datasets indicate that our proposed approach consistently outperforms state-of-the-art methods.

SESSION: Oral Session V: Engaging Users with Multimedia -- Multimedia Search and Recommendation

Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

  • Yishu Liu
  • Qingpeng Wu
  • Zheng Zhang
  • Jingyi Zhang
  • Guangming Lu

With the powerful representation ability and privileged efficiency, deep cross-modal hashing (DCMH) has become an emerging fast similarity search technique. Prior studies primarily focus on exploring pairwise similarities across modalities, but fail to comprehensively capture the multi-grained semantic correlations during intra- and inter-modal negotiation. To tackle this issue, this paper proposes a novel Multi-granularity Interactive Transformer Hashing (MITH) network, which hierarchically considers both coarse- and fine-grained similarity measurements across different modalities in one unified transformer-based framework. To the best of our knowledge, this is the first attempt for multi-granularity transformer-based cross-modal hashing. Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge with global-local knowledge distillation under the guidance of implicit conceptual category-level representations. Moreover, we construct a contrastive inter-modal alignment module to mine modality-independent semantic concept correspondences with instance- and token-wise contrastive learning, respectively. Such a collaborative learning paradigm can jointly alleviate the heterogeneity and semantic gaps among different modalities from a multi-granularity perspective, yielding discriminative modality-invariant hash codes. Extensive experiments on multiple representative cross-modal datasets demonstrate the consistent superiority of MITH over the existing state-of-the-art baselines. The codes are available at https://github.com/DarrenZZhang/MITH.

Equivariant Learning for Out-of-Distribution Cold-start Recommendation

  • Wenjie Wang
  • Xinyu Lin
  • Liuhui Wang
  • Fuli Feng
  • Yinwei Wei
  • Tat-Seng Chua

Recommender systems rely on user-item interactions to learn Collaborative Filtering (CF) signals and easily under-recommend the cold-start items without historical interactions. To boost cold-start item recommendation, previous studies usually incorporate item features (e.g., micro-video content features) into CF models. They essentially align the feature representations of warm-start items with CF representations during training, and then adopt the feature representations of cold-start items to make recommendations. However, cold-start items might have feature distribution shifts from warm-start ones due to different upload times. As such, these cold-start item features fall into the underrepresented feature space, where their feature representations cannot align well with CF signals, causing poor cold-start recommendation.

To combat item feature shifts, the key lies in pushing feature representation learning to well represent the shifted item features and align with the CF representations in the underrepresented feature space. To this end, we propose an equivariant learning framework, which aims to achieve equivariant alignment between item features, feature representations, and CF representations in the underrepresented feature space. Specifically, since cold-start items are unavailable for training, we interpolate the features and CF representations of two underrepresented warm items to simulate the feature shifts. The interpolated feature representations are then regulated to achieve equivariant alignment with the interpolated features and CF representations via three alignment losses. We instantiate the proposed framework on two competitive cold-start models, and empirical results on three datasets validate that the framework significantly improves cold-start recommendation.
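
The interpolation step described above can be sketched in a few lines; the Beta-distributed mixing coefficient and the specific tensors are assumptions, and the three alignment losses applied to the interpolated representations are omitted.

```python
import torch

def interpolate_warm_items(feat_a: torch.Tensor, feat_b: torch.Tensor,
                           cf_a: torch.Tensor, cf_b: torch.Tensor, alpha: float = 0.5):
    """Mix the content features and CF representations of two underrepresented warm items
    to simulate the feature shift of a cold item (mixing coefficient drawn from Beta(alpha, alpha))."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_feat = lam * feat_a + (1 - lam) * feat_b     # interpolated item features
    mixed_cf = lam * cf_a + (1 - lam) * cf_b           # interpolated CF representations
    return mixed_feat, mixed_cf, lam

# The feature encoder's output on mixed_feat would then be regularized to stay aligned
# with both mixed_feat and mixed_cf, which is the equivariant alignment idea.
```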

Target-Guided Composed Image Retrieval

  • Haokun Wen
  • Xian Zhang
  • Xuemeng Song
  • Yinwei Wei
  • Liqiang Nie

Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can retrieve the target image for a multimodal query, including a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook the conflict relationship modeling between the reference image and the modification text for improving the multimodal query composition and the adaptive matching degree modeling for promoting the ranking of the candidate images that could present different levels of matching degrees with the given query. To address these two limitations, in this work, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). In particular, TG-CIR first extracts the unified global and local attribute features for the reference/target image and the modification text with the contrastive language-image pre-training model (CLIP) as the backbone, where an orthogonal regularization is introduced to promote the independence among the attribute features. Then TG-CIR designs a target-query relationship-guided multimodal query composition module, comprising a target-free student composition branch and a target-based teacher composition branch, where the target-query relationship is injected into the teacher branch for guiding the conflict relationship modeling of the student branch. Last, apart from the conventional batch-based classification loss, TG-CIR additionally introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.

Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

  • HaoXuan Li
  • Yi Bin
  • Junrong Liao
  • Yang Yang
  • Heng Tao Shen

Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet of <anchor, positive, negative> is important for effectively training the model, e.g., hard negatives make the model learn efficiently and effectively. However, we observe that existing methods mainly employ the most similar samples as hard negatives, which may not be true negatives. In other words, samples with high similarity that are not paired with the anchor may retain positive semantic associations, and we call them false negatives. Repelling these false negatives in triplet loss would mislead the semantic representation learning and result in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which alleviates the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately via their similarities with the anchor, based on the features extracted from image and text encoders. Then we calculate the false negative probability of a given sample based on its similarity with the anchor and the above distributions via Bayes' rule, and this probability is employed as the sampling weight during the negative sampling process. Since a small batch may not contain any false negatives, we design a memory module with momentum to retain a large negative buffer and implement our negative sampling strategy over the buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights for the simple negatives with a cut-down strategy. Extensive experiments are conducted on Flickr30K and MS-COCO, and the results demonstrate the superiority of our proposed false negative elimination strategy. The code is available at https://github.com/LuminosityX/FNE.
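
A simplified version of the Bayes-rule weighting is sketched below, assuming Gaussian fits of the positive and negative similarity distributions and an equal prior; the momentum memory buffer and the cut-down reweighting of simple negatives are not shown, and all parameter names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def false_negative_prob(s, pos_mu, pos_sigma, neg_mu, neg_sigma, prior_pos=0.5):
    """Probability that a candidate negative with similarity s to the anchor is a false
    negative, via Bayes' rule over Gaussian similarity distributions (a simplification)."""
    p_s_given_pos = norm.pdf(s, pos_mu, pos_sigma)
    p_s_given_neg = norm.pdf(s, neg_mu, neg_sigma)
    num = p_s_given_pos * prior_pos
    return num / (num + p_s_given_neg * (1.0 - prior_pos) + 1e-12)

def sampling_weights(similarities, **dist_params):
    """Turn false-negative probabilities into normalized sampling weights:
    likely false negatives are sampled less often as hard negatives."""
    fn_prob = false_negative_prob(np.asarray(similarities), **dist_params)
    weights = 1.0 - fn_prob
    return weights / weights.sum()
```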

A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation

  • Xin Zhou
  • Zhiqi Shen

Multimodal recommender systems utilizing multimodal features (e.g., images and textual descriptions) typically show better recommendation accuracy than general recommendation models based solely on user-item interactions. Generally, prior work fuses multimodal features into item ID embeddings to enrich item representations, thus failing to capture the latent semantic item-item structures. In this context, LATTICE proposes to learn the latent structure between items explicitly and achieves state-of-the-art performance for multimodal recommendations. However, we argue the latent graph structure learning of LATTICE is both inefficient and unnecessary. Experimentally, we demonstrate that freezing its item-item structure before training can also achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed as FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. Theoretically, we examine the design of FREEDOM through a graph spectral perspective and demonstrate that it possesses a tighter upper bound on the graph spectrum. In denoising the user-item interaction graph, we devise a degree-sensitive edge pruning method, which rejects possibly noisy edges with a high probability when sampling the graph. We evaluate the proposed model on three real-world datasets and show that FREEDOM can significantly outperform the strongest baselines. Compared with LATTICE, FREEDOM achieves an average improvement of 19.07% in recommendation accuracy while reducing its memory cost up to 6x on large graphs. The source code is available at: https://github.com/enoche/FREEDOM.
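
One plausible reading of degree-sensitive edge pruning is sketched below: edges are sampled for keeping with probability inversely related to the degrees of their endpoints, so edges around very popular nodes are dropped more often. The inverse-square-root weighting and the keep ratio are assumptions; FREEDOM's exact scheme may differ, and the frozen item-item graph is not shown.

```python
import numpy as np

def degree_sensitive_sample(edges: np.ndarray, num_nodes: int, keep_ratio: float = 0.9,
                            rng=np.random.default_rng(0)):
    """edges: (E, 2) array of (user, item) index pairs on a shared node-id space.
    Returns a pruned edge set sampled with degree-sensitive probabilities."""
    degree = np.bincount(edges.ravel(), minlength=num_nodes).astype(float)
    weight = 1.0 / np.sqrt(degree[edges[:, 0]] * degree[edges[:, 1]])  # high degree -> low weight
    prob = weight / weight.sum()
    keep = rng.choice(len(edges), size=int(keep_ratio * len(edges)),
                      replace=False, p=prob)
    return edges[keep]
```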

ProtoHPE: Prototype-guided High-frequency Patch Enhancement for Visible-Infrared Person Re-identification

  • Guiwei Zhang
  • Yongfei Zhang
  • Zichang Tan

Visible-Infrared person re-identification is challenging due to the large modality gap. To bridge the gap, most studies heavily rely on the correlation of visible-infrared holistic person images, which may perform poorly under severe distribution shifts. In contrast, we find that some cross-modal correlated high-frequency components contain discriminative visual patterns and are less affected by variations such as wavelength, pose, and background clutter than holistic images. Therefore, we are motivated to bridge the modality gap based on such high-frequency components, and propose Prototype-guided High-frequency Patch Enhancement (ProtoHPE) with two core designs. First, to enhance the representation ability of cross-modal correlated high-frequency components, we split patches with such components by Wavelet Transform and exponential moving average Vision Transformer (ViT), then empower ViT to take the split patches as auxiliary input. Second, to obtain semantically compact and discriminative high-frequency representations of the same identity, we propose Multimodal Prototypical Contrast. To be specific, it hierarchically captures comprehensive semantics of different modal instances, facilitating the aggregation of high-frequency representations belonging to the same identity. With it, ViT can capture key high-frequency components during inference without relying on ProtoHPE, thus bringing no extra complexity. Extensive experiments validate the effectiveness of ProtoHPE.

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

  • Wei Ji
  • Xiangyan Liu
  • An Zhang
  • Yinwei Wei
  • Yongxin Ni
  • Xiang Wang

Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models that solely rely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines the disparities in information fusion at different stages. Empirical experiments are conducted to demonstrate the need to design a framework suitable for collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature interaction and mutual learning among multi-source input (ID, text, and image), while avoiding conflicts among different features during training, thereby improving recommendation accuracy. To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Secondly, we employ an online distillation training strategy in the prediction optimization stage to make multi-source data learn from each other and improve prediction robustness. Experimental results on a stream media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, yielding an improvement of approximately 10% in performance over baseline models. Our code will be released at: https://github.com/xyliugo/ODMT.

Zero-shot Micro-video Classification with Neural Variational Inference in Graph Prototype Network

  • Junyang Chen
  • Jialong Wang
  • Zhijiang Dai
  • Huisi Wu
  • Mengzhu Wang
  • Qin Zhang
  • Huan Wang

Micro-video classification plays a central role in online content recommendation platforms, such as Kwai and Tik-Tok. Existing works on video classification largely exploit the interactions between users and items as well as the item labels to provide quality recommendation services. However, scarce or even no labeled data for emerging videos poses a great challenge to existing classification methods. In this paper, we propose a zero-shot micro-video classification model (NVIGPN) that exploits the hidden topics behind items to guide representation learning on user-item interactions. Specifically, we study this zero-shot classification in two stages: (1) exploiting generalized semantic hidden topic descriptions for transferable knowledge learning, and (2) designing a graph-based learning model that propagates information from the few seen classes to the unseen ones. By mining the transferable knowledge between the hidden topics and the small number of seen classes, NVIGPN can achieve state-of-the-art performance in predicting the unseen classes of micro-videos. We conduct extensive experiments to demonstrate the effectiveness of our method.

Joint Searching and Grounding: Multi-Granularity Video Content Retrieval

  • Zhiguo Chen
  • Xun Jiang
  • Xing Xu
  • Zuo Cao
  • Yijun Mo
  • Heng Tao Shen

Text-based video retrieval (TVR) is a well-studied task aimed at retrieving relevant videos from a large collection in response to a given text query. Most existing TVR works assume that videos are already trimmed and fully relevant to the query, thus ignoring that most videos in real-world scenarios are untrimmed and contain massive irrelevant video content. Moreover, as users' queries are only relevant to video events rather than complete videos, it is also more practical to provide specific video events rather than an untrimmed video list. In this paper, we introduce a challenging but more realistic task called Multi-Granularity Video Content Retrieval (MGVCR), which involves retrieving both video files and specific video content with their temporal locations. This task presents significant challenges since it requires identifying and ranking the partial relevance between long videos and text queries under the lack of temporal alignment supervision between the query and relevant moments. To this end, we propose a novel unified framework, termed Joint Searching and Grounding (JSG). It consists of two branches: (1) a glance branch that coarsely aligns the query and moment proposals using inter-video contrastive learning, and (2) a gaze branch that finely aligns the two modalities using both inter- and intra-video contrastive learning. Based on the glance-to-gaze design, our JSG method learns two separate joint embedding spaces for moments and text queries using a hybrid synergistic contrastive learning strategy. Extensive experiments on three public benchmarks, i.e., Charades-STA, DiDeMo, and ActivityNet-Captions demonstrate the superior performance of our JSG method on both video-level retrieval and event-level retrieval subtasks. Our open-source implementation code is available at https://github.com/CFM-MSG/Code_JSG.

Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems

  • Yuyuan Li
  • Chaochao Chen
  • Xiaolin Zheng
  • Yizhao Zhang
  • Zhongxuan Han
  • Dan Meng
  • Jun Wang

With the growing privacy concerns in recommender systems, recommendation unlearning, i.e., forgetting the impact of specific learned targets, is getting increasing attention. Existing studies predominantly use training data, i.e., model inputs, as the unlearning target. However, we find that attackers can extract private information, i.e., gender, race, and age, from a trained model even if it has not been explicitly encountered during training. We name this unseen information as attribute and treat it as the unlearning target. To protect the sensitive attribute of users, Attribute Unlearning (AU) aims to degrade attacking performance and make target attributes indistinguishable. In this paper, we focus on a strict but practical setting of AU, namely Post-Training Attribute Unlearning (PoT-AU), where unlearning can only be performed after the training of the recommendation model is completed. To address the PoT-AU problem in recommender systems, we design a two-component loss function that consists of i) distinguishability loss: making attribute labels indistinguishable from attackers, and ii) regularization loss: preventing drastic changes in the model that result in a negative impact on recommendation performance. Specifically, we investigate two types of distinguishability measurements, i.e., user-to-user and distribution-to-distribution. We use the stochastic gradient descent algorithm to optimize our proposed loss. Extensive experiments on three real-world datasets demonstrate the effectiveness of our proposed methods.
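
A rough sketch of the two-component PoT-AU objective follows, using a user-to-user style distinguishability term (pulling per-attribute centroids together) and an L2 regularizer toward the already-trained embeddings; both concrete choices are illustrative assumptions rather than the paper's exact losses.

```python
import torch

def pot_au_loss(user_emb: torch.Tensor, attr_labels: torch.Tensor,
                original_emb: torch.Tensor, reg_weight: float = 1.0):
    """user_emb: (N, d) current user embeddings; attr_labels: (N,) sensitive attribute labels;
    original_emb: (N, d) embeddings of the already-trained model (kept fixed).
    Assumes at least two attribute groups so the pairwise-distance term is non-empty."""
    groups = [user_emb[attr_labels == a].mean(dim=0) for a in torch.unique(attr_labels)]
    centroids = torch.stack(groups)
    # Distinguishability loss: pull per-attribute centroids together so an attacker
    # cannot separate users by attribute.
    dist_loss = torch.pdist(centroids).mean()
    # Regularization loss: do not drift far from the trained model, preserving recommendation quality.
    reg_loss = (user_emb - original_emb).pow(2).mean()
    return dist_loss + reg_weight * reg_loss
```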

Prior-Guided Accuracy-Bias Tradeoff Learning for CTR Prediction in Multimedia Recommendation

  • Dugang Liu
  • Yang Qiao
  • Xing Tang
  • Liang Chen
  • Xiuqiang He
  • Zhong Ming

Although debiasing in multimedia recommendation has shown promising results, most existing work relies on the ability of the model itself to fully disentangle the biased and unbiased information and considers arbitrarily removing all the biases. However, in many business scenarios, it is usually possible to extract a subset of features associated with the biases by means of expert knowledge, i.e., the confounding proxy features. Therefore, in this paper, we propose a novel debiasing framework with confounding proxy priors for accuracy-bias tradeoff learning in multimedia recommendation, or CP2Rec for short, in which these confounding proxy features driven by expert experience are integrated into the model as prior knowledge corresponding to the biases. Specifically, guided by these priors, we use a bias disentangling module with orthogonal constraints to force the model to avoid encoding biased information in the feature embeddings. We then introduce an auxiliary unbiased loss to synergize with the original biased loss in an accuracy-bias tradeoff module, aiming at recovering the beneficial bias information from the above-purified feature embeddings to achieve a more reasonable accuracy-bias tradeoff recommendation. Finally, we conduct extensive experiments on a public dataset and a product dataset to verify the effectiveness of CP2Rec. In addition, CP2Rec is also deployed on a large-scale financial multimedia recommendation platform in China and achieves a sustained performance gain.

GoRec: A Generative Cold-start Recommendation Framework

  • Haoyue Bai
  • Min Hou
  • Le Wu
  • Yonghui Yang
  • Kun Zhang
  • Richang Hong
  • Meng Wang

Multimedia-based recommendation models learn user and item preference representations by fusing both user-item collaborative signals and multimedia content signals. In real scenarios, cold items appear at the test stage without any user interaction records. Cold item recommendation is challenging because the training items and test items have different data distributions. Since the hybrid preference representations contain auxiliary collaborative signals, current solutions design alignment functions to transfer the learned hybrid preference representations to cold items. Despite their effectiveness, we argue that they are still limited, as these models rely heavily on manually and carefully designed alignment functions, which are easily influenced by the limited item records and noise in the training data.

To tackle the above limitations, we propose a Generative cold-start Recommendation (GoRec) framework for multimedia-based new item recommendation. Specifically, we design a Conditional Variational AutoEncoder~(CVAE) based method that first estimates the underlying distribution of each warm item conditioned on the multimedia content representation. Then, we propose a uniformity-enhanced optimization objective to ensure the latent space of CVAE is more distinguishable and informative. In the inference stage, a generative approach is designed to obtain warm-up new item representations from the latent distribution. Please note that GoRec is applicable to arbitrary recommendation backbones. Extensive experiments on three real datasets and various recommendation backbones verify the superiority of our proposed framework. The code is available at https://github.com/HaoyueBai98/GoRec.

Prototype-guided Knowledge Transfer for Federated Unsupervised Cross-modal Hashing

  • Jingzhi Li
  • Fengling Li
  • Lei Zhu
  • Hui Cui
  • Jingjing Li

Although deep cross-modal hashing methods have recently shown superiority for cross-modal retrieval, there is a concern about potential data privacy leakage when training the models. Federated learning adopts a distributed machine learning strategy, which can collaboratively train models without leaking local private data. It is a promising technique to support privacy-preserving cross-modal hashing. However, existing federated learning-based cross-modal retrieval methods usually rely on a large number of semantic annotations, which limits the scalability of the retrieval models. Furthermore, they mostly update the global models by aggregating local model parameters, ignoring the differences in the quantity and category of multi-modal data from multiple clients. To address these issues, we propose a Prototype Transfer-based Federated Unsupervised Cross-modal Hashing (PT-FUCH) method for solving the privacy leakage problem in cross-modal retrieval model learning. PT-FUCH protects local private data by exploring unified global prototypes for different clients, without relying on any semantic annotations. Global prototypes are used to guide the local cross-modal hash learning and promote the alignment of the feature space, thereby alleviating the model bias caused by the difference in the distribution of local multi-modal data and improving the retrieval accuracy. Additionally, we design an adaptive cross-modal knowledge distillation to transfer valuable semantic knowledge from modal-specific global models to local prototype learning processes, reducing the risk of overfitting. Experimental results on three benchmark cross-modal retrieval datasets validate that our PT-FUCH method can achieve outstanding retrieval performance when trained under distributed privacy-preserving mode. The source codes of our method are available at https://github.com/exquisite1210/PT-FUCH_P.

SESSION: Oral Session VI: Engaging Users with Multimedia -- Interactions and Quality of Experience

EAT: An Enhancer for Aesthetics-Oriented Transformers

  • Shuai He
  • Anlong Ming
  • Shuntian Zheng
  • Haobin Zhong
  • Huadong Ma

Transformers have shown great potential in various vision tasks, but none of them have surpassed the best CNN model on image aesthetics assessment (IAA) tasks. IAA is a challenging task in multimedia systems that requires attention to both foreground and background, as well as robustness to noisy and redundant labels. The global and dense attention mechanism of Transformers, designed for saliency-oriented tasks, may miss important aesthetic information in the background, increase the computational cost, and slow down the convergence on IAA tasks. To address these issues, we propose an Enhancer for Aesthetics-Oriented Transformers (EAT). EAT uses a deformable, sparse and data-dependent attention mechanism that learns where to focus and how to refine attention by offsets. EAT also guides the offsets to balance the attention between foreground and background according to dedicated rules. Our EAT-enhanced Transformers outperform the previous methods on four representative datasets with fewer training epochs. Code is available at https://github.com/woshidandan/Image-Aesthetics-Assessment.

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons

  • Sicheng Yang
  • Zilin Wang
  • Zhiyong Wu
  • Minglei Li
  • Zhensong Zhang
  • Qiaochu Huang
  • Lei Hao
  • Songcen Xu
  • Xiaofei Wu
  • Changpeng Yang
  • Zonghong Dai

The automatic co-speech gesture generation draws much attention in computer animation. Previous works designed network structures on individual datasets, which resulted in a lack of data volume and generalizability across different motion capture standards. In addition, it is a challenging task due to the weak correlation between speech and gestures. To address these problems, we present UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons. Specifically, we first present a retargeting network to learn latent homeomorphic graphs for different motion capture standards, unifying the representations of various gestures while extending the dataset. We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention to generate better speech-matched and realistic gestures. To further align speech and gesture and increase diversity, we incorporate reinforcement learning on the discrete gesture units with a learned reward function. Extensive experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach

  • Haoning Wu
  • Erli Zhang
  • Liang Liao
  • Chaofeng Chen
  • Jingwen Hou
  • Annan Wang
  • Wenxiu Sun
  • Qiong Yan
  • Weisi Lin

The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging as it could be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors is still obscure, hindering VQA methods from more concrete quality evaluations (e.g. sharpness of a video). To solve this problem, we collect over two million opinions on 4,543 in-the-wild videos on 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g. motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences on semantic contents and aesthetic issues (e.g. composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to label among a positive, a negative, and a neutral choice for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to more comprehensively analyze their strengths and weaknesses. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture important quality issues as observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions, and superb generalization ability on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.
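The general idea of language-prompted quality scoring can be illustrated with a small CLIP-based sketch; this is not the MaxVQA model itself, the prompt wording is invented, and OpenAI's `clip` package is assumed to be installed:

```python
import torch
import clip  # https://github.com/openai/CLIP (assumed available)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One positive/negative prompt pair per quality dimension (illustrative wording).
prompts = clip.tokenize(["a sharp, clear video frame",
                         "a blurry, degraded video frame"]).to(device)

@torch.no_grad()
def sharpness_score(frames):
    """frames: list of preprocessed frame tensors of shape (3, H, W)."""
    imgs = torch.stack(frames).to(device)
    img_feat = model.encode_image(imgs).float()
    txt_feat = model.encode_text(prompts).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.t()      # (T, 2) image-text similarities
    probs = logits.softmax(dim=-1)[:, 0]          # probability of the positive prompt
    return probs.mean().item()                    # average over frames
```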

Sketch Input Method Editor: A Comprehensive Dataset and Methodology for Systematic Input Recognition

  • Guangming Zhu
  • Siyuan Wang
  • Qing Cheng
  • Kelong Wu
  • Hao Li
  • Liang Zhang

With the recent surge in the use of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. While previous research has focused on tasks such as recognition, retrieval, and generation of familiar everyday objects, this study aims to create a Sketch Input Method Editor (SketchIME) specifically designed for a professional Command, Control, Communications, Computer, and Intelligence (C4I) system. Within this system, sketches are utilized as low-fidelity prototypes for recommending standardized symbols in the creation of comprehensive situation maps. This paper also presents a systematic dataset comprising 374 specialized sketch types, and proposes a simultaneous recognition and segmentation architecture with multilevel supervision between recognition and segmentation to improve performance and enhance interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly enhanced. Results from experiments conducted on both the proposed dataset and the SPG dataset illustrate the superior performance of the proposed architecture. Our dataset and code are publicly available at https://github.com/GuangmingZhu/SketchIME.

StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability

  • Tengchuan Kou
  • Xiaohong Liu
  • Wei Sun
  • Jun Jia
  • Xiongkuo Min
  • Guangtao Zhai
  • Ning Liu

Video shakiness is an unpleasant distortion of User Generated Content (UGC) videos, which is usually caused by the unstable hold of cameras. In recent years, many video stabilization algorithms have been proposed, yet no specific and accurate metric enables comprehensively evaluating the stability of videos. Indeed, most existing quality assessment models evaluate video quality as a whole without specifically taking the subjective experience of video stability into consideration. Therefore, these models cannot measure the video stability explicitly and precisely when severe shakes are present. In addition, there is no large-scale video database in public that includes various degrees of shaky videos with the corresponding subjective scores available, which hinders the development of Video Quality Assessment for Stability (VQA-S). To this end, we build a new database named StableDB that contains 1,952 diversely-shaky UGC videos, where each video has a Mean Opinion Score (MOS) on the degree of video stability rated by 34 subjects. Moreover, we elaborately design a novel VQA-S model named StableVQA, which consists of three feature extractors to acquire the optical flow, semantic, and blur features respectively, and a regression layer to predict the final stability score. Extensive experiments demonstrate that the StableVQA achieves a higher correlation with subjective opinions than the existing VQA-S models and generic VQA models. The database and codes are available at https://github.com/QMME/StableVQA.

Spatial-angular Quality-aware Representation Learning for Blind Light Field Image Quality Assessment

  • Jianjun Xiang
  • Yuanjie Dang
  • Peng Chen
  • Ronghua Liang
  • Ruohong Huan
  • Zhengyu Zhang

Blind light field image quality assessment (BLFIQA) remains a challenging task in deep learning due to the unique spatial-angular structure of light field images (LFIs) and the lack of large-scale labeled data for training. In this work, we propose a novel BLFIQA method using spatial-angular quality-aware representation learning in a self-supervised manner. Visual content and distortion type are important factors affecting the perceived quality of LFIs. From our observation, the band-pass transform maps of LFIs with the same distortion type exhibit similar Gaussian distributions. Thus, we learn spatial-angular quality-aware representations by minimizing the distance in the embedding space between the luminance map and the band-pass transform map of the same LFI. To learn spatial-angular quality-aware representations of LFIs, we also build a large-scale unlabeled dataset containing 40k distorted LFIs with different distortion types and visual content. Further, we propose a fusion-separation-fusion network (FSFNet) to extract features representing the intrinsic spatial-angular structure of the LFI. After pre-training on the unlabeled dataset using the proposed self-supervised learning, FSFNet is employed for downstream BLFIQA tasks and achieves good performance. Experimental results show that our proposed method outperforms seventeen state-of-the-art models on the Win5-LID, NBU-LF1.0 and LFDD datasets, and achieves 3.78%, 6.61% and 4.06% SRCC improvements, respectively. The code and dataset will be publicly available at https://github.com/JianjunXiang/SSL_and_FSFNet.
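One common way to instantiate the distance-minimization objective described above is a symmetric InfoNCE loss between the two views of the same LFI; the sketch below is a generic contrastive-learning example, not necessarily the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(z_luma: torch.Tensor, z_bandpass: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: the embeddings of the luminance map and the band-pass
    map of the same light field image form the positive pair; other samples in
    the batch act as negatives.  z_*: (batch, dim)."""
    z1 = F.normalize(z_luma, dim=-1)
    z2 = F.normalize(z_bandpass, dim=-1)
    logits = z1 @ z2.t() / tau                      # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```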

Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement

  • Yunlong Dong
  • Xiaohong Liu
  • Yixuan Gao
  • Xunchu Zhou
  • Tao Tan
  • Guangtao Zhai

Recently, User Generated Content (UGC) videos have become ubiquitous in our daily lives. However, due to the limitations of photographic equipment and techniques, UGC videos often contain various degradations, among which one of the most visually unfavorable effects is underexposure. Therefore, corresponding video enhancement algorithms such as Low-Light Video Enhancement (LLVE) have been proposed to deal with this specific degradation. However, different from video enhancement algorithms, almost all existing Video Quality Assessment (VQA) models are built generally rather than specifically, measuring the quality of a video from a comprehensive perspective. To the best of our knowledge, there is no VQA model specially designed for videos enhanced by LLVE algorithms. To this end, we first construct a Low-Light Video Enhancement Quality Assessment (LLVE-QA) dataset in which 254 original low-light videos are collected and then enhanced by leveraging 8 LLVE algorithms to obtain 2,060 videos in total. Moreover, we propose a quality assessment model specialized in LLVE, named Light-VQA. More concretely, since brightness and noise have the most impact on the quality of low-light enhanced videos, we handcraft the corresponding features and integrate them with deep-learning-based semantic features as the overall spatial information. As for temporal information, in addition to deep-learning-based motion features, we also investigate handcrafted brightness consistency among video frames, and the overall temporal information is their concatenation. Subsequently, spatial and temporal information is fused to obtain the quality-aware representation of a video. Extensive experimental results show that our Light-VQA achieves the best performance against the current State-Of-The-Art (SOTA) on the LLVE-QA and public datasets. The dataset and code can be found at https://github.com/wenzhouyidu/Light-VQA.
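As a toy illustration of the handcrafted spatial and temporal cues mentioned above (illustrative only; Light-VQA's actual brightness and noise features are more elaborate):

```python
import numpy as np

def brightness_features(frames: np.ndarray):
    """Toy handcrafted cues: per-frame mean brightness as a spatial feature and
    frame-to-frame brightness differences as a temporal-consistency feature.
    frames: (T, H, W, 3) array of RGB frames."""
    gray = frames.astype(np.float32).mean(axis=-1)              # (T, H, W) luminance proxy
    per_frame = gray.reshape(gray.shape[0], -1).mean(axis=1)    # (T,) mean brightness
    consistency = np.abs(np.diff(per_frame))                    # (T-1,) temporal variation
    return per_frame, consistency
```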

Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment

  • Kun Yuan
  • Zishang Kong
  • Chuanchuan Zheng
  • Ming Sun
  • Xing Wen

Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted rising attention with the rapid development of streaming media platforms such as Facebook, TikTok, and Kwai. Compared with other sequence-based visual tasks (e.g., action recognition), VQA faces two underestimated challenges that remain unresolved in User Generated Content (UGC) videos. First, it is not rare that several frames containing serious distortions (e.g., blocking, blurriness) can determine the perceptual quality of the whole video, while other sequence-based tasks require more frames of equal importance for representation. Second, the perceptual quality of a video exhibits a multi-distortion distribution, due to the differences in the duration and probability of occurrence of various distortions. To address these challenges, we propose the Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently. Methodologically, a Sparse Temporal Attention (STA) module is proposed to sample keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from O(T^2) to O(T log T). Structurally, a Multi-Pathway Temporal Network (MPTN) utilizes multiple STA modules with different degrees of sparsity in parallel, capturing co-existing distortions in a video. Experimentally, VQT demonstrates superior performance to many state-of-the-art methods on three public no-reference VQA datasets. Furthermore, VQT shows better performance on four full-reference VQA datasets against widely adopted industrial algorithms (e.g., VMAF and AVQT).
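A toy keyframe sampler in the spirit of sparse temporal attention is sketched below; it is not the authors' STA module, and the deviation-based scoring rule is an assumption made purely for illustration:

```python
import torch
import torch.nn.functional as F

def sample_keyframes(frame_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Toy keyframe sampler: score each frame by how much it deviates from its
    temporal neighbours (a crude stand-in for quality-relevant frames), then
    keep the top-k indices.  frame_feats: (T, D)."""
    prev = torch.roll(frame_feats, shifts=1, dims=0)
    nxt = torch.roll(frame_feats, shifts=-1, dims=0)
    neigh = F.normalize((prev + nxt) / 2, dim=-1)     # average of the two neighbours
    cur = F.normalize(frame_feats, dim=-1)
    score = 1.0 - (cur * neigh).sum(dim=-1)           # (T,) cosine dissimilarity
    topk = torch.topk(score, k=min(k, score.numel())).indices
    return torch.sort(topk).values                    # keep temporal order
```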

Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction

  • Kaiyuan Hu
  • Haowen Yang
  • Yili Jin
  • Junhua Liu
  • Yongting Chen
  • Miao Zhang
  • Fangxin Wang

Volumetric video has emerged as an attractive new video paradigm in recent years, since it provides an immersive and interactive 3D viewing experience with six degrees of freedom (6DoF). Unlike traditional 2D or panoramic videos, volumetric videos require dense point clouds, voxels, meshes, or huge neural models to depict volumetric scenes, which results in a prohibitively high bandwidth burden for video delivery. User behavior analysis, especially viewport and gaze analysis, therefore plays a significant role in prioritizing the content streamed within users' viewports and degrading the remaining content to maximize user QoE with limited bandwidth. Although understanding user behavior is crucial, to the best of our knowledge, there are no available 3D volumetric video viewing datasets containing fine-grained user interactivity features, not to mention further analysis and behavior prediction.

In this paper, we for the first time release a volumetric video viewing behavior dataset, with a large scale, multiple dimensions, and diverse conditions. We conduct an in-depth analysis to understand user behaviors when viewing volumetric videos. Interesting findings on user viewport, gaze, and motion preference related to different videos and users are revealed. We finally design a transformer-based viewport prediction model that fuses the features of both gaze and motion, which is able to achieve high accuracy at various conditions. Our prediction model is expected to further benefit volumetric video streaming optimization.

Our dataset, along with its corresponding visualization tools and prediction models, is accessible at https://cuhksz-inml.github.io/user-behavior-in-vv-watching/

AesCLIP: Multi-Attribute Contrastive Learning for Image Aesthetics Assessment

  • Xiangfei Sheng
  • Leida Li
  • Pengfei Chen
  • Jinjian Wu
  • Weisheng Dong
  • Yuzhe Yang
  • Liwu Xu
  • Yaqian Li
  • Guangming Shi

Image aesthetics assessment (IAA) aims at predicting the aesthetic quality of images. Recently, large pre-trained vision-language models, like CLIP, have shown impressive performances on various visual tasks. When it comes to IAA, a straightforward way is to finetune the CLIP image encoder using aesthetic images. However, this can only achieve limited success without considering the uniqueness of multimodal data in the aesthetics domain. People usually assess image aesthetics according to fine-grained visual attributes, e.g., color, light and composition. However, how to learn aesthetics-aware attributes from CLIP-based semantic space has not been addressed before. With this motivation, this paper presents a CLIP-based multi-attribute contrastive learning framework for IAA, dubbed AesCLIP. Specifically, AesCLIP consists of two major components, i.e., aesthetic attribute-based comment classification and attribute-aware learning. The former classifies the aesthetic comments into different attribute categories. Then the latter learns an aesthetic attribute-aware representation by contrastive learning, aiming to mitigate the domain shift from the general visual domain to the aesthetics domain. Extensive experiments have been done by using the pre-trained AesCLIP on four popular IAA databases, and the results demonstrate the advantage of AesCLIP over the state-of-the-arts. The source code will be public at https://github.com/OPPOMKLab/AesCLIP.

SESSION: Oral Session VII: Engaging Users with Multimedia -- Metaverse, Art and Culture

Feeling Present! From Physical to Virtual Cinematography Lighting Education with Metashadow

  • Zheng Wei
  • Xian Xu
  • Lik-Hang Lee
  • Wai Tong
  • Huamin Qu
  • Pan Hui

The high cost and limited availability of soundstages for cinematography lighting education pose significant challenges for art institutions. Traditional teaching methods, combining basic lighting equipment operation with slide lectures, often yield unsatisfactory results, hindering students' mastery of cinematography lighting techniques. Therefore, we propose Metashadow, a virtual reality (VR) cinematography lighting education system demonstrating the feasibility of learning in a virtual soundstage. Based on the presence theory, Metashadow features high-fidelity lighting devices that enable users to adjust multiple parameters, providing a quantifiable learning approach. We evaluated Metashadow with 24 participants and found that it provides better learning outcomes than traditional teaching methods regarding presence, collaboration, usability, realism, creativity, and flexibility. Six experts also praised the Metashadow's expressiveness and its learning outcomes. Our study demonstrates the potential of VR technology to enhance cinematography lighting education while imposing a smaller cost burden and space requirement.

Automatic Generation of Commercial Scenes

  • Shao-Kui Zhang
  • Jia-Hong Liu
  • Yike Li
  • Tianyi Xiong
  • Ke-Xin Ren
  • Hongbo Fu
  • Song-Hai Zhang

Commercial scenes such as markets and shops are everyday scenes in both virtual environments and real-world interior design. However, the existing literature on interior scene synthesis mainly focuses on formulating and optimizing residential scenes such as bedrooms and living rooms. It typically presents a set of relations among objects and recognizes each furniture object as the smallest unit while optimizing a residential room. However, object relations become less critical in commercial scenes, since shelves are often placed next to each other, so pre-calculated relations of objects are less needed. Instead, interior designers resort to evaluating how groups of objects perform in commercial scenes, i.e., the smallest unit to be evaluated is a group of objects. This paper presents a system that automatically synthesizes market-like commercial scenes in virtual environments. Following the rules of commercial layout design, we parameterize groups of objects as "patterns" contributing to a scene. Each pattern directly yields a human-centric routine locally, provides potential connectivity with other routines, and derives the arrangements of objects concerning itself according to the assigned parameters. To optimize a scene, the patterns are iteratively multiplexed to insert new routines or modify existing ones under a set of constraints derived from commercial layout designs. Through extensive experiments, we demonstrate the ability of our framework to generate plausible and practical commercial scenes.

Control3D: Towards Controllable Text-to-3D Generation

  • Yang Chen
  • Yingwei Pan
  • Yehao Li
  • Ting Yao
  • Tao Mei

Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: to interactively control and shape the synthetic 3D content according to users' desired specifications (e.g., a sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the rendered image of the synthetic 3D scene. The estimated sketch for each sampled view is further enforced to be geometrically consistent with the given sketch, pursuing better controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.
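The sketch-consistency idea can be illustrated with a minimal loss term, assuming a pretrained differentiable photo-to-sketch network is available; this is a sketch of the concept, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def sketch_consistency_loss(rendered_view, photo_to_sketch, target_sketch):
    """Illustrative consistency term: an (assumed pretrained, differentiable)
    photo-to-sketch network maps the rendered view to a sketch, which is pushed
    toward the user-provided sketch with an L1 penalty."""
    est = photo_to_sketch(rendered_view)                       # (B, 1, H, W) estimated sketch
    tgt = F.interpolate(target_sketch, size=est.shape[-2:],
                        mode="bilinear", align_corners=False)  # match resolution
    return F.l1_loss(est, tgt)
```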

Reconnecting the Broken Civilization: Patchwork Integration of Fragments from Ancient Manuscripts

  • Yuqing Zhang
  • Zhou Fang
  • Xinyu Yang
  • Shengyu Zhang
  • Baoyi He
  • Huaiyong Dou
  • Junchi Yan
  • Yongquan Zhang
  • Fei Wu

The rich tapestry of human history is often painstakingly pieced together from ancient manuscripts, serving as resilient time capsules of cultural heritage, societal norms, religious tenets, and quotidian life. Unfortunately, the ravages of time, careless preservation, and varied forms of degradation frequently leave us with fragmented relics. The traditional process of reconstructing these fragments is an arduous task, demanding exhaustive manual intervention and a global collaboration among archaeologists. This paper presents a transformative approach to this challenge, harnessing multi-media techniques to restore the connectable fragments of the invaluable Dunhuang scrolls. We curate a unique multimodal dataset of the fragmented Dunhuang manuscripts and architect an innovative three-tiered pipeline to reconstruct these historical scrolls. Our initial stage uses a text-based localization strategy, filtering fragment pairs through text comparison. We then employ a novel self-supervised contour-based pairwise matching framework to overcome the hurdle of limited labeled pairing samples. This process is powered by data augmentation techniques and a Siamese network which determines the most compatible matches. The final stage in our pipeline globally reconstructs the selected fragment pairs with hierarchical clustering, bringing us closer to the original grandeur of the Dunhuang scrolls. Our empirical evaluations reveal that this pipeline exhibits a remarkable success rate in fragment assembly. By addressing this cross-disciplinary challenge, our dataset and pipeline not only contribute to the field of multi-media artificial intelligence but also hold profound implications for sociocultural studies and future explorations into the understanding of ancient cultural history.

SESSION: Oral Session VIII: Engaging Users with Multimedia -- Multimedia Applications

Cal-SFDA: Source-Free Domain-adaptive Semantic Segmentation with Differentiable Expected Calibration Error

  • Zixin Wang
  • Yadan Luo
  • Zhi Chen
  • Sen Wang
  • Zi Huang

The prevalence of domain adaptive semantic segmentation has prompted concerns regarding source domain data leakage, where private information from the source domain could inadvertently be exposed in the target domain. To circumvent the requirement for source data, source-free domain adaptation has emerged as a viable solution that leverages self-training methods to pseudo-label high-confidence regions and adapt the model to the target data. However, the confidence scores obtained are often highly biased due to overconfidence and class-imbalance issues, which render both model selection and optimization problematic. In this paper, we propose a novel calibration-guided source-free domain adaptive semantic segmentation (Cal-SFDA) framework. The core idea is to estimate the expected calibration error (ECE) from the segmentation predictions, serving as a strong indicator of the model's generalization capability to the unlabeled target domain. The estimated ECE scores, in turn, assist the model training and fair selection in both source training and target adaptation stages. During model pre-training on the source domain, we ensure the differentiability of the ECE objective by leveraging the LogSumExp trick and using ECE scores to select the best source checkpoints for adaptation. To enable ECE estimation on the target domain without requiring labels, we train a value net for ECE estimation and apply statistic warm-up on its BatchNorm layers for stability. The estimated ECE scores assist in determining the reliability of prediction and enable class-balanced pseudo-labeling by positively guiding the adaptation progress and inhibiting potential error accumulation. Extensive experiments on two widely-used synthetic-to-real transfer tasks show that the proposed approach surpasses previous state-of-the-art by up to 5.25% of mIoU with fair model selection criteria.
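To make the differentiable-ECE idea concrete, the sketch below replaces the hard max-confidence with a LogSumExp-based smooth surrogate; it is illustrative only and does not reproduce the paper's exact formulation or its value-net-based target-domain estimator:

```python
import torch

def differentiable_ece(logits: torch.Tensor, labels: torch.Tensor,
                       n_bins: int = 10, tau: float = 0.5) -> torch.Tensor:
    """Illustrative differentiable ECE estimate.  The hard max softmax probability
    is replaced by a LogSumExp-based smooth surrogate so gradients can flow
    through the confidence term; bin membership and accuracy are detached."""
    # exp(tau * LSE(z / tau) - LSE(z))  ->  max_k softmax_k(z)  as tau -> 0
    smooth_conf = torch.exp(tau * torch.logsumexp(logits / tau, dim=-1)
                            - torch.logsumexp(logits, dim=-1))
    pred = logits.argmax(dim=-1)
    correct = (pred == labels).float()

    bins = torch.linspace(0.0, 1.0, n_bins + 1, device=logits.device)
    ece = logits.new_zeros(())
    conf_const = smooth_conf.detach()
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf_const > lo) & (conf_const <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()
            ece = ece + weight * (correct[in_bin].mean()
                                  - smooth_conf[in_bin].mean()).abs()
    return ece
```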

Frequency Perception Network for Camouflaged Object Detection

  • Runmin Cong
  • Mengyao Sun
  • Sanyi Zhang
  • Xiaofei Zhou
  • Wei Zhang
  • Yao Zhao

Camouflaged object detection (COD) aims to accurately detect objects hidden in the surrounding environment. However, existing COD methods mainly locate camouflaged objects in the RGB domain, and their performance has not been fully exploited in many challenging scenarios. Considering that the features of the camouflaged object and the background are more discriminative in the frequency domain, we propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage. With the multi-level features extracted by the backbone, we design a flexible frequency perception module based on octave convolution for coarse positioning. Then, we design a correction fusion module to integrate the high-level features step by step through prior-guided correction and cross-layer feature channel association, and finally combine them with the shallow features to achieve detailed correction of the camouflaged objects. Compared with currently existing models, our proposed method achieves competitive performance on three popular benchmark datasets both qualitatively and quantitatively. The code will be released at https://github.com/rmcong/FPNet_ACMMM23.

SepMark: Deep Separable Watermarking for Unified Source Tracing and Deepfake Detection

  • Xiaoshuai Wu
  • Xin Liao
  • Bo Ou

Malicious Deepfakes have led to a sharp conflict over distinguishing between genuine and forged faces. Although many countermeasures have been developed to detect Deepfakes ex post, passive forensics has undoubtedly not considered any preventive measures for the pristine face before foreseeable manipulations. To complete this forensics ecosystem, we thus put forward a proactive solution dubbed SepMark, which provides a unified framework for source tracing and Deepfake detection. SepMark originates from encoder-decoder-based deep watermarking but with two separable decoders. As the first deep separable watermarking scheme, SepMark brings a new paradigm to the established study of deep watermarking, where a single encoder embeds one watermark elegantly, while two decoders can extract the watermark separately at different levels of robustness. The robust decoder, termed Tracer, resists various distortions and may have an overly high level of robustness, allowing the watermark to survive both before and after a Deepfake. The semi-robust decoder, termed Detector, is selectively sensitive to malicious distortions, making the watermark disappear after a Deepfake. Only SepMark, comprising Tracer and Detector, can reliably trace the trusted source of the marked face and detect whether it has been altered since being marked; neither of the two alone can achieve this. Extensive experiments demonstrate the effectiveness of the proposed SepMark on typical Deepfakes, including face swapping, expression reenactment, and attribute editing. Code will be available at https://github.com/sh1newu/SepMark.
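A structural skeleton of the one-encoder/two-decoder layout is sketched below; the layer choices, message length, and embedding rule are placeholders rather than SepMark's actual architecture or distortion layers:

```python
import torch
import torch.nn as nn

class SeparableWatermarker(nn.Module):
    """Skeleton of a one-encoder / two-decoder watermarking layout:
    Tracer is trained to be robust, Detector to be semi-robust."""
    def __init__(self, msg_len: int = 30):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3 + msg_len, 32, 3, padding=1),
                                     nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))
        self.tracer = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(32, msg_len))    # robust decoder
        self.detector = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, msg_len))  # semi-robust decoder

    def embed(self, image: torch.Tensor, message: torch.Tensor) -> torch.Tensor:
        # broadcast the message over spatial dims and fuse it with the image
        b, _, h, w = image.shape
        msg_map = message[:, :, None, None].expand(b, -1, h, w)
        return image + self.encoder(torch.cat([image, msg_map], dim=1))
```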

SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection

  • Runmin Cong
  • Yuchen Guan
  • Jinpeng Chen
  • Wei Zhang
  • Yao Zhao
  • Sam Kwong

Despite significant progress in shadow detection, current methods still struggle with the adverse impact of background color, which may lead to errors when shadows are present on complex backgrounds. Drawing inspiration from the human visual system, we treat the input shadow image as a composition of a background layer and a shadow layer, and design a Style-guided Dual-layer Disentanglement Network (SDDNet) to model these layers independently. To achieve this, we devise a Feature Separation and Recombination (FSR) module that decomposes multi-level features into shadow-related and background-related components by offering specialized supervision for each component, while preserving information integrity and avoiding redundancy through the reconstruction constraint. Moreover, we propose a Shadow Style Filter (SSF) module to guide the feature disentanglement by focusing on style differentiation and uniformization. With these two modules and our overall pipeline, our model effectively minimizes the detrimental effects of background color, yielding superior performance on three public datasets with a real-time inference speed of 32 FPS. Our code is publicly available at: https://github.com/rmcong/SDDNet_ACMMM23.

High-Order Tensor Recovery Coupling Multilayer Subspace Priori with Application in Video Restoration

  • Hao Tan
  • Weichao Kong
  • Feng Zhang
  • Wenjin Qin
  • Jianjun Wang

In the real world, a large amount of high-order tensor data (order > 3) exists, such as color videos, multispectral videos, and light-field images. However, these data often face challenges in transmission and storage and are susceptible to damage. Meanwhile, most existing tensor-based information processing methods concentrate only on third-order tensors, which may not meet the complex requirements of high-dimensional data processing. In this paper, to better address the high-order tensor recovery issue, we propose a novel method that couples multilayer subspace priors with high-order tensor recovery techniques for tensor completion and robust tensor principal component analysis. Moreover, we provide theoretical recovery guarantees for our approach and demonstrate that it achieves comparable performance under weaker incoherence conditions. Additionally, we develop two efficient and interpretable algorithms based on the alternating direction method of multipliers (ADMM) to solve our model. Owing to the adaptability of the subspace prior information, our method demonstrates superior performance in recovering various types of data, including color videos and multispectral videos, compared with various currently available advanced algorithms.

Digging into Depth Priors for Outdoor Neural Radiance Fields

  • Chen Wang
  • Jiadai Sun
  • Lina Liu
  • Chenming Wu
  • Zhelun Shen
  • Dayan Wu
  • Yuchao Dai
  • Liangjun Zhang

Neural Radiance Fields (NeRFs) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse-viewpoint setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate the issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated. Moreover, how best to apply the selected depth priors is also an unexplored problem. In this paper, we provide a comprehensive study and evaluation of employing depth priors for outdoor neural radiance fields, covering common depth sensing technologies and most ways of applying them. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can potentially benefit practitioners and researchers in training their NeRF models with depth priors. Project page: https://cwchenwang.github.io/outdoor-nerf-depth

ECENet: Explainable and Context-Enhanced Network for Multi-modal Fact Verification

  • Fanrui Zhang
  • Jiawei Liu
  • Qiang Zhang
  • Esther Sun
  • Jingyi Xie
  • Zheng-Jun Zha

Recently, falsified claims incorporating both text and images have been disseminated more effectively than those containing text alone, raising significant concerns for multi-modal fact verification. Existing research makes contributions to multi-modal feature extraction and interaction, but fails to fully utilize and enhance the valuable and intricate semantic relationships between distinct features. Moreover, most detectors merely provide a single outcome judgment and lack an inference process or explanation. Taking these factors into account, we propose a novel Explainable and Context-Enhanced Network (ECENet) for multi-modal fact verification, making the first attempt to integrate multi-clue feature extraction, multi-level feature reasoning, and justification (explanation) generation within a unified framework. Specifically, we propose an Improved Coarse- and Fine-grained Attention Network, equipped with two types of level-grained attention mechanisms, to facilitate a comprehensive understanding of contextual information. Furthermore, we propose a novel justification generation module via deep reinforcement learning that does not require additional labels. In this module, a sentence extractor agent measures the importance between the query claim and all document sentences at each time step, selecting a suitable amount of high-scoring sentences to be rewritten as the explanation of the model. Extensive experiments demonstrate the effectiveness of the proposed method.

Client-Adaptive Cross-Model Reconstruction Network for Modality-Incomplete Multimodal Federated Learning

  • Baochen Xiong
  • Xiaoshan Yang
  • Yaguang Song
  • Yaowei Wang
  • Changsheng Xu

Multimodal federated learning (MFL) is an emerging field that allows many distributed clients, each with multimodal data, to work together to train models targeting multimodal tasks without sharing local data. However, existing methods assume that all modalities of each sample are complete, which limits their practicality. In this paper, we propose a Client-Adaptive Cross-Modal Reconstruction Network (CACMRN) to solve modality-incomplete multimodal federated learning (MI-MFL). Compared to existing centralized methods for reconstructing a missing modality, the local client data in federated learning is typically much smaller, which makes it challenging to train a reliable reconstruction model that can accurately predict missing data. We propose a cross-modal reconstruction transformer, which can prevent the model from overfitting on the local client by exploring instance-instance relationships within the local client and utilizing normalized self-attention to conduct data-dependent partial updating. Using federated optimization with alternating local updating and global aggregation, our method can not only collaboratively utilize the distributed data on different local clients to learn the cross-modal reconstruction transformer, but also prevent the reconstruction model from overfitting the data on the local client. Extensive experimental results on three datasets demonstrate the effectiveness of our method.

AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation

  • Jinpeng Lin
  • Min Zhou
  • Ye Ma
  • Yifan Gao
  • Chenxi Fei
  • Yangjian Chen
  • Zhang Yu
  • Tiezheng Ge

Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.

Filling in the Blank: Rationale-Augmented Prompt Tuning for TextVQA

  • Gangyan Zeng
  • Yuan Zhang
  • Yu Zhou
  • Bo Fang
  • Guoqing Zhao
  • Xin Wei
  • Weiping Wang

Recently, generative text-based visual question answering (TextVQA) methods, which are often based on language models, have exhibited impressive results and drawn increasing attention. However, due to inconsistencies in both input forms and optimization objectives, the power of pretrained language models is not fully explored, resulting in the need for large amounts of training data. In this work, we rethink the characteristics of the TextVQA task and find that scene text is indeed a special kind of language embedded in images. To this end, we propose a text-centered generative framework, FITB (short for Filling In The Blank), in which multimodal information is mainly represented in textual form and rationale-augmented prompting is involved. Specifically, an infilling-based prompt strategy is utilized to formulate TextVQA as a novel problem of filling in the blank with the proper scene text according to the language context. Furthermore, aiming to prevent the model from overfitting to language bias, we design a rough answer grounding module to provide visual rationales for promoting multimodal reasoning. Extensive experiments verify the superiority of FITB in both fully-supervised and zero-shot/few-shot settings. Notably, even with a saving of about 64M training data, FITB surpasses the state-of-the-art method by 3.00% and 1.99% on the TextVQA and ST-VQA datasets, respectively.
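What an infilling-style prompt for TextVQA could look like is sketched below; the template wording and the `<blank>` marker are hypothetical and may differ from FITB's actual prompts:

```python
def build_infilling_prompt(question, ocr_tokens, caption):
    """Hypothetical fill-in-the-blank prompt: visual context and scene-text
    candidates are verbalized, and the answer slot is left blank for a
    text-infilling language model to complete."""
    context = (f"Image description: {caption}. "
               f"Scene text in the image: {', '.join(ocr_tokens)}.")
    return f"{context} Question: {question} Answer: <blank>."

# illustrative usage
print(build_infilling_prompt(
    "What brand is written on the bus?",
    ["GREYHOUND", "2041"],
    "a silver bus parked on the street"))
```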

End-to-end XY Separation for Single Image Blind Deblurring

  • Liuhan Chen
  • Yirou Wang
  • Yongyong Chen

Single image blind deblurring, which exploits only a blurry observation to reconstruct the sharp image, is a popular yet challenging low-level vision task. Current state-of-the-art deblurring networks mainly follow the coarse-to-fine strategy for architecture design and utilize U-net or its variant, XYDeblur, as the basic unit. However, the one-encoder-one-decoder and the recently proposed one-encoder-two-decoder structures of these basic units both fail to comprehensively take advantage of the directional separability of 2D deblurring, which increases what the networks must learn and thus leads to performance degradation. To thoroughly decouple deblurring into two spatially orthogonal parts, we propose a novel substitute for U-net and its variant, called XYU-net. Specifically, it consists of two structurally identical U-nets, named XU-net and YU-net. They share orthogonal parameters by rotating kernels and focus on restoring a 2D blurry image in two spatially orthogonal directions respectively, which not only improves efficiency but also keeps the parameter count unchanged. To further reduce the graphics memory demand of XYU-net, we move some non-linear transform modules (NLTM) from the outside of the network to its inside and propose the modified version, called MXYU-net. Experimental results on three large blurry image datasets demonstrate the efficiency of XYU-net and MXYU-net compared with U-net and XYDeblur, both as standalone models and as basic units of advanced U-net-based deblurring networks.
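The rotated-kernel parameter sharing between the two orthogonal branches can be illustrated with a minimal module; this is a conceptual sketch, not the full XU-net/YU-net design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotSharedConv(nn.Module):
    """One weight tensor serves two orthogonal branches: the Y-branch applies
    the kernel rotated by 90 degrees, so no extra parameters are introduced."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward_x(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.weight, self.bias, padding=self.weight.shape[-1] // 2)

    def forward_y(self, x: torch.Tensor) -> torch.Tensor:
        w_rot = torch.rot90(self.weight, k=1, dims=(-2, -1))  # rotate the spatial dims
        return F.conv2d(x, w_rot, self.bias, padding=self.weight.shape[-1] // 2)
```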

SD-Net: Spatially-Disentangled Point Cloud Completion Network

  • Junxian Chen
  • Ying Liu
  • Yiqi Liang
  • Dandan Long
  • Xiaolin He
  • Ruihui Li

Point clouds obtained from 3D scanning are typically incomplete, noisy, and sparse. Previous completion methods aim to generate complete point clouds, while taking into account the densification of point clouds, filling small holes, and proximity-to-surface, all through a single network. After revisiting the task, we propose SD-Net, which disentangles the task based on the spatial characteristics of point clouds and formulates two sub-networks, a Dense Refiner and a Missing Generator. Given a partial input, the Dense Refiner produces a dense and clean point cloud as a more reliable partial surface, which assists the Missing Generator in better inferring the remaining point cloud structure. To promote the alignment and interaction across these two modules, we propose a Cross Fusion Unit with designed Non-Symmetrical Cross Transformers to capture geometric relationships between partial and missing regions, contributing to a complete, dense and well-aligned output. Extensive quantitative and qualitative results demonstrate that our method outperforms the state-of-the-art methods.

Latent-space Unfolding for MRI Reconstruction

  • Jiawei Jiang
  • Yuchao Feng
  • Jiacheng Chen
  • Dongyan Guo
  • Jianwei Zheng

To circumvent the problems caused by prolonged acquisition periods, compressed sensing MRI is widely used to accelerate the recovery of high-quality images from under-sampled k-space data. Most current solutions are dedicated to solving this issue by pursuing certain prior properties, yet the treatments are all enforced in the original space, resulting in limited feature information. To improve performance while guaranteeing running efficiency, in this work, we propose a latent-space unfolding network (LsUNet). Specifically, through an elaborately designed reversible network, the inputs are first mapped to a channel-lifted latent space, which sufficiently taps the potential of capturing spatially invariant features. Within the latent space, we then unfold an accelerated optimization algorithm to iterate toward an efficient and feasible solution, in which a parallel dual-domain update is equipped for better feature fusion. Finally, an inverse embedding transformation of the recovered high-dimensional representation is applied to achieve the expected estimation. LsUNet enjoys high interpretability due to its physically induced modules, which not only facilitates an intuitive understanding of the internal operating mechanism but also endows it with high generalization ability. Comprehensive experiments on different datasets and various sampling rates/patterns demonstrate the advantages of our proposal over the latest methods both visually and numerically.

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

  • Hongpeng Lin
  • Ludan Ruan
  • Wenke Xia
  • Peiyu Liu
  • Jingyuan Wen
  • Yixin Xu
  • Di Hu
  • Ruihua Song
  • Wayne Xin Zhao
  • Qin Jin
  • Zhiwu Lu

To facilitate the research on intelligent and human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty in capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLM) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall. Furthermore, no existing model can solve all the above challenges well. There is still a large room for future improvements, even for LLM with visual extensions. Our dataset is available at https://ruc-aimind.github.io/projects/TikTalk/.

IGG: Improved Graph Generation for Domain Adaptive Object Detection

  • Pengteng Li
  • Ying He
  • F. Richard Yu
  • Pinhao Song
  • Dongfu Yin
  • Guang Zhou

Domain Adaptive Object Detection (DAOD) transfers an object detector from a labeled source domain to a novel unlabeled target domain. Recent works bridge the domain gap by aligning cross-domain pixel pairs in a non-Euclidean graphical space and minimizing the domain discrepancy to adapt the semantic distribution. Despite great successes, these methods model graphs roughly with coarse semantic sampling, since they ignore non-informative noise and fail to concentrate on precise semantic alignment. Besides, the coarse graph generation inevitably contains abnormal nodes. These challenges result in biased domain adaptation. Therefore, we propose an Improved Graph Generation (IGG) framework that conducts high-quality graph generation for DAOD. Specifically, we design an Intensive Node Refinement (INR) module that reconstructs the noisily sampled nodes with a memory bank and contrastively regularizes the noisy features. For better semantic alignment, we decouple the domain-specific style and category-invariant content encoded in the graph covariance and selectively eliminate only the domain-specific style. Then, a Precision Graph Optimization (PGO) adaptor is proposed that utilizes variational inference to down-weight abnormal nodes. Comprehensive experiments on three adaptation benchmarks demonstrate that IGG achieves state-of-the-art results in unsupervised domain adaptation.

Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID

  • De Cheng
  • Lingfeng He
  • Nannan Wang
  • Shizhou Zhang
  • Zhen Wang
  • Xinbo Gao

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to match pedestrian images of the same identity from different modalities without annotations. Existing works mainly focus on alleviating the modality gap by aligning instance-level features of the unlabeled samples. However, the relationships between cross-modality clusters are not well explored. To this end, we propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters. Specifically, we design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM) algorithm through optimizing the maximum matching problem in a bipartite graph. Then, the matched pairwise clusters utilize shared visible and infrared pseudo-labels during the model training. Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at a cluster-level. Meanwhile, the cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the large modality discrepancy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art approaches by a large margin of 8.76% mAP on average.
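The bipartite-matching idea can be illustrated with a simplified one-to-one assignment between modality-specific cluster centroids (the paper's MBCCM is many-to-many; `scipy` and `numpy` are assumed to be available):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(vis_centroids: np.ndarray, ir_centroids: np.ndarray):
    """Simplified one-to-one cluster matching between modalities.
    Centroids: (n_vis, d) and (n_ir, d), assumed L2-normalized."""
    cost = 1.0 - vis_centroids @ ir_centroids.T     # cosine distance matrix
    row, col = linear_sum_assignment(cost)          # minimum-cost assignment
    return list(zip(row.tolist(), col.tolist()))    # matched (visible, infrared) pairs
```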

Faster Video Moment Retrieval with Point-Level Supervision

  • Xun Jiang
  • Zailei Zhou
  • Xing Xu
  • Yang Yang
  • Guoqing Wang
  • Heng Tao Shen

Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries. Existing VMR methods suffer from two defects: (1) massive, expensive temporal annotations are required to obtain satisfying performance; (2) complicated cross-modal interaction modules are deployed, which leads to high computational cost and low efficiency in the retrieval process. To address these issues, we propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR), which balances retrieval accuracy, efficiency, and annotation cost for VMR. Specifically, our proposed CFMR method learns from point-level supervision, where each annotation is a single frame randomly located within the target moment. Such a labeling strategy is 6 times cheaper than the conventional annotation of event boundaries. Furthermore, we also design a concept-based multimodal alignment mechanism to bypass the usage of cross-modal interaction modules during the inference process, remarkably improving retrieval efficiency. The experimental results on three widely used VMR benchmarks demonstrate that our proposed CFMR method achieves superior comprehensive performance to current state-of-the-art methods. Moreover, it significantly accelerates retrieval, requiring more than 100 times fewer FLOPs than existing approaches with point-level supervision. Our open-source implementation is available at https://github.com/CFM-MSG/Code_CFMR.

IDDR-NGP: Incorporating Detectors for Distractors Removal with Instant Neural Radiance Field

  • Xianliang Huang
  • Jiajie Gou
  • Shuhang Chen
  • Zhizhou Zhong
  • Jihong Guan
  • Shuigeng Zhou

This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which aggregates information from multi-view corrupted images. All components can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support research on distractor removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with existing SOTA desnowing methods and is capable of accurately removing both realistic and synthetic distractors.

G-PCC++: Enhanced Geometry-based Point Cloud Compression

  • Junzhe Zhang
  • Tong Chen
  • Dandan Ding
  • Zhan Ma

The MPEG Geometry-based Point Cloud Compression (G-PCC) standard is developed for lossy encoding of point clouds to enable immersive services over the Internet. However, lossy G-PCC introduces superimposed distortions from both geometry and attribute information, seriously deteriorating the Quality of Experience (QoE). This paper thus proposes the Enhanced G-PCC (G-PCC++) to effectively address the compression distortion and restore the quality. G-PCC++ separates the enhancement into two stages: it first enhances the geometry and then maps the decoded attributes to the enhanced geometry for refinement. For geometry restoration, a k Nearest Neighbors (kNN)-based Linear Interpolation is first used to generate a denser geometry representation, on top of which GeoNet further generates sufficient candidates to restore the geometry through probability-sorted selection. For attribute enhancement, a kNN-based Gaussian Distance Weighted Mapping is devised to re-colorize all points in the enhanced geometry tensor, which are then refined by AttNet for the final reconstruction. G-PCC++ is the first solution addressing geometry and attribute artifacts together. Extensive experiments on several public datasets demonstrate the superiority of G-PCC++; e.g., on the solid point cloud dataset 8iVFB, G-PCC++ outperforms G-PCC by 88.24% (80.54%) BD-BR in the D1 (D2) geometry measurement and by 14.64% (13.09%) BD-BR in the Y (YUV) attribute. Moreover, when considering both geometry and attribute, G-PCC++ also largely surpasses G-PCC by 25.58% BD-BR under the PCQM assessment.
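A toy version of kNN-based densification is sketched below as a crude stand-in for the first geometry-restoration step (it does not include GeoNet or the probability-sorted selection):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_densify(points: np.ndarray, k: int = 3) -> np.ndarray:
    """Toy kNN-based linear interpolation: for each point, insert the midpoints
    between it and its k nearest neighbours to obtain a denser geometry.
    points: (N, 3) array of xyz coordinates."""
    tree = cKDTree(points)
    # query k+1 neighbours because the nearest one is the point itself
    _, idx = tree.query(points, k=k + 1)
    new_pts = [(points + points[idx[:, j]]) / 2.0 for j in range(1, k + 1)]
    dense = np.concatenate([points] + new_pts, axis=0)
    return np.unique(dense.round(decimals=6), axis=0)   # drop duplicate points
```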

Gradient-Free Textual Inversion

  • Zhengcong Fei
  • Mingyuan Fan
  • Junshi Huang

Recent works on personalized text-to-image generation usually learn to bind a special token to specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to ask whether we can optimize the textual inversion by accessing only the process of model inference, since requiring only forward computation to determine the textual inversion retains the benefits of lower GPU memory usage, simple deployment, and secure access to scalable models. In this paper, we introduce a gradient-free framework to optimize the continuous textual inversion with an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion by taking visual and text vocabulary information into account. Then, we decompose the evolutionary-strategy optimization into dimension reduction of the search space and non-convex gradient-free optimization in the subspace, which significantly accelerates the optimization process with negligible performance loss. Experiments on several creative applications demonstrate that the performance of a text-to-image model equipped with our proposed gradient-free method is comparable to that of gradient-based counterparts, while offering flexibility across various GPU/CPU platforms, flexible deployment, and computational efficiency.
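A generic evolution-strategy loop for gradient-free embedding optimization is sketched below; the paper additionally performs subspace dimension reduction, and `score_fn` here is a hypothetical black box that only runs model inference:

```python
import numpy as np

def evolve_embedding(score_fn, dim: int = 768, pop: int = 16,
                     iters: int = 100, sigma: float = 0.02, seed: int = 0):
    """Minimal evolution-strategy loop for gradient-free textual inversion:
    score_fn(embedding) returns a scalar to maximize (e.g., image-text
    similarity of generations) using forward passes only."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(scale=0.01, size=dim)               # initial token embedding
    for _ in range(iters):
        noise = rng.normal(size=(pop, dim))
        candidates = mean + sigma * noise                 # perturbed population
        scores = np.array([score_fn(c) for c in candidates])
        elite = candidates[np.argsort(scores)[-pop // 4:]]  # keep the top 25%
        mean = elite.mean(axis=0)                         # move toward the elites
    return mean
```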

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

  • Qiaosong Qi
  • Le Zhuo
  • Aixi Zhang
  • Yue Liao
  • Fei Fang
  • Si Liu
  • Shuicheng Yan

When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further aligns its embedding space to motion via a contrastive loss. When training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods.
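To illustrate the kind of contrastive alignment between music and motion embeddings described above, here is a generic symmetric InfoNCE sketch; the exact loss, embedding dimensions, and temperature in the paper may differ, and the inputs here are random placeholders.

    import torch
    import torch.nn.functional as F

    def contrastive_align(music_emb, motion_emb, temperature=0.07):
        """Symmetric InfoNCE loss: matching (music, motion) pairs are positives,
        all other pairings in the batch act as negatives."""
        music = F.normalize(music_emb, dim=-1)
        motion = F.normalize(motion_emb, dim=-1)
        logits = music @ motion.t() / temperature         # (B, B) similarity matrix
        labels = torch.arange(music.size(0), device=music.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    music_emb = torch.randn(8, 256)    # e.g. pooled audio-model features
    motion_emb = torch.randn(8, 256)   # e.g. pooled dance-motion features
    print(float(contrastive_align(music_emb, motion_emb)))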

Video Inverse Tone Mapping Network with Luma and Chroma Mapping

  • Peihuan Huang
  • Gaofeng Cao
  • Fei Zhou
  • Guoping Qiu

With the popularity of consumer high dynamic range (HDR) display devices, video inverse tone mapping (iTM) has become a research hotspot. However, existing methods are designed in perceptually non-uniform color spaces (e.g., RGB and YCbCr), resulting in limited quality of the HDR video they render. Considering the two key factors involved in the video iTM task, luma and chroma, in this paper we design an ICtCp color space based video iTM model, which reproduces high-quality HDR video by processing luma and chroma information. Benefitting from the decorrelated perception of luma and chroma in the ICtCp color space, two global mapping networks (INet and TPNet) are developed to enhance the luma and chroma pixels, respectively. However, luma and chroma mapping in the iTM task may be affected by color appearance phenomena. Thus, a luma-chroma adaptation transform network (LCATNet) is proposed to process the luma and chroma pixels affected by color appearance phenomena, which complements the globally enhanced luma and chroma pixels with local details. In the LCATNet, each of the luma and chroma mappings is adaptively adjusted according to both the luma and the chroma information. Besides, benefitting from the perceptually consistent property of the ICtCp color space, equal pixel errors draw equal model attention during training. Thus, the proposed model can correctly render luma and chroma information without highlighting special regions or designing special training losses. Extensive experimental results demonstrate the effectiveness of the proposed model.

Learning Pixel-wise Alignment for Unsupervised Image Stitching

  • Qi Jia
  • Xiaomei Feng
  • Yu Liu
  • Xin Fan
  • Longin Jan Latecki

Image stitching aims to align a pair of images in the same view. Generating precise alignment with natural structures is challenging for image stitching, as there is no wider field-of-view image as a reference, especially in non-coplanar practical scenarios. In this paper, we propose an unsupervised image stitching framework, breaking through the coplanar constraints in homography estimation and yielding accurate pixel-wise alignment under limited overlapping regions. First, we generate a global transformation by iterative dense feature matching combined with an error control strategy to alleviate the differences introduced by large parallax. Second, we propose a pixel-wise warping network, equipped with a large-scale feature extractor and a correlative feature enhancement module, to explicitly learn correspondences between the inputs and generate accurate pixel-level offsets under novel constraints on both overlapping and non-overlapping regions. Notably, we leverage the pixel-level offsets in the overlapping area to guide the adjustment in the non-overlapping area under content and structure consistency constraints, rendering a natural transition between the two regions and suppressing distortion over the entire stitched image. The proposed method achieves state-of-the-art performance that surpasses both traditional and deep learning approaches by a large margin. It also achieves the shortest execution time and the best generalization ability on the traditional dataset.

FashionDiff: A Controllable Diffusion Model Using Pairwise Fashion Elements for Intelligent Design

  • Han Yan
  • Haijun Zhang
  • Xiangyu Mu
  • Jicong Fan
  • Zhao Zhang

The process of fashion design involves creative expression through various methods, including sketch drawing, brush painting, and choices of textures and colors, all of which are employed to characterize the originality and uniqueness of the designed fashion items. Despite recent advances in intelligence-driven fashion design, the complexity of the diverse elements of a fashion item, such as its texture, color and shape, which are associated with the semantic information conveyed, continues to present challenges in terms of generating high-quality fashion images as well as achieving a controllable editing process. To address this issue, we propose a unified framework, FashionDiff, that leverages the diverse elements in fashion items to generate new items. Initially, we collected a large number of fashion images with multiple categories and created pairwise data in terms of sketch and additional data, such as brush areas, textures, or colors. To eliminate semantic discrepancies between these pairwise datasets, we introduce a feature modulation fusion (FMFusion) process, which enables interactive communication among different images, allowing them to be fused into latent spaces characterized by different resolutions. In order to produce high-quality editable fashion images, we develop a generator based on a state-of-the-art diffusion model called FD-ControlNet, which integrates latent spaces into different layers of the generator to generate ready-to-wear fashion items. Qualitative and quantitative experimental results demonstrate the effectiveness of our proposed method, and suggest that our model can offer flexible control over the generated images in terms of sketches, brush areas, textures, and colors.

Learning Non-Uniform-Sampling for Ultra-High-Definition Image Enhancement

  • Wei Yu
  • Qi Zhu
  • Naishan Zheng
  • Jie Huang
  • Man Zhou
  • Feng Zhao

Ultra-high-definition (UHD) image enhancement is a challenging problem that aims to effectively and efficiently recover clean UHD images. To maintain efficiency, the straightforward approach is to downsample and perform most computations on low-resolution images. However, previous studies typically rely on uniform and content-agnostic downsampling that treats all regions equally regardless of their complexity, thus limiting detail reconstruction in UHD image enhancement. To alleviate this issue, we propose a novel spatially-variant and invertible non-uniform downsampler that adaptively adjusts the sampling rate according to the richness of details, magnifying important regions to preserve more information (e.g., sparse sampling points for the sky, dense sampling points for buildings). Built upon this sampler, we propose a novel Non-uniform-Sampling Enhancement Network (NSEN) consisting of two core designs: 1) content-guided downsampling, which extracts a texture representation to guide the sampler to perform content-aware downsampling and produce detail-preserving low-resolution images; 2) invertible pixel-alignment, which iteratively remaps the forward sampling process to eliminate the deformations caused by the non-uniform downsampling, thus producing detail-rich clean UHD images. To demonstrate the superiority of our proposed model, we conduct extensive experiments on various UHD enhancement tasks. The results show that the proposed NSEN yields better performance than other state-of-the-art methods both visually and quantitatively.
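The following 1-D toy sketch only illustrates the general idea of content-aware non-uniform sampling (detail-rich regions receive denser samples); the actual NSEN sampler is learned and operates in 2-D, and the gradient-based detail measure here is an assumption.

    import numpy as np

    def nonuniform_sample_positions(signal, n_samples):
        """Place sample positions so that local density follows gradient magnitude:
        flat regions get few samples, detailed regions get many."""
        grad = np.abs(np.gradient(signal)) + 1e-6       # detail measure per position
        cdf = np.cumsum(grad)
        cdf /= cdf[-1]
        # Invert the CDF at uniformly spaced quantiles -> non-uniform positions.
        quantiles = (np.arange(n_samples) + 0.5) / n_samples
        return np.searchsorted(cdf, quantiles)

    x = np.linspace(0, 1, 1000)
    signal = np.where(x < 0.5, 0.2, np.sin(60 * x))     # flat half, detailed half
    positions = nonuniform_sample_positions(signal, 64)
    # Far more samples land on the detailed half of the signal.
    print((positions < 500).sum(), (positions >= 500).sum())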

Hierarchical Dynamic Image Harmonization

  • Haoxing Chen
  • Zhangxuan Gu
  • Yaohui Li
  • Jun Lan
  • Changhua Meng
  • Weiqiang Wang
  • Huaxiong Li

Image harmonization is a critical task in computer vision, which aims to adjust the foreground to make it compatible with the background. Recent works mainly focus on using global transformations (i.e., normalization and color curve rendering) to achieve visual consistency. However, these models ignore local visual consistency, and their huge model sizes limit their harmonization ability on edge devices. In this paper, we propose a hierarchical dynamic network (HDNet) that adapts features from a local to a global view for better feature transformation in efficient image harmonization. Inspired by the success of various dynamic models, a local dynamic (LD) module and a mask-aware global dynamic (MGD) module are proposed in this paper. Specifically, LD matches local representations between the foreground and background regions based on semantic similarities, then adaptively adjusts every foreground local representation according to the appearance of its K-nearest-neighbor background regions. In this way, LD can produce more realistic images at a finer-grained level while enjoying the characteristic of semantic alignment. MGD applies distinct convolutions to the foreground and background, learning the representations of foreground and background regions as well as their correlations to the global harmonization, facilitating local visual consistency for the images much more efficiently. Experimental results demonstrate that the proposed HDNet reduces the total model parameters by more than 80% compared to previous methods, while still attaining state-of-the-art performance on the popular iHarmony4 dataset. Additionally, we introduce a lightweight version of HDNet, i.e., HDNet-lite, which has only 0.65MB of parameters yet still achieves competitive performance. Our code is available at https://github.com/chenhaoxing/HDNet.

Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

  • Sha Guo
  • Zhuo Chen
  • Yang Zhao
  • Ning Zhang
  • Xiaotong Li
  • Lingyu Duan

Traditional image codecs prioritize signal fidelity and human perception, often neglecting machine vision tasks. Deep learning approaches have shown promising coding performance by leveraging rich semantic embeddings that can be optimized for both human and machine vision. However, these compact embeddings struggle to represent low-level details like contours and textures, leading to imperfect reconstructions. Additionally, existing learning-based coding tools lack scalability. To address these challenges, this paper presents a content-adaptive diffusion model for scalable image compression. The method encodes accurate texture through a diffusion process, enhancing human perception while preserving important features for machine vision tasks. It employs a Markov palette diffusion model with commonly-used feature extractors and image generators, enabling efficient data compression. By utilizing collaborative texture-semantic feature extraction and pseudo-label generation, the approach accurately learns texture information. A content-adaptive Markov palette diffusion model is then applied to capture both low-level texture and high-level semantic knowledge in a scalable manner. This framework enables elegant compression ratio control by flexibly selecting intermediate diffusion states, eliminating the need for deep learning model re-training at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection. It achieves superior perceptual quality scores compared to state-of-the-art methods.

Towards Decision-based Sparse Attacks on Video Recognition

  • Kaixun Jiang
  • Zhaoyu Chen
  • Xinyu Zhou
  • Jingyu Zhang
  • Lingyi Hong
  • JiaFeng Wang
  • Bo Li
  • Yan Wang
  • Wenqiang Zhang

Recent studies indicate that sparse attacks, which modify only a small set of pixels in the input under an l0 norm constraint, threaten the security of deep learning models. While existing research has primarily focused on sparse attacks against image models, there is a notable gap in evaluating the robustness of video recognition models. To bridge this gap, we are the first to study sparse video attacks and propose an attack framework named V-DSA in the most challenging decision-based setting, in which threat models only return the predicted hard label. Specifically, V-DSA comprises two modules: a Cross-Modal Generator (CMG) for query-free transfer attacks on each frame and an Optical flow Grouping Evolution algorithm (OGE) for query-efficient spatial-temporal attacks. CMG processes each frame to generate a transfer video as the starting point of the attack, exploiting the feature similarity between image classification and video recognition models. OGE first initializes populations based on the transfer video and then leverages optical flow to establish temporal connections among the perturbed pixels in each frame, which reduces the parameter space and specifically breaks the temporal relationship between frames. Finally, OGE complements the optical flow modeling with grouping evolution, realizing a coarse-to-fine attack that avoids falling into local optima. In addition, OGE makes the perturbation temporally coherent while balancing the number of perturbed pixels per frame, further increasing the imperceptibility of the attack. Extensive experiments demonstrate that V-DSA achieves state-of-the-art performance in terms of both threat effectiveness and imperceptibility. We hope V-DSA can provide valuable insights into the security of video recognition systems.

RAIRNet: Region-Aware Identity Rectification for Face Forgery Detection

  • Mingqi Fang
  • Lingyun Yu
  • Hongtao Xie
  • Junqiang Wu
  • Zezheng Wang
  • Jiahong Li
  • Yongdong Zhang

The malicious usage of facial manipulation techniques has increased the demand for face forgery detection research. Recently, identity-based approaches have attracted much attention due to the effective observation of identity inconsistency. However, several nonnegligible problems remain: (1) the generic identity extractor is trained entirely on real images, leading to an enormous identity representation bias when processing forged content; (2) the identity information of a forged image is hybrid and regionally distributed, while a single global identity feature can hardly reflect this local identity inconsistency. To solve these problems, in this paper a novel Region-Aware Identity Rectification Network (RAIRNet) is proposed to effectively rectify the identity bias and adaptively exploit the inconsistent local regions. Firstly, to address the identity bias problem, our RAIRNet adopts a two-branch architecture consisting of a Generic Identity Extractor (GIE) branch and a Bias Diminishing Module (BDM) branch. The BDM branch rectifies the bias introduced by the GIE branch through a prototype-based training schema. This two-branch architecture effectively encourages the model to adapt to forged content while maintaining its focus on the identity space. Secondly, to exploit local identity inconsistency, a novel Meta Identity Filter Generator (MIFG) is devised in a meta-learning manner to generate a region-aware filter based on the identity prior. This region-aware filter adaptively exploits local inconsistency clues and activates the discriminative local regions. Moreover, to balance local and global information and highlight the forensic clues, an Adaptive Weight Assignment Mechanism (AWAM) is proposed to assign adaptive importance weights to the two branches. Extensive experiments on various datasets show the superiority of our RAIRNet. In particular, on the challenging DFDCp dataset, our approach outperforms previous binary-based and identity-based methods by 10.3% and 5.5%, respectively.

Multispectral Object Detection via Cross-Modal Conflict-Aware Learning

  • Xiao He
  • Chang Tang
  • Xin Zou
  • Wei Zhang

Multispectral object detection has gained significant attention due to its potential in all-weather applications, particularly those involving visible (RGB) and infrared (IR) images. Despite substantial advancements in this domain, current methodologies primarily rely on rudimentary accumulation operations to combine complementary information from disparate modalities, overlooking the semantic conflicts that arise from the intrinsic heterogeneity among modalities. To address this issue, we propose a novel learning network, the Cross-modal Conflict-Aware Learning Network (CALNet), that takes into account both semantic conflicts and complementary information within multi-modal input. Our network comprises two pivotal modules: the Cross-Modal Conflict Rectification Module (CCR) and the Selected Cross-modal Fusion (SCF) Module. The CCR module mitigates modal heterogeneity by examining the contextual information of analogous pixels, thus alleviating semantic conflicts in the multi-modal information. Subsequently, the semantically coherent information is supplied to the SCF module, which fuses multi-modal features by assessing intra-modal importance to select semantically rich features and by mining inter-modal complementary information. To assess the effectiveness of our proposed method, we develop a two-stream one-stage detector based on CALNet for multispectral object detection. Comprehensive experimental results demonstrate that our approach considerably outperforms existing methods in resolving the cross-modal semantic conflict issue and achieves state-of-the-art detection accuracy.

Decoupled Cross-Scale Cross-View Interaction for Stereo Image Enhancement in the Dark

  • Huan Zheng
  • Zhao Zhang
  • Jicong Fan
  • Richang Hong
  • Yi Yang
  • Shuicheng Yan

Low-light stereo image enhancement (LLSIE) aims at improving the visual quality of stereo images captured in dark conditions. However, existing methods have shown limited success in detail recovery and illumination adjustment. This can be attributed to two main factors: 1) insufficient single-scale inter-view interaction hinders the exploitation of valuable cross-view cues; 2) the lack of long-range dependency modeling makes it difficult to deal with the spatially long-range effects caused by illumination degradation. To address these limitations, we propose a novel LLSIE model named Decoupled Cross-Scale Cross-View Interaction Network (DCI-Net). Our model introduces a key component called the Decoupled Interaction Module (DIM), designed to promote sufficient dual-view information exchange. DIM decouples the dual-view information interaction by discovering multi-scale cross-view correlations and further exploring cross-scale information flow. Furthermore, we present a Spatial-channel Information Mining Block (SIMB) for intra-view feature extraction, and the benefits are twofold: one is long-range dependency capture to build spatially long-range relationships, and the other is expanded channel information refinement that enhances information flow in the channel dimension. Extensive experiments conducted on the Flickr1024, KITTI 2012, KITTI 2015, and Middlebury datasets show that our method obtains better illumination adjustment and detail recovery, and achieves SOTA performance compared to other related methods.

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

  • Kexin Li
  • Zongxin Yang
  • Lei Chen
  • Yi Yang
  • Jun Xiao

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features along their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block which, when stacked, enables capturing audio-visual fine-grained combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded masks adhere to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at https://github.com/aspirinone/CATR.github.io.

S-OmniMVS: Incorporating Sphere Geometry into Omnidirectional Stereo Matching

  • Zisong Chen
  • Chunyu Lin
  • Lang Nie
  • Zhijie Shen
  • Kang Liao
  • Yuanzhouhan Cao
  • Yao Zhao

Multi-fisheye stereo matching is a promising task that employs the traditional multi-view stereo (MVS) pipeline with spherical sweeping to acquire omnidirectional depth. However, existing omnidirectional MVS technologies neglect fisheye and omnidirectional distortions, yielding inferior performance. In this paper, we revisit omnidirectional MVS by incorporating three sphere geometry priors: spherical projection, spherical continuity, and spherical position. To deal with fisheye distortion, we propose a new distortion-adaptive fusion module that converts fisheye inputs into distortion-free spherical tangent representations by constructing a spherical projection space. These multi-scale features are then adaptively aggregated with additional learnable offsets to enhance content perception. To handle omnidirectional distortion, we present a new spherical cost aggregation module with a comprehensive consideration of spherical continuity and position. Concretely, we first design a rotation continuity compensation mechanism to ensure omnidirectional depth consistency at the left-right boundaries without introducing extra computation. On the other hand, we encode the geometry-aware spherical position and incorporate it into the cost aggregation to relieve panoramic distortion and perceive the 3D structure. Furthermore, to avoid the excessive concentration of depth hypotheses caused by inverse-depth linear sampling, we develop a segmented sampling strategy that combines linear and exponential spaces. Together with the three sphere priors, this forms S-OmniMVS. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art (SoTA) solutions by a large margin on various datasets, both quantitatively and qualitatively.

Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

  • Yichen Zhang
  • Yifang Yin
  • Ying Zhang
  • Zhenguang Liu
  • Zheng Wang
  • Roger Zimmermann

Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images that were collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously on domain level with early alignment and class level with supervised contrastive learning, which endows model training and knowledge transfer with stronger robustness. The empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms the state-of-the-art cervical dysplasia visual inspection by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.
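To illustrate the general flavor of prototype-based transferability estimation (not the paper's exact method), the sketch below weights each cross-domain sample by the cosine similarity between its feature and the target-domain prototype of its class; the feature dimensions and clipping rule are placeholder assumptions.

    import numpy as np

    def transferability_weights(src_feats, src_labels, tgt_feats, tgt_labels):
        """Weight each source-domain sample by the cosine similarity between its
        feature and the target-domain class prototype of the same label."""
        def l2norm(x):
            return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

        src_feats, tgt_feats = l2norm(src_feats), l2norm(tgt_feats)
        classes = np.unique(tgt_labels)
        prototypes = {c: l2norm(tgt_feats[tgt_labels == c].mean(axis=0)) for c in classes}
        weights = np.array([float(f @ prototypes[y]) for f, y in zip(src_feats, src_labels)])
        return np.clip(weights, 0.0, None)  # keep only non-negative (plausibly transferable) weights

    src_feats, src_labels = np.random.randn(50, 128), np.random.randint(0, 3, 50)
    tgt_feats, tgt_labels = np.random.randn(200, 128), np.random.randint(0, 3, 200)
    print(transferability_weights(src_feats, src_labels, tgt_feats, tgt_labels).shape)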

When Measures are Unreliable: Imperceptible Adversarial Perturbations toward Top-k Multi-Label Learning

  • Yuchen Sun
  • Qianqian Xu
  • Zitai Wang
  • Qingming Huang

With the great success of deep neural networks, adversarial learning has received widespread attention in various studies, ranging from multi-class learning to multi-label learning. However, existing adversarial attacks toward multi-label learning only pursue the traditional visual imperceptibility and ignore a new perceptibility problem arising from measures such as Precision@k and mAP@k. Specifically, when a well-trained multi-label classifier performs far below expectation on some samples, the victim can easily realize that this performance degeneration stems from an attack rather than the model itself. Therefore, an ideal multi-label adversarial attack should manage not only to deceive visual perception but also to evade the monitoring of such measures. To this end, this paper first proposes the concept of measure imperceptibility. Then, a novel loss function is devised to generate adversarial perturbations that achieve both visual and measure imperceptibility. Furthermore, an efficient algorithm with a convex objective is established to optimize this loss. Finally, extensive experiments on large-scale benchmark datasets, such as PASCAL VOC 2012, MS COCO, and NUS WIDE, demonstrate the superiority of our proposed method in attacking top-k multi-label systems.
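A toy numeric illustration of why measure imperceptibility matters: a crude attack that simply suppresses all true labels collapses Precision@k, which a monitored deployment could easily flag, whereas a measure-imperceptible attack must keep such statistics close to the clean model's. The data and the naive attack below are entirely synthetic.

    import numpy as np

    def precision_at_k(scores, labels, k=3):
        """Fraction of the top-k predicted labels that are truly relevant (per sample)."""
        topk = np.argsort(-scores, axis=1)[:, :k]
        hits = np.take_along_axis(labels, topk, axis=1)
        return hits.mean()

    rng = np.random.default_rng(0)
    labels = (rng.random((100, 20)) < 0.2).astype(float)     # multi-label ground truth
    scores = labels * 2 + rng.normal(0, 0.5, labels.shape)   # a reasonably good classifier

    crude = scores - 5 * labels        # naive attack: suppress every true label
    print(precision_at_k(scores, labels), precision_at_k(crude, labels))
    # The large drop in Precision@k is exactly what measure monitoring would detect.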

Karma: Adaptive Video Streaming via Causal Sequence Modeling

  • Bowei Xu
  • Hao Chen
  • Zhan Ma

Optimal adaptive bitrate (ABR) decisions depend on a comprehensive characterization of state transitions that involve interrelated modalities over time, including environmental observations, returns, and actions. However, state-of-the-art learning-based ABR algorithms rely solely on past observations to decide the next action. This paradigm tends to cause a chain of deviations from the optimal action when encountering unfamiliar observations, which consequently undermines model generalization.

This paper presents Karma, an ABR algorithm that utilizes causal sequence modeling to improve generalization by comprehending the interrelated causality among past observations, returns, and actions, and by refining the action in a timely manner when deviation occurs. Unlike direct observation-to-action mapping, Karma recurrently maintains a multi-dimensional time series of observations, returns, and actions as input and employs causal sequence modeling via a decision transformer to determine the next action. In the input sequence, Karma uses the maximum cumulative future quality of experience (QoE) (a.k.a. QoE-to-go) as an extended return signal, which is periodically estimated based on current network conditions and playback status. We evaluate Karma through trace-driven simulations and real-world field tests, demonstrating superior performance compared to existing state-of-the-art ABR algorithms, with an average QoE improvement ranging from 10.8% to 18.7% across diverse network conditions. Furthermore, Karma exhibits strong generalization capabilities, showing leading performance under unseen networks in both simulations and real-world tests.
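The sketch below only illustrates how a decision-transformer-style input might interleave (return-to-go, observation, action) tokens over a sliding window; the feature names, window length, and token layout are illustrative assumptions, not Karma's actual interface.

    from collections import deque

    WINDOW = 8  # assumed history length fed to the causal sequence model

    class AbrHistory:
        """Keeps the last WINDOW (QoE-to-go, observation, action) triples and
        flattens them into one chronologically ordered input sequence."""
        def __init__(self):
            self.buf = deque(maxlen=WINDOW)

        def push(self, qoe_to_go, observation, action):
            self.buf.append((qoe_to_go, observation, action))

        def as_sequence(self):
            seq = []
            for r, o, a in self.buf:   # causal order: return, then state, then action
                seq.extend([("return", r), ("obs", o), ("action", a)])
            return seq

    hist = AbrHistory()
    for t in range(3):
        obs = {"throughput_mbps": 4.2 + t, "buffer_s": 12.0 - t}  # hypothetical features
        hist.push(qoe_to_go=50.0 - 3 * t, observation=obs, action=t % 4)
    print(len(hist.as_sequence()))  # 3 triples -> 9 tokens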

Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data

  • Xinting Liao
  • Chaochao Chen
  • Weiming Liu
  • Pengyang Zhou
  • Huabin Zhu
  • Shuheng Shen
  • Weiqiang Wang
  • Mengling Hu
  • Yanchao Tan
  • Xiaolin Zheng

Federated learning (FL) is a distributed machine learning paradigm that requires collaboration between a server and a series of clients with decentralized data. To make FL effective in real-world applications, existing work has been devoted to improving the modeling of decentralized non-IID data. In non-IID settings, there is intra-client inconsistency arising from imbalanced data modeling and inter-client inconsistency among heterogeneous client distributions, which not only hinder sufficient representation of minority data but also bring discrepant model deviations. However, previous work overlooks tackling these two coupled inconsistencies together. In this work, we propose FedRANE, which consists of two main modules, i.e., local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. Specifically, in each client, LRA mines the similarity relations among different data samples and enhances the minority sample representations with their neighbors using attentive message passing. On the server, GNE reaches an agreement among the inconsistent and discrepant model deviations from clients, which encourages the global model to update in the direction of the global optimum without breaking down the clients' optimization toward their local optima. We conduct extensive experiments on four benchmark datasets to show the superiority of FedRANE in enhancing the performance of FL with non-IID data.

SSPU-Net: A Structure Sensitive Point Cloud Upsampling Network with Multi-Scale Spatial Refinement

  • Jin Wang
  • Jiade Chen
  • Yunhui Shi
  • Nam Ling
  • Baocai Yin

Point cloud upsampling aims to generate a dense and uniform point set from a sparse and irregular point set. The core challenge is to accurately restore the geometric structure and local details. To overcome the challenge, this paper presents a novel frequency-aware attention based point cloud upsampling approach, which combines graph filtering and channel attention based on the detection of high spatial-frequency components like edges and contours in the human visual system. To aggregate the features more efficiently, an intra-feature and inter-feature (I2-feature) aggregation block and a structure sensitive transformer block are introduced. On one hand, the I2-feature aggregation block serves to create a complete local representation of each point by aggregating intra and inter features. On the other hand, the structure sensitive transformer block aims to enhance the quality of the expanded point features by capturing the global geometric structures and the fine local details. Furthermore, to improve the quality of the coarse output, a multi-scale spatial refinement unit is applied, which leverages attentional feature fusion and multi-scale attention. Extensive qualitative and quantitative results on both synthetic and real-scanned datasets validate our proposed scheme outperforms state-of-the-art point cloud upsampling methods.

On Physically Occluded Fake Identity Document Detection

  • Haoyue Wang
  • Sheng Li
  • Silu Cao
  • Rui Yang
  • Jishen Zeng
  • Zhenxing Qian
  • Xinpeng Zhang

Many online applications require users to upload their identity documents for authentication. The fake identity document is one of the main threats compromising the security and reliability of such online applications. Existing techniques focus on the detection of digitally forged identity documents and neglect the impact of physical forgeries. In this paper, we look into the problem of detecting physically occluded fake identity documents, which can be easily produced without any image processing knowledge. We observe that physical occlusions inevitably produce occluded boundaries on the document. To take advantage of this, we propose an Occluded Boundary Representation Learning (OBRL) module to progressively learn the occluded boundary features. These are then fed into an Occluded Boundary Message Passing (OBMP) module to effectively diffuse the physical occlusion traces and enhance the backbone features for robust detection. We newly construct a Physically Occluded Fake ID Card image dataset (POID) for evaluation. Various experiments are conducted on the POID, where our scheme achieves 99.6% accuracy in detecting physically occluded fake ID card images, with a mAP of over 85% for localizing the occlusion regions.

Dynamic View Synthesis with Spatio-Temporal Feature Warping from Sparse Views

  • Deqi Li
  • Shi-Sheng Huang
  • Tianyu Shen
  • Hua Huang

Significant progress has been made in novel view synthesis of dynamic scenes from sparse input views. However, achieving spatio-temporal consistency in dynamic view synthesis remains challenging for previous approaches, since the spatio-temporal correlation for view synthesis has not been fully explored. In this paper, we propose a spatio-temporal feature warping (STFW) mechanism, which can be embedded into a deep model to produce high-quality and spatio-temporally consistent view synthesis results. The two core components of STFW are: (1) a spatial feature warping (SFW) module, which enables adaptive perception of multi-view context-consistent geometric information with a compact point cloud representation, and (2) a temporal feature warping (TFW) module that implicitly models the dynamic geometry by modeling the pixel shift in image coordinates. In the optimization process of view synthesis, SFW and TFW are integrated to exploit the spatio-temporal correlation cues across sparse input views and novel views. Leveraging the STFW, we further build an end-to-end dynamic view synthesis model for sparse input views. Qualitative and quantitative evaluations on public multi-view datasets demonstrate that our view synthesis pipeline achieves better visual quality than previous methods.

SESSION: Oral Session IX: Engaging Users with Multimedia -- Social-good, Fairness and Transparency

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

  • Shengfang Zhai
  • Yinpeng Dong
  • Qingni Shen
  • Shi Pu
  • Yuejian Fang
  • Hang Su

With the help of conditioning mechanisms, the state-of-the-art diffusion models have achieved tremendous success in guided image generation, particularly in text-to-image synthesis. To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attack on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis in diverse semantic levels. Specifically, we perform backdoor attacks on three levels of the vision semantics: Pixel-Backdoor, Object-Backdoor and Style-Backdoor. By utilizing a regularization loss, our methods efficiently inject backdoors into a large-scale text-to-image diffusion model while preserving its utility with benign inputs. We conduct empirical experiments on Stable Diffusion, the widely-used text-to-image diffusion model, demonstrating that the large-scale diffusion model can be easily backdoored within a few fine-tuning steps. We conduct additional experiments to explore the impact of different types of textual triggers, as well as the backdoor persistence during further training, providing insights for the development of backdoor defense methods. Besides, our investigation may contribute to the copyright protection of text-to-image models in the future. Our Code: https://github.com/sf-zhai/BadT2I.

Deep Neural Network Watermarking against Model Extraction Attack

  • Jingxuan Tan
  • Nan Zhong
  • Zhenxing Qian
  • Xinpeng Zhang
  • Sheng Li

Deep neural network (DNN) watermarking is an emerging technique to protect the intellectual property of deep learning models. At present, many DNN watermarking algorithms have been proposed to achieve provenance verification by embedding identity information into the internals or prediction behaviors of the host model. However, most methods are vulnerable to model extraction attacks, where attackers collect output labels from the model to train a surrogate or a replica. To address this issue, we present a novel DNN watermarking approach, named SSW, which constructs an adaptive trigger set progressively by optimizing over a pair of symmetric shadow models to enhance robustness to model extraction. Precisely, we train a positive shadow model, supervised by the predictions of the host model, to mimic the behaviors of potential surrogate models. Additionally, a negative shadow model is normally trained to imitate irrelevant independent models. Using this pair of shadow models as a reference, we design a strategy to update the trigger samples appropriately such that they tend to persist in the host model and its stolen copies. Moreover, our method supports two specific embedding schemes: embedding the watermark via fine-tuning or from scratch. Our extensive experimental results on popular datasets demonstrate that our SSW approach outperforms state-of-the-art methods against various model extraction attacks under both trigger-set classification accuracy based and hypothesis test based verification. The results also show that our method is robust to common model modification schemes, including fine-tuning and model compression.

CoCa: A Connectivity-Aware Cascade Framework for Histology Gland Segmentation

  • Yu Bai
  • Bo Zhang
  • Zheng Zhang
  • Wu Liu
  • Jinwen Li
  • Xiangyang Gong
  • Wendong Wang

Gland segmentation is crucial for computer-aided diagnosis of adenocarcinoma. However, Topologically Critical Areas (TCAs), such as background tissues between two adjacent glands, can easily cause under- or over-connection of gland topological structures, which may lead to an opposite diagnosis of the malignancy degree. Therefore, we provide a novel perspective for gland segmentation by incorporating gland connectivity information to locate critical errors within TCAs. We propose a Connectivity-Aware Cascade framework (CoCa) that explicitly encodes gland connectivity information into the network to locate all connectivity errors during training and then leverages attention operations to focus on these errors. Since under- or over-connected glands change the Betti number (e.g., the number of connected components) of glands, we design a Connectivity Refinement Module (CRM) that compares the Betti number of each gland to locate connectivity errors. We propose CoCa-Net to mine the topological relations among different biomedical entities to guide gland prediction. We also use contrastive learning to separate pixel embeddings of different classes within TCAs through our connectivity-aware hard example sampling strategy. Extensive experiments on the GlaS and CRAG datasets demonstrate the effectiveness of CoCa over state-of-the-art methods.

Factorized Omnidirectional Representation based Vision GNN for Anisotropic 3D Multimodal MR Image Segmentation

  • Bo Zhang
  • YunPeng Tan
  • Zheng Zhang
  • Wu Liu
  • Hui Gao
  • Zhijun Xi
  • Wendong Wang

Anisotropy arises due to the influence of scanning equipment and parameters, resulting in a distance between slices that is often much greater than the actual distance represented by a single pixel within each slice. This can make 3D convolution inefficient or ineffective. To address the anisotropy issue, we propose FOrViG, an asymmetric vision graph neural network (GNN) framework that captures the correlation between different slices by constructing a graph over multi-slice images and aggregating information from adjacent nodes. This allows FOrViG to efficiently extract 3D spatial scale information and effectively identify feature nodes associated with small lesions, thereby improving the accuracy of lesion segmentation on anisotropic 3D multimodal MR images. As far as we know, this is the first study that adopts a GNN to address anisotropy issues. Additionally, we design a factorized omnidirectional representation method and a supervised multi-perspective contrastive learning strategy to enhance the capability of FOrViG in learning multi-scale omnidirectional representation information, constructing graphs, and distinguishing foreground from background. Extensive experiments on the PI-CAI dataset demonstrate that FOrViG significantly outperforms several state-of-the-art 3D segmentation algorithms.

Echoes: Unsupervised Debiasing via Pseudo-bias Labeling in an Echo Chamber

  • Rui Hu
  • Yahan Tu
  • Jitao Sang

Neural networks often learn spurious correlations when exposed to biased training data, leading to poor performance on out-of-distribution data. A biased dataset can be divided, according to biased features, into bias-aligned samples (i.e., with biased features) and bias-conflicting samples (i.e., without biased features). Recent debiasing works typically assume that no bias label is available during the training phase, as obtaining such information is challenging and labor-intensive. Following this unsupervised assumption, existing methods usually train two models: a biased model specialized to learn biased features and a target model that uses information from the biased model for debiasing. This paper first presents experimental analyses revealing that the existing biased models overfit to bias-conflicting samples in the training data, which negatively impacts the debiasing performance of the target models. To address this issue, we propose a straightforward and effective method called Echoes, which trains a biased model and a target model with a different strategy. We construct an "echo chamber" environment by reducing the weights of samples which are misclassified by the biased model, to ensure the biased model fully learns the biased features without overfitting to the bias-conflicting samples. The biased model then assigns lower weights on the bias-conflicting samples. Subsequently, we use the inverse of the sample weights of the biased model for training the target model. Experiments show that our approach achieves superior debiasing results compared to the existing baselines on both synthetic and real-world datasets. Our code is available at https://github.com/isruihu/Echoes.
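A minimal sketch of the echo-chamber reweighting idea described above: samples the biased model misclassifies are repeatedly down-weighted when training the biased model, and the inverse of the resulting weights is used for the target model. The decay factor and the toy data are assumptions, not values from the paper.

    import numpy as np

    def update_bias_weights(weights, misclassified, decay=0.5):
        """Echo-chamber step: shrink the weight of samples the biased model gets wrong,
        so the biased model keeps reinforcing only the biased (easy) samples."""
        new_w = weights.copy()
        new_w[misclassified] *= decay
        return new_w

    def target_weights(bias_weights, eps=1e-8):
        """Target model emphasizes what the biased model ignores: inverse weights."""
        inv = 1.0 / (bias_weights + eps)
        return inv / inv.sum()

    w = np.ones(10)
    misclassified = np.array([False, False, True, False, True,
                              False, False, False, False, True])
    for _ in range(3):                  # a few echo-chamber rounds
        w = update_bias_weights(w, misclassified)
    print(target_weights(w).round(3))   # bias-conflicting samples dominate target training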

FedCE: Personalized Federated Learning Method based on Clustering Ensembles

  • Luxin Cai
  • Naiyue Chen
  • Yuanzhouhan Cao
  • Jiahuan He
  • Yidong Li

Federated learning (FL) is a privacy-aware computing framework that enables multiple clients to collaborate in solving machine learning problems. In real scenarios, non-IID data held by different edge devices degrades the performance of global FL models. To address this issue, most FL methods use clustering algorithms to group clients with similar distributions. However, these methods do not fully utilize the distribution features of client data, resulting in a lack of generalization in the cluster model. To make the clusters better match the distribution features of user data, we propose a clustering-ensemble based federated learning method (FedCE) that associates each client with multiple clusters. We extract the features of client distributions to quantify the relationship between clients and clusters, and optimize the local model of each client through the historical performance of the cluster models. Furthermore, we dynamically estimate the number of clusters each client belongs to based on the diversity of client performance. We conduct experiments on scenarios with mixtures of two distributions, three distributions, and Dirichlet distributions. The results show that the FedCE algorithm outperforms state-of-the-art clustered FL methods in both cluster and client models under different data distributions.

SESSION: Oral Session X: Multimedia systems -- Data Systems Management and Indexing

Relative NN-Descent: A Fast Index Construction for Graph-Based Approximate Nearest Neighbor Search

  • Naoki Ono
  • Yusuke Matsui

Approximate Nearest Neighbor Search (ANNS) is the task of finding the database vector that is closest to a given query vector. Graph-based ANNS is the family of methods with the best balance of accuracy and speed for million-scale datasets. However, graph-based methods have the disadvantage of long index construction time. Recently, many researchers have improved the tradeoff between accuracy and speed during a search, but there is little research on accelerating index construction. We propose a fast graph construction algorithm, Relative NN-Descent (RNN-Descent). RNN-Descent combines NN-Descent, an algorithm for constructing approximate K-nearest neighbor graphs (K-NN graphs), and the RNG Strategy, an algorithm for selecting edges effective for search. This combination allows the direct construction of graph-based indexes without ANNS. Experimental results demonstrate that the proposed method has the fastest index construction speed, while its search performance is comparable to existing state-of-the-art methods such as NSG. For example, in experiments on the GIST1M dataset, index construction with the proposed method is 2x faster than that of NSG, and even faster than the construction of NN-Descent.
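To give a feel for the RNG-style edge selection mentioned above, here is a tiny brute-force sketch of a common variant of the occlusion rule that checks each candidate edge only against already-kept neighbors; it is purely illustrative and not the RNN-Descent implementation.

    import numpy as np

    def rng_prune(points, knn_indices):
        """Keep edge (p, q) only if no already-kept neighbor r of p is closer to
        both p and q than p and q are to each other (RNG-style occlusion rule)."""
        def d(a, b):
            return np.linalg.norm(points[a] - points[b])

        pruned = {}
        for p, neighbors in enumerate(knn_indices):
            kept = []
            for q in sorted(neighbors, key=lambda q: d(p, q)):  # closest candidates first
                if all(not (d(p, r) < d(p, q) and d(q, r) < d(p, q)) for r in kept):
                    kept.append(q)
            pruned[p] = kept
        return pruned

    pts = np.random.rand(20, 2)
    knn = np.argsort(((pts[:, None] - pts[None]) ** 2).sum(-1), axis=1)[:, 1:6]  # 5-NN, skip self
    print({p: len(es) for p, es in rng_prune(pts, knn).items()})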

Flexible and Secure Watermarking for Latent Diffusion Model

  • Cheng Xiong
  • Chuan Qin
  • Guorui Feng
  • Xinpeng Zhang

Since the significant advancements and open-source release of latent diffusion models (LDMs) for image generation, numerous researchers and enterprises have started fine-tuning the pre-trained models to generate specialized images for different objectives. However, criminals may turn to generating images with LDMs and then carry out illegal activities. Watermarking is a typical solution to this problem. But post-hoc watermarking methods can easily be bypassed to obtain non-watermarked images, and existing watermarking methods designed for LDMs can only embed a fixed message, i.e., the embedded message cannot be changed without retraining the model. Therefore, in this work, we propose an end-to-end watermarking method based on an encoder-decoder (ENDE) and a message-matrix. The message is embedded into generated images by fusing the message-matrix with the intermediate outputs in the forward propagation of LDM-based image generation. Thus, the message can be flexibly changed by using the message-encoder to generate the message-matrix, without training the LDM again. On the other hand, the security mechanism in our watermarking method defeats attacks in which users attempt to bypass the message-matrix during image generation. A series of experiments demonstrates the effectiveness and superiority of our watermarking method compared with SOTA methods.

CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing

  • Rukai Wei
  • Yu Liu
  • Jingkuan Song
  • Heng Cui
  • Yanzhao Xie
  • Ke Zhou

Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos due to less challenging and unreliable learning tasks. To address these challenges, we begin by utilizing a contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released.

SESSION: Oral Session XI: Multimedia systems -- Systems and Middleware, Transport and Delivery

Pagoda: Privacy Protection for Volumetric Video Streaming through Poisson Diffusion Model

  • Rui Lu
  • Lai Wei
  • Shuntao Zhu
  • Chuang Hu
  • Dan Wang

With the increasing popularity of 3D volumetric video applications, e.g., the metaverse and AR/VR, there is a growing need to protect users' privacy while sharing their experiences during streaming. In this paper, we show that existing privacy-preserving approaches for dense point clouds incur a massive computation cost and degrade the quality of the streaming experience. We design Pagoda, a new PrivAcy-preservinG VOlumetric ViDeo StreAming system incorporating the MPEG V-PCC standard, which protects privacy information in different domains of the dense point cloud and maintains high throughput. The core idea is to transform the privacy-sensitive attribute information into the geometry domain in a content-aware manner and to protect the geometry information in a content-agnostic manner by adding Poisson noise perturbations. These perturbations can be denoised through a Poisson diffusion probabilistic model that we design and deploy on the cloud. Users only need to encrypt a small amount of highly sensitive information to achieve secure streaming. Our design ensures that the dense point clouds can be transmitted in high quality while attackers cannot reconstruct the original ones. We evaluate Pagoda using three volumetric video datasets. The results show that Pagoda outperforms existing privacy-preserving baselines with a 75.6% improvement in protection capability, 4.27x higher streaming quality, and a 26x latency reduction.

ScaleFlow: Efficient Deep Vision Pipeline with Closed-Loop Scale-Adaptive Inference

  • Yuyang Leng
  • Renyuan Liu
  • Hongpeng Guo
  • Songqing Chen
  • Shuochao Yao

Deep visual data processing underpins many life-changing applications, such as autonomous driving and smart cities. Improving accuracy while minimizing inference time under constrained resources has been the primary pursuit for their practical adoption. Existing research has thus been devoted to either narrowing down the area of interest for detection or miniaturizing the deep learning model for faster inference. However, the former risks missing or delaying the detection of small but important objects, potentially leading to disastrous consequences (e.g., car accidents), while the latter often compromises accuracy without fully utilizing intrinsic semantic information. To overcome these limitations, we propose ScaleFlow, a closed-loop scale-adaptive inference approach that reduces model inference time by progressively processing vision data with increasing resolution but decreasing spatial size, achieving speedup without compromising accuracy. For this purpose, ScaleFlow refactors existing neural networks to be scale-equivariant on multiresolution data with the assistance of wavelet theory, providing predictable feature patterns across data resolutions. Comprehensive experiments have been conducted to evaluate ScaleFlow. The results show that ScaleFlow supports anytime inference, consistently provides 1.5× to 2.2× speedup, and saves around 25%~45% energy consumption with <1% accuracy loss on four embedded and edge platforms.

Optimizing Adaptive Video Streaming with Human Feedback

  • Tianchi Huang
  • Rui-Xiao Zhang
  • Chenglei Wu
  • Lifeng Sun

Quality of Experience (QoE)-driven adaptive bitrate (ABR) algorithms are typically optimized using QoE models based on the mean opinion score (MOS), yet such principles may not account for user heterogeneity on rating scales, resulting in unexpected behaviors. In this paper, we propose Jade, which leverages reinforcement learning with human feedback (RLHF) technologies to better align with users' opinion scores. Jade's rank-based QoE model considers the relative values of user ratings to interpret the subjective perception of video sessions. We implement linear-based and Deep Neural Network (DNN)-based architectures to satisfy both accuracy and generalization ability. We further propose entropy-aware reinforced mechanisms for training policies with the integration of the proposed QoE models. Experimental results demonstrate that Jade performs favorably on conventional metrics, such as quality and stall ratio, and improves QoE by 8.09%-38.13% under different network conditions, emphasizing the importance of user heterogeneity in QoE modeling and the potential of combining linear-based and DNN-based models for performance improvement.
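To illustrate the general notion of a rank-based QoE objective (training on within-user pairwise preferences instead of regressing raw opinion scores), here is a generic sketch with a margin ranking loss; the feature layout, network size, and margin are placeholder assumptions, not Jade's actual model.

    import torch
    import torch.nn as nn

    class RankQoE(nn.Module):
        """Scores a video session from simple playback features; trained so that
        a user's higher-rated session scores above their lower-rated one."""
        def __init__(self, in_dim=4):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))

        def forward(self, x):
            return self.net(x).squeeze(-1)

    model = RankQoE()
    rank_loss = nn.MarginRankingLoss(margin=0.5)

    # Hypothetical pairs from the same user: 'hi' sessions were rated above 'lo' ones.
    hi = torch.randn(16, 4)   # e.g. [bitrate, rebuffer_time, switches, startup_delay]
    lo = torch.randn(16, 4)
    target = torch.ones(16)   # +1 means the first argument should rank higher
    loss = rank_loss(model(hi), model(lo), target)
    loss.backward()
    print(float(loss))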

SESSION: Poster Session I: Understanding Multimedia Content -- Media Interpretation

M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

  • Hao Tang
  • Jun Liu
  • Shuanglin Yan
  • Rui Yan
  • Zechao Li
  • Jinhui Tang

Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

CUCL: Codebook for Unsupervised Continual Learning

  • Chen Cheng
  • Jingkuan Song
  • Xiaosu Zhu
  • Junchen Zhu
  • Lianli Gao
  • Hengtao Shen

The focus of this study is on Unsupervised Continual Learning (UCL), as it presents an alternative to Supervised Continual Learning, which needs high-quality manually labeled data. Experiments under the UCL paradigm reveal a phenomenon where the results on the first few tasks are suboptimal, which can render the model inappropriate for practical applications. To address this issue, after analyzing the phenomenon and identifying the lack of diversity as a vital factor, we propose a method named Codebook for Unsupervised Continual Learning (CUCL), which promotes the model to learn discriminative features to complete the class boundary. Specifically, we first introduce Product Quantization to inject diversity into the representation and apply a cross-quantized contrastive loss between the original representation and the quantized one to capture discriminative information. Then, based on the quantizer, we propose an effective Codebook Rehearsal to address catastrophic forgetting. We conduct extensive experiments on the CIFAR100, TinyImageNet, and MiniImageNet benchmark datasets. Our method significantly boosts the performance of supervised and unsupervised methods. For instance, on TinyImageNet, our method leads to relative improvements of 12.76% and 7% over Simsiam and BYOL, respectively. Code is publicly available at https://github.com/zackschen/CUCL.
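For readers unfamiliar with product quantization, here is a minimal generic sketch (not the CUCL codebook itself): split each feature into sub-vectors, learn a small k-means codebook per sub-space, and quantize by nearest codeword. The sub-space count and codebook size are placeholder choices.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(features, n_subspaces=4, n_codes=16, seed=0):
        """Learn one KMeans codebook per sub-space (product quantization)."""
        chunks = np.split(features, n_subspaces, axis=1)
        return [KMeans(n_clusters=n_codes, n_init=10, random_state=seed).fit(c) for c in chunks]

    def quantize(features, codebooks):
        """Replace each sub-vector by its nearest codeword; returns the reconstruction."""
        chunks = np.split(features, len(codebooks), axis=1)
        recon = [cb.cluster_centers_[cb.predict(c)] for cb, c in zip(codebooks, chunks)]
        return np.concatenate(recon, axis=1)

    feats = np.random.randn(512, 128).astype(np.float32)  # e.g. backbone representations
    codebooks = train_pq(feats)
    quantized = quantize(feats, codebooks)
    print(np.mean((feats - quantized) ** 2))              # quantization error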

Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning

  • Yang Liu
  • Chen Chen
  • Can Wang
  • Xulin King
  • Mengyuan Liu

Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for both 2D and 3D computer vision. Nevertheless, existing MAE-based methods still have certain drawbacks. Firstly, the functional decoupling between the encoder and decoder is incomplete, which limits the encoder's representation learning ability. Secondly, downstream tasks solely utilize the encoder, failing to fully leverage the knowledge acquired through the encoder-decoder architecture in the pretext task. In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning. The proposed method decouples functions between the decoder and the encoder by introducing a mask regressor, which predicts the masked patch representations from the visible patch representations encoded by the encoder; the decoder then reconstructs the target from the predicted masked patch representations. By doing so, we minimize the impact of decoder updates on the representation space of the encoder. Moreover, we introduce an alignment constraint to ensure that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder. To make full use of the knowledge learned in the pre-training stage, we design a new fine-tuning mode for the proposed Point-RAE. Extensive experiments demonstrate that our approach is efficient during pre-training and generalizes well on various downstream tasks. Specifically, our pre-trained models achieve a high accuracy of 90.28% on the ScanObjectNN hardest split and 94.1% accuracy on ModelNet40, surpassing all the other self-supervised learning methods. Our code and pretrained model are publicly available at: https://github.com/liuyyy111/Point-RAE.
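
A minimal sketch of the "regress before reconstruct" idea under assumed shapes: an encoder embeds visible patches, a mask regressor predicts embeddings for the masked patches, and an alignment loss ties those predictions to the encoder's own embeddings of the masked patches. Names, dimensions, and the target computation are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of a mask regressor with an alignment constraint (placeholder shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_vis, n_mask, batch = 256, 48, 16, 2

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 2)
regressor = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 1)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

vis_patches = torch.randn(batch, n_vis, dim)      # embedded visible patches
mask_patches = torch.randn(batch, n_mask, dim)    # embedded masked patches (targets)

vis_repr = encoder(vis_patches)                              # visible representations
queries = mask_token.expand(batch, n_mask, -1)               # queries for masked positions
pred_mask_repr = regressor(torch.cat([vis_repr, queries], 1))[:, n_vis:]

with torch.no_grad():                                        # encoder's view of the masked patches
    target_repr = encoder(mask_patches)

align_loss = F.mse_loss(pred_mask_repr, target_repr)         # alignment constraint
```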

CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

  • Bo Wang
  • Zhao Zhang
  • Suiyi Zhao
  • Haijun Zhang
  • Richang Hong
  • Meng Wang

Transformer-based approaches to image captioning have shown great success by utilizing long-term dependency for visual embedding. However, their coarse long-term dependency, which uses the multi-head self-attention mechanism to capture contextual interactions between visual tokens along the time step and (or) embedded dimension, fails to distinguish the fine-grained features of local partitions. In this case, similar features are captured repeatedly, which leads to feature redundancy that degrades performance. To address this issue, this paper proposes a novel image captioner that embeds visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence generated from the Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module to model the interactions between partial representations along both the time step and the embedded dimension in a refined manner. Furthermore, we formally derive the proposed cross-partition dependency and theoretically prove its correctness. Extensive comparisons on the benchmark MS-COCO dataset demonstrate the effectiveness of our method in addressing the information redundancy issue and verify its superior performance.

Generalizing Face Forgery Detection via Uncertainty Learning

  • Yanqi Wu
  • Xue Song
  • Jingjing Chen
  • Yu-Gang Jiang

Current face forgery detection methods have made significant progress in achieving high intra-dataset accuracy by building a deterministic binary detector. However, deterministic networks cannot effectively capture noise and distribution shifts in the input, which makes them less robust and prone to poor generalization in real-world scenarios. To address this problem, in this paper, we propose an Uncertainty-Aware Learning (UAL) method for face forgery detection. Specifically, we extend the Transformer model in a probabilistic manner by modeling dependencies between patches as Gaussian random variables. Additionally, we introduce a Patch Selection Module that can efficiently and accurately identify discriminative regions with high-uncertainty information, which are further utilized for final classification. Furthermore, with the quantified uncertainty of the entire image, we design a novel Uncertainty-Aware One-Center Loss that enhances intra-class compactness for genuine faces only, thereby improving the inter-class separability in the embedding space. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, and the results verify that our Uncertainty-Aware Learning method enjoys better robustness and generalization ability compared with other state-of-the-art methods.
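
To make the one-center idea concrete, here is a hedged sketch of an uncertainty-weighted one-center loss: only genuine faces are pulled toward a learnable center, with each sample's pull weighted by its predicted uncertainty. The exponential weighting and the label convention are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of an uncertainty-aware one-center loss (illustrative weighting scheme).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyAwareOneCenterLoss(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(feat_dim))    # center for genuine faces

    def forward(self, feats, labels, uncertainty):
        # feats: (B, D); labels: (B,) with 1 = genuine, 0 = forged; uncertainty: (B,)
        real = labels == 1
        if not real.any():
            return feats.new_zeros(())
        dist = (F.normalize(feats[real], dim=-1) - F.normalize(self.center, dim=-1)).pow(2).sum(-1)
        weight = torch.exp(-uncertainty[real])                # trust confident samples more
        return (weight * dist).mean()                         # forged faces are not pulled at all

criterion = UncertaintyAwareOneCenterLoss()
loss = criterion(torch.randn(8, 512), torch.randint(0, 2, (8,)), torch.rand(8))
```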

Object Detection Difficulty: Suppressing Over-aggregation for Faster and Better Video Object Detection

  • Bingqing Zhang
  • Sen Wang
  • Yifan Liu
  • Brano Kusy
  • Xue Li
  • Jiajun Liu

Current video object detection (VOD) models often encounter issues with over-aggregation due to redundant aggregation strategies, which perform feature aggregation on every frame. This results in suboptimal performance and increased computational complexity. In this work, we propose an image-level Object Detection Difficulty (ODD) metric to quantify the difficulty of detecting objects in a given image. The derived ODD scores can be used in the VOD process to mitigate over-aggregation. Specifically, we train an ODD predictor as an auxiliary head of a still-image object detector to compute the ODD score for each image based on the discrepancies between detection results and ground-truth bounding boxes. The ODD score enhances the VOD system in two ways: 1) it enables the VOD system to select superior global reference frames, thereby improving overall accuracy; and 2) it serves as an indicator in the newly designed ODD Scheduler to eliminate the aggregation of frames that are easy to detect, thus accelerating the VOD process. Comprehensive experiments demonstrate that, when utilized for selecting global reference frames, ODD-VOD consistently enhances the accuracy of Global-frame-based VOD models. When employed for acceleration, ODD-VOD consistently improves the frames per second (FPS) by an average of 73.3% across 8 different VOD models without sacrificing accuracy. When combined, ODD-VOD attains state-of-the-art performance when competing with many VOD methods in both accuracy and speed. Our work represents a significant advancement towards making VOD more practical for real-world applications. The code will be released at https://github.com/bingqingzhang/odd-vod.
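
The scheduling role of the ODD score can be pictured with a small sketch. The interface, thresholds, and the rule of preferring low-ODD frames as global references are assumptions made for illustration; the paper's actual scheduler and reference selection may differ.

```python
# Illustrative ODD-based scheduler: decide which frames keep feature aggregation
# and which frames serve as global reference frames (assumed selection rule).
from typing import List, Tuple

def odd_schedule(odd_scores: List[float],
                 easy_thresh: float = 0.3,
                 n_refs: int = 4) -> Tuple[List[int], List[int]]:
    """Return (frames_to_aggregate, global_reference_frames)."""
    # frames whose objects are hard to detect keep the (expensive) aggregation step
    aggregate = [i for i, s in enumerate(odd_scores) if s >= easy_thresh]
    # assumption: the easiest frames make the most reliable global references
    refs = sorted(range(len(odd_scores)), key=lambda i: odd_scores[i])[:n_refs]
    return aggregate, refs

agg, refs = odd_schedule([0.1, 0.8, 0.2, 0.9, 0.05, 0.4])
print(agg, refs)   # [1, 3, 5] [4, 0, 2, 5]
```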

Mutual-Guided Dynamic Network for Image Fusion

  • Yuanshen Guan
  • Ruikang Xu
  • Mingde Yao
  • Lizhi Wang
  • Zhiwei Xiong

Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem by leveraging static convolutional neural networks (CNNs), which suffer from two inherent limitations during feature extraction, i.e., being unable to handle spatially variant contents and lacking guidance from multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from different inputs and the latter generates spatially variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse local and global information of the extracted features. To further reduce the redundancy among the extracted features while simultaneously preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at: https://github.com/Guanys-dar/MGDN.
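
The notion of spatially variant (dynamic) filtering can be sketched as follows: a small predictor produces one k x k kernel per location, conditioned on both inputs, and the kernel is applied to the target features via unfold. This is a generic sketch under assumed module names and shapes, not the MGDF module itself.

```python
# Sketch of per-pixel dynamic filtering conditioned on a guidance input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilter(nn.Module):
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.k = k
        # predict one k*k kernel per spatial location from the two inputs
        self.predictor = nn.Conv2d(2 * channels, k * k, kernel_size=3, padding=1)

    def forward(self, feat, guide):
        b, c, h, w = feat.shape
        kernels = self.predictor(torch.cat([feat, guide], dim=1))        # (B, k*k, H, W)
        kernels = F.softmax(kernels, dim=1)                              # normalized weights
        patches = F.unfold(feat, self.k, padding=self.k // 2)            # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        out = (patches * kernels.view(b, 1, self.k * self.k, h * w)).sum(2)
        return out.view(b, c, h, w)

fused = DynamicFilter()(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```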

Frequency Representation Integration for Camouflaged Object Detection

  • Chenxi Xie
  • Changqun Xia
  • Tianshu Yu
  • Jia Li

Recent camouflaged object detection (COD) approaches have been proposed to accurately segment objects blended into their surroundings. The most challenging and critical issue in COD is finding the lines of demarcation between objects and background in the camouflaged environment. Because of the similarity between the target object and the background, these lines are difficult to find accurately; however, they are easier to observe in different frequency components of the image. To this end, in this paper we rethink COD from the perspective of frequency components and propose a Frequency Representation Integration Network to mine informative cues from them. Specifically, we obtain high-frequency components from the original image via a Laplacian pyramid-like decomposition, and then respectively send the image to a transformer-based encoder and the frequency components to a tailored CNN-based Residual Frequency Array Encoder. Besides, we utilize the multi-head self-attention in the transformer encoder to capture low-frequency signals, which can effectively parse the overall contextual information of camouflage scenes. We also design a Frequency Representation Reasoning Module, which progressively eliminates discrepancies between differentiated frequency representations and integrates them by modeling their point-wise relations. Moreover, to further bridge different frequency representations, we introduce an image reconstruction task to implicitly guide their integration. Extensive experiments on three widely-used COD benchmark datasets demonstrate that our method surpasses existing state-of-the-art methods by a large margin.
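
A Laplacian-pyramid-like extraction of high-frequency components can be sketched as blur, downsample, upsample, and subtract, keeping the residual. The kernel, number of levels, and function names below are illustrative assumptions, not the paper's decomposition code.

```python
# Sketch of high-frequency extraction via a Laplacian-pyramid-like decomposition.
import torch
import torch.nn.functional as F

def gaussian_blur(x):
    k = torch.tensor([1., 4., 6., 4., 1.])
    kernel = (k[:, None] * k[None, :]) / 256.0                        # 5x5 Gaussian
    kernel = kernel.to(x).expand(x.size(1), 1, 5, 5).contiguous()     # depthwise weights
    return F.conv2d(x, kernel, padding=2, groups=x.size(1))

def high_frequency(image, levels=2):
    """Return a list of high-frequency residuals, finest scale first."""
    residuals, current = [], image
    for _ in range(levels):
        blurred = gaussian_blur(current)
        down = F.avg_pool2d(blurred, 2)
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear', align_corners=False)
        residuals.append(current - up)                                # high-frequency detail
        current = down
    return residuals

hf = high_frequency(torch.rand(1, 3, 256, 256))
```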

DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation

  • Tao Wang
  • Lei Jin
  • Zhang Wang
  • Xiaojin Fan
  • Yu Cheng
  • Yinglei Teng
  • Junliang Xing
  • Jian Zhao

Multi-person pose estimation in crowded scenes remains a very challenging task. This paper finds that, in crowded scenes, most previous methods fail to estimate or group visible keypoints, rather than failing to reason about invisible ones. We thus categorize the crowded scenes into entanglement and occlusion based on the visibility of human parts and observe that entanglement is a significant problem in crowded scenes. With this observation, we propose DecenterNet, an end-to-end deep architecture to perform robust and efficient pose estimation in crowded scenes. Within DecenterNet, we introduce a decentralized pose representation that uses all visible keypoints as the root points to represent human poses, which is more robust in the entanglement area. We also propose a decoupled pose assessment mechanism, which introduces a location map to adaptively select optimal poses in the offset map. In addition, we have constructed a new dataset named SkatingPose, containing more entangled scenes. The proposed DecenterNet surpasses the best method on SkatingPose by 1.8 AP. Furthermore, DecenterNet obtains 71.2 AP and 71.4 AP on the COCO and CrowdPose datasets, respectively, demonstrating the superiority of our method. We will release our source code, trained models, and dataset to facilitate further studies in this research direction. Our code and dataset are available at https://github.com/InvertedForest/DecenterNet.

Improving Scene Graph Generation with Superpixel-Based Interaction Learning

  • Jingyi Wang
  • Can Zhang
  • Jinfa Huang
  • Botao Ren
  • Zhidong Deng

Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities utilizing box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interactions between boxes, which inadequately capture contextual semantics for relationship modeling, practically limiting the development of the field. In this paper, we take the initiative to explore and propose a generic paradigm termed Superpixel-based Interaction Learning (SIL) to remedy coarse-grained interactions at the box level. It allows us to model fine-grained interactions at the superpixel level in SGG. Specifically, (i) we treat a scene as a set of points and cluster them into superpixels representing sub-regions of the scene. (ii) We explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage. Extensive experiments on two challenging benchmarks (Visual Genome and Open Image V6) prove that our SIL enables fine-grained interaction at the superpixel level beyond previous box-level methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing box-level approaches in a plug-and-play fashion. In particular, SIL brings an average improvement of 2.0% mR (even up to 3.4%) over baselines for the PredCls task on Visual Genome, which facilitates its integration into any existing box-level method.

Lifelong Scene Text Recognizer via Expert Modules

  • Shifeng Xia
  • Lin Geng
  • Ningzhong Liu
  • Han Sun
  • Jie Qin

Scene text recognition (STR) has been actively studied in recent years, with a wide range of applications in autonomous driving, image retrieval, and more. However, when a pre-trained deep STR model learns a new task, its performance on previous tasks may drop dramatically, due to catastrophic forgetting in deep neural networks. A potential solution to combat the forgetting of prior knowledge is incremental learning (IL), which has shown its effectiveness and significant progress in image classification. Yet, exploiting IL in the context of STR has barely been explored, probably because the forgetting problem is even worse in STR. To address this issue, we propose the lifelong scene text recognizer (LSTR) that learns STR tasks incrementally while alleviating forgetting. Specifically, LSTR assigns each task a set of task-specific expert modules at different stages of an STR model, while other parameters are shared among tasks. These shared parameters are only learned in the first task and remain unchanged during subsequent learning to ensure that no learned knowledge is overwritten. Moreover, in real applications, there is no prior knowledge about which task an input image belongs to, making it impossible to precisely select the corresponding expert modules. To this end, we propose the incremental task prediction network (ITPN) to identify the most related task category by pulling the features of the same task closer and pushing those of different tasks farther apart. To validate the proposed method in our newly-introduced IL setting, we collected a large-scale dataset consisting of both real and synthetic multilingual STR data. Extensive experiments on this dataset clearly show the superiority of our LSTR over state-of-the-art IL methods.

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

  • Zhen Ye
  • Wei Xue
  • Xu Tan
  • Jie Chen
  • Qifeng Liu
  • Yike Guo

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings with a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples and codes are available at https://comospeech.github.io/.
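
For readers unfamiliar with consistency distillation, the following heavily simplified, generic sketch shows the core training signal: a student network is trained so that its outputs at adjacent noise levels along a teacher-denoised trajectory agree. The one-step Euler teacher update, the placeholder networks, and the feature sizes are all assumptions for illustration; the CoMoSpeech training pipeline is considerably more elaborate.

```python
# Generic consistency-distillation sketch (placeholder networks and schedule).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(81, 128), nn.ReLU(), nn.Linear(128, 80))  # takes [x_t, t]
ema_student = copy.deepcopy(student)          # target network, updated by EMA in practice
teacher_score = nn.Sequential(nn.Linear(81, 128), nn.ReLU(), nn.Linear(128, 80))

def f(net, x, t):                              # consistency function f(x_t, t)
    return net(torch.cat([x, t.expand(x.size(0), 1)], dim=-1))

x0 = torch.randn(4, 80)                        # clean target (e.g. a mel frame)
t_hi, t_lo = torch.tensor([0.8]), torch.tensor([0.7])
x_hi = x0 + t_hi * torch.randn_like(x0)        # noisy sample at the higher noise level

# one teacher ODE (Euler) step from t_hi down to t_lo
with torch.no_grad():
    x_lo = x_hi + (t_lo - t_hi) * teacher_score(torch.cat([x_hi, t_hi.expand(4, 1)], -1))

# consistency loss: student at t_hi should match the EMA student at t_lo
loss = F.mse_loss(f(student, x_hi, t_hi), f(ema_student, x_lo, t_lo).detach())
```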

Exploring Motion Cues for Video Test-Time Adaptation

  • Runhao Zeng
  • Qi Deng
  • Huixuan Xu
  • Shuaicheng Niu
  • Jian Chen

Test-time adaptation (TTA) aims at boosting the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Though TTA on image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively introducing image-based TTA methods into video tasks may achieve limited performance, since these methods do not consider the special nature of video tasks, e.g., the motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features using two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme to align fast motion and slow appearance features, thereby enhancing the motion encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy to learn a joint feature space for fast- and slow-sampled clips, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme to provide better adaptation supervision by selecting a more reliable pseudo-negative label compared to the pseudo-positive label used in prior TTA methods. This technique reduces the adaptation difficulty often caused by poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.
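
One way to picture pseudo-negative supervision at test time is sketched below: a class sampled from the lowest-probability predictions is treated as a reliable negative, and the model is adapted to push its probability further down. The sampling pool size and the loss form are assumptions for illustration, not the paper's exact scheme.

```python
# Sketch of a stochastic pseudo-negative loss for test-time adaptation.
import torch
import torch.nn.functional as F

def pseudo_negative_loss(logits, k_bottom=5):
    """logits: (B, C) predictions on a test clip."""
    probs = F.softmax(logits, dim=-1)
    # restrict sampling to the k least-likely classes for each sample
    bottom_idx = probs.topk(k_bottom, dim=-1, largest=False).indices        # (B, k)
    choice = torch.randint(0, k_bottom, (logits.size(0), 1), device=logits.device)
    neg_class = bottom_idx.gather(1, choice).squeeze(1)                     # sampled negatives
    neg_prob = probs.gather(1, neg_class.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - neg_prob + 1e-8).mean()                         # suppress the negative

loss = pseudo_negative_loss(torch.randn(4, 400))   # e.g. Kinetics-400 logits
```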

Perceiving Ambiguity and Semantics without Recognition: An Efficient and Effective Ambiguous Scene Text Detector

  • Yan Shu
  • Wei Wang
  • Yu Zhou
  • Shaohui Liu
  • Aoting Zhang
  • Dongbao Yang
  • Weiping Wang

Ambiguous scene text detection is an extremely challenging task. Existing text detectors that rely solely on visual cues often suffer from confusion when texts are evenly distributed in rows/columns, or from incomplete detection owing to large character spacing. To overcome these challenges, the previous method recognizes a large number of proposals and utilizes semantic information predicted from recognition results to eliminate ambiguity. However, this method is inefficient, which limits its practical application. In this paper, we propose a novel efficient and effective ambiguous text detector, which can Perceive Ambiguity and SEmantics without Recognition, termed PASER. On the one hand, PASER can perceive semantics without recognition with a light Perceiving Semantics (PerSem) module. In this way, proposals without reasonable semantics are filtered out, which largely speeds up the overall detection process. On the other hand, to detect both ambiguous and regular texts with a unified framework, PASER employs a Perceiving Ambiguity (PerAmb) module to distinguish ambiguous texts from regular texts, so that only the ambiguous proposals are processed by PerSem while the regular texts are not, which further ensures high efficiency. Extensive experiments show that our detector achieves state-of-the-art results on both ambiguous and regular scene text detection benchmarks. Notably, it simultaneously achieves over 6 times faster speed and superior accuracy on TDA-ReCTS.

Single-Stage Multi-human Parsing via Point Sets and Center-based Offsets

  • Jiaming Chu
  • Lei Jin
  • Xiaojin Fan
  • Yinglei Teng
  • Yunchao Wei
  • Yuqiang Fang
  • Junliang Xing
  • Jian Zhao

This work studies the multi-human parsing problem. Existing methods, either following top-down or bottom-up two-stage paradigms, usually involve expensive computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and parts. SMP leverages the point features in the barycenter positions to obtain their segmentation and then generates a series of offsets from the barycenter of the human body to the barycenters of parts, thus performing human body and parts matching without the grouping process. Within the SMP architecture, we propose a Refined Feature Retain module to extract the global feature of instances through generated mask attention and a Mask of Interest Reclassify module as a trainable plug-in module to refine the classification results with the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the effectiveness and efficiency of the proposed method, surpassing the state-of-the-art method by 2.1% in AP50p, 1.0% in APvolp, and 1.2% in PCP50. Moreover, SMP also achieves superior performance on DensePose-COCO, verifying the generalization of the model. In particular, the proposed method requires fewer training epochs and a less complex model architecture. Our codes are released at https://github.com/cjm-sfw/SMP.

Partitioned Saliency Ranking with Dense Pyramid Transformers

  • Chengxiao Sun
  • Yan Xu
  • Jialun Pei
  • Haopeng Fang
  • He Tang

In recent years, saliency ranking has emerged as a challenging task focusing on assessing the degree of saliency at instance-level. Being subjective, even humans struggle to identify the precise order of all salient instances. Previous approaches undertake the saliency ranking by directly sorting the rank scores of salient instances, which have not explicitly resolved the inherent ambiguities. To overcome this limitation, we propose the ranking by partition paradigm, which segments unordered salient instances into partitions and then ranks them based on the correlations among these partitions. The ranking by partition paradigm alleviates ranking ambiguities in a general sense, as it consistently improves the performance of other saliency ranking models. Additionally, we introduce the Dense Pyramid Transformer (DPT) to enable global cross-scale interactions, which significantly enhances feature interactions with reduced computational burden. Extensive experiments demonstrate that our approach outperforms all existing methods. The code for our method is available at https://github.com/ssecv/PSR.

CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation

  • Jianbiao Mei
  • Yu Yang
  • Mengmeng Wang
  • Zizhang Li
  • Xiaojun Hou
  • Jongwon Ra
  • Laijian Li
  • Yong Liu

This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospects in autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embeddings, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embeddings and around the centers. Moreover, we generate the kernel weights based on the enhanced center feature embeddings and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.

Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided Enhancement

  • Zhenhua Ning
  • Zhuotao Tian
  • Guangming Lu
  • Wenjie Pei

Although extensive research has been conducted on 3D point cloud segmentation, effectively adapting generic models to novel categories remains a formidable challenge. This paper proposes a novel approach to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods that directly utilize categorical information from support prototypes to recognize novel classes in query samples, our method identifies two critical aspects that substantially enhance model performance by reducing contextual gaps between support prototypes and query features. Specifically, we (1) adapt support background prototypes to match the query context while removing extraneous cues that may obscure foreground and background in query samples, and (2) holistically rectify support prototypes under the guidance of query features so that they behave as if there were no semantic gap to the query targets. Our proposed designs are agnostic to the feature extractor, rendering them readily applicable to any prototype-based methods. The experimental results on S3DIS and ScanNet demonstrate notable practical benefits, as our approach achieves significant improvements while still maintaining high efficiency. The code for our approach is available at https://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentatio...

PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptative Semantic Segmentation

  • Mu Chen
  • Zhedong Zheng
  • Yi Yang
  • Tat-Seng Chua

Unsupervised Domain Adaptation (UDA) aims to enhance the generalization of the learned model to other domains. The domain-invariant knowledge is transferred from the model trained on labeled source domain, e.g., video game, to unlabeled target domains, e.g., real-world scenarios, saving annotation expenses. Existing UDA methods for semantic segmentation usually focus on minimizing the inter-domain discrepancy of various levels, e.g., pixels, features, and predictions, for extracting domain-invariant knowledge. However, the primary intra-domain knowledge, such as context correlation inside an image, remains under-explored. In an attempt to fill this gap, we revisit the current pixel contrast in semantic segmentation and propose a unified pixel- and patch-wise self-supervised learning framework, called PiPa, for domain adaptive semantic segmentation that facilitates intra-image pixel-wise correlations and patch-wise semantic consistency against different contexts. The proposed framework exploits the inherent structures of intra-domain images, which: (1) explicitly encourages learning the discriminative pixel-wise features with intra-class compactness and inter-class separability, and (2) motivates the robust feature learning of the identical patch against different contexts or fluctuations. Extensive experiments verify the effectiveness of the proposed method, which obtains competitive accuracy on the two widely-used UDA benchmarks, e.g., 75.6 mIoU on GTA→Cityscapes and 68.2 mIoU on Synthia→Cityscapes. Moreover, our method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.

Weakly-Supervised Text Instance Segmentation

  • Xinyan Zu
  • Haiyang Yu
  • Bin Li
  • Xiangyang Xue

Text segmentation is a challenging computer vision task with many downstream applications. Current text segmentation models need to be trained with pixel-level annotations, which requires substantial labeling effort. In this paper, we make the first attempt to perform weakly-supervised text instance segmentation by bridging text recognition and text segmentation. We observe that text recognition models are able to produce the attention localization of each text instance. Based on this observation, we propose a two-stage Text Adaptive Refinement (TAR) module to generate the pseudo labels based on the attention map of a text recognizer. Meanwhile, we develop a text segmentation module to take the rough attention location as input to predict segmentation masks, which are supervised by the aforementioned pseudo labels. In addition, we introduce a mask-augmented contrastive learning by treating the segmentation result as an augmented version of the input text image, thus improving the visual representation and further enhancing the performance of both recognition and segmentation. The experimental results demonstrate that the proposed method outperforms the state-of-the-art (SOTA) weakly-supervised generic segmentation methods by 18.95% and 17.80% in fgIoU on ICDAR13-FST and TextSeg. On MLT-S, COCO-TS and Total-Text, the proposed method achieves about 82% of the fully-supervised methods' performance. When evaluated on instance segmentation, the proposed method exceeds existing SOTA methods by 23.32% and 21.34% on ICDAR13-FST and TextSeg, respectively. Code and Supplementary Materials are available at https://github.com/FudanVI/FudanOCR/tree/main/weakly-text-segmentation.

PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

  • Wenjie Xuan
  • Shanshan Zhao
  • Yu Yao
  • Juhua Liu
  • Tongliang Liu
  • Yixin Chen
  • Bo Du
  • Dacheng Tao

Relying on large-scale training data with pixel-level labels, previous edge detection methods have achieved high performance. However, it is hard to manually label edges accurately, especially for large datasets, and thus the datasets inevitably contain noisy labels. This label-noise issue has been studied extensively for classification, while still remaining under-explored for edge detection. To address the label-noise issue for edge detection, this paper proposes to learn Pixel-level Noise Transitions to model the label-corruption process. To achieve it, we develop a novel Pixel-wise Shift Learning (PSL) module to estimate the transition from clean to noisy labels as a displacement field. Exploiting the estimated noise transitions, our model, named PNT-Edge, is able to fit the prediction to clean labels. In addition, a local edge density regularization term is devised to exploit local structure information for better transition learning. This term encourages learning large shifts for the edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of our method in relieving the impact of label noise. Codes will be available at github.com/DREAMXFAR/PNT-Edge.

Video Frame Interpolation with Flow Transformer

  • Pan Gao
  • Haoyue Tian
  • Jie Qin

Video frame interpolation has been actively studied with the development of convolutional neural networks. However, due to the intrinsic limitations of kernel weight sharing in convolution, the interpolated frame generated by it may lose details. In contrast, the attention mechanism in Transformer can better distinguish the contribution of each pixel, and it can also capture long-range pixel dependencies, which provides great potential for video interpolation. Nevertheless, the original Transformer is commonly used for 2D images; how to develop a Transformer-based framework with consideration of temporal self-attention for video frame interpolation remains an open issue. In this paper, we propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism. Specifically, we design a Flow Transformer Block that calculates the temporal self-attention in a matched local area with the guidance of flow, making our framework suitable for interpolating frames with large motion while maintaining reasonably low complexity. In addition, we construct a multi-scale architecture to account for multi-scale motion, further improving the overall performance. Extensive experiments on three benchmarks demonstrate that the proposed method can generate interpolated frames with better visual quality than state-of-the-art methods.

DUSA: Decoupled Unsupervised Sim2Real Adaptation for Vehicle-to-Everything Collaborative Perception

  • Xianghao Kong
  • Wentao Jiang
  • Jinrang Jia
  • Yifeng Shi
  • Runsheng Xu
  • Si Liu

Vehicle-to-Everything (V2X) collaborative perception is crucial for the advancement of autonomous driving. However, achieving high-precision V2X perception requires a significant amount of annotated real-world data, which is often expensive and hard to acquire. Simulated data have attracted much attention since they can be massively produced at an extremely low cost. Nevertheless, the significant domain gap between simulated and real-world data, including differences in sensor type, reflectance patterns, and road surroundings, often leads to poor performance of models trained on simulated data when evaluated on real-world data. In addition, there remains a domain gap between real-world collaborative agents, e.g., different types of sensors may be installed on autonomous vehicles and roadside infrastructures with different extrinsics, further increasing the difficulty of sim2real generalization. To take full advantage of simulated data, we present a new unsupervised sim2real domain adaptation method for V2X collaborative detection named Decoupled Unsupervised Sim2Real Adaptation (DUSA). Our new method decouples the V2X collaborative sim2real domain adaptation problem into two sub-problems: sim2real adaptation and inter-agent adaptation. For sim2real adaptation, we design a Location-adaptive Sim2Real Adapter (LSA) module to adaptively aggregate features from critical locations of the feature map and align the features between simulated data and real-world data via a sim/real discriminator on the aggregated global feature. For inter-agent adaptation, we further devise a Confidence-aware Inter-agent Adapter (CIA) module to align the fine-grained features from heterogeneous agents under the guidance of agent-wise confidence maps. Experiments demonstrate the effectiveness of the proposed DUSA approach on unsupervised sim2real adaptation from the simulated V2XSet dataset to the real-world DAIR-V2X-C dataset.
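
As background for the sim/real discriminator mentioned above, here is a generic sketch of adversarial feature alignment with a gradient reversal layer and a domain discriminator, a standard building block of this kind of adaptation. It is not the LSA/CIA design from the paper, which operates on location-adaptive aggregated features and confidence maps.

```python
# Generic sim/real adversarial alignment sketch (gradient reversal + discriminator).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None          # reverse the gradient for the feature extractor

discriminator = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

sim_feat, real_feat = torch.randn(4, 256), torch.randn(4, 256)
feats = torch.cat([sim_feat, real_feat])
domain = torch.cat([torch.zeros(4, 1), torch.ones(4, 1)])     # 0 = sim, 1 = real

# the reversed gradient pushes the features toward being domain-invariant
adv_loss = bce(discriminator(GradReverse.apply(feats, 1.0)), domain)
```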

Explicifying Neural Implicit Fields for Efficient Dynamic Human Avatar Modeling via a Neural Explicit Surface

  • Ruiqi Zhang
  • Jie Chen
  • Qiang Wang

This paper proposes a technique for efficiently modeling dynamic humans by explicifying the implicit neural fields via a Neural Explicit Surface (NES). Implicit neural fields have advantages over traditional explicit representations in modeling dynamic 3D content from sparse observations and effectively representing complex geometries and appearances. Implicit neural fields defined in 3D space, however, are expensive to render due to the need for dense sampling during volumetric rendering. Moreover, their memory efficiency can be further optimized when modeling sparse 3D space. To overcome these issues, the paper proposes utilizing Neural Explicit Surface (NES) to explicitly represent implicit neural fields, facilitating memory and computational efficiency. To achieve this, the paper creates a fully differentiable conversion between the implicit neural fields and the explicit rendering interface of NES, leveraging the strengths of both implicit and explicit approaches. This conversion enables effective training of the hybrid representation using implicit methods and efficient rendering by integrating the explicit rendering interface with a newly proposed rasterization-based neural renderer that only incurs a texture color query once for the initial ray interaction with the explicit surface, resulting in improved inference efficiency. NES describes dynamic human geometries with pose-dependent neural implicit surface deformation fields and their dynamic neural textures both in 2D space, which is a more memory-efficient alternative to traditional 3D methods, reducing redundancy and computational load. The comprehensive experiments show that NES performs similarly to previous 3D approaches, with greatly improved rendering speed and reduced memory cost.

MVFlow: Deep Optical Flow Estimation of Compressed Videos with Motion Vector Prior

  • Shili Zhou
  • Xuhao Jiang
  • Weimin Tan
  • Ruian He
  • Bo Yan

In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and have achieved promising results. However, they hardly consider that most videos are compressed and thus ignore the pre-computed information in compressed video streams. Motion vectors, part of the compression information, record the motion of the video frames. They can be directly extracted from the compressed code stream without computational cost and serve as a solid prior for optical flow estimation. Therefore, we propose an optical flow model, MVFlow, which uses motion vectors to improve the speed and accuracy of optical flow estimation for compressed videos. In detail, MVFlow includes a key Motion-Vector Converting Module, which ensures that the motion vectors can be transformed into the same domain as optical flow and then be fully utilized by the flow estimation module. Meanwhile, we construct four optical flow datasets for compressed videos containing frames and motion vectors in pairs. The experimental results demonstrate the superiority of our proposed MVFlow, which can reduce the AEPE by 1.09 compared to existing models, or save 52% of the time needed to achieve accuracy similar to existing models.
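
The basic conversion from block-level motion vectors to a dense flow-like prior can be sketched as scaling and upsampling. The block size, the quarter-pel unit convention, and the nearest upsampling below are assumptions typical of video codecs, not necessarily the exact design of the paper's Motion-Vector Converting Module.

```python
# Sketch: convert block-level motion vectors into a dense optical-flow-like prior.
import torch
import torch.nn.functional as F

def mv_to_flow_prior(motion_vectors, block_size=16, quarter_pel=True):
    """motion_vectors: (B, 2, H/block, W/block) as stored in the bitstream."""
    flow = motion_vectors.float()
    if quarter_pel:                    # assumption: MVs are stored in quarter-pixel units
        flow = flow / 4.0
    # nearest upsampling keeps one vector per block, matching codec semantics
    return F.interpolate(flow, scale_factor=block_size, mode='nearest')

prior = mv_to_flow_prior(torch.randint(-32, 32, (1, 2, 16, 28)))
print(prior.shape)                    # torch.Size([1, 2, 256, 448])
```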

Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation

  • Ri Cheng
  • Xuhao Jiang
  • Ruian He
  • Shili Zhou
  • Weimin Tan
  • Bo Yan

The video frame interpolation (VFI) model applies the convolution operation to all locations, leading to redundant computations in regions with easy motion. We can use dynamic spatial pruning method to skip redundant computation, but this method cannot properly identify easy regions in VFI tasks without supervision. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architecture to skip redundant computation for efficient frame interpolation dynamically. Specifically, pixels with low uncertainty indicate easy regions, where the calculation can be reduced without bringing undesirable visual results. Therefore, we utilize uncertainty-generated mask labels to guide our UGSP in properly locating the easy region. Furthermore, we propose a self-contrast training strategy that leverages an auxiliary non-pruning branch to improve the performance of our UGSP. Extensive experiments show that UGSP maintains performance but reduces FLOPs by 34%/52%/30% compared to baseline without pruning on Vimeo90K/UCF101/MiddleBury datasets. In addition, our method achieves state-of-the-art performance with lower FLOPs on multiple benchmarks.

Learning Generalized Representations for Open-Set Temporal Action Localization

  • Junshan Hu
  • Liansheng Zhuang
  • Weisong Dong
  • Shiming Ge
  • Shafei Wang

Open-set Temporal Action Localization (OSTAL) is a critical and challenging task that aims to recognize and temporally localize human actions in untrimmed videos in open-world scenarios. The main challenge in this task is the knowledge transfer from known actions to unknown actions. However, existing methods utilize limited training data and overparameterized deep neural networks, which generalize poorly. This paper proposes a novel Generalized OSTAL model (namely GOTAL) to learn generalized representations of actions. GOTAL utilizes a Transformer network to model actions and an open-set detection head to perform action localization and recognition. Benefiting from the Transformer's temporal modeling capabilities, GOTAL facilitates the extraction of human motion information from videos to mitigate the effects of irrelevant background data. Furthermore, a sharpness minimization algorithm is used to learn the network parameters of GOTAL, which facilitates the convergence of network parameters towards flatter minima by simultaneously minimizing the training loss value and the sharpness of the loss landscape. The collaboration of the above components significantly enhances the generalization of the representation. Experimental results demonstrate that GOTAL achieves state-of-the-art performance on the THUMOS14 and ActivityNet1.3 benchmarks, confirming the effectiveness of our proposed method.

Unambiguous Object Tracking by Exploiting Target Cues

  • Jie Gao
  • Bineng Zhong
  • Yan Chen

Siamese tracking exploits the template and the search region features to adaptively locate arbitrary objects during tracking. A noteworthy issue is that both foreground and background are mixed in the template, and thus a tracker needs to learn what the target is and which pixels belong to it. However, existing trackers cannot effectively exploit the template information, resulting in a deficiency of target information and causing confusion for the tracker regarding which pixels belong to the target. To alleviate this issue, we propose UTrack, a simple and effective algorithm for unambiguous object tracking. UTrack utilizes long-term contextual information to propagate the appearance state of the target so as to explicitly model the appearance information of the target. Additionally, UTrack can resist appearance changes of the target by leveraging the target cues. Moreover, the proposed method uses the refined template to obtain more detailed information about the target and better understand which pixels belong to the target. Extensive experiments and comparisons with competitive trackers on challenging large-scale benchmarks show that our tracker can achieve state-of-the-art performance while running in real time. In particular, UTrack achieves 77.7% AO on GOT-10k.

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

  • Keran Wang
  • Hongtao Xie
  • Yuxin Wang
  • Dongming Zhang
  • Yadong Qu
  • Zuan Gao
  • Yongdong Zhang

Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) limited annotated real data reduces feature robustness; 2) detectors perform poorly on text lacking visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves the linguistic reasoning ability for text occlusion. Different from previous random pixel-level masking methods, MTM performs a targeted text-aware masking process in an unsupervised manner. Specifically, MTM consists of text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text area, and the Masked Text Modeling Module is used to reconstruct the masked texts. Through this reconstruction, MTM learns to reason about the linguistic information of masked texts. The robust features learned by MTM ensure a more discriminative representation for text lacking visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that our MTM can boost the performance of existing text detectors.

Object Part Parsing with Hierarchical Dual Transformer

  • Jiamin Chen
  • Jianlou Si
  • Naihao Liu
  • Yao Wu
  • Li Niu
  • Chen Qian

Object part parsing involves segmenting objects into semantic parts, which has drawn great attention recently. The current methods ignore the specific hierarchical structure of the object, which can be used as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to explore the contribution of the typical structural priors of the object parts. HDTR first generates the pyramid multi-granularity pixel representations under the supervision of the object part parsing maps at different semantic levels and then assigns each region an initial part embedding. Moreover, HDTR generates an edge pixel representation to extend the capability of the network to capture detailed information. Afterward, we design a Hierarchical Part Transformer to upgrade the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer to infer the hierarchical information from the part embeddings to enrich the pixel representations. Note that both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. The experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP and Pascal Animal, demonstrate that our method sets a new state-of-the-art performance for object part parsing.

Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

  • Xugong Qin
  • Pengyuan Lyu
  • Chengquan Zhang
  • Yu Zhou
  • Kun Yao
  • Peng Zhang
  • Hailun Lin
  • Weiping Wang

Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-up segmentation-based methods begin to be mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning in which auxiliary tasks are utilized to enable the encoder to jointly learn robust features with the main task of per-pixel classification during optimization. For semantic representation learning, we propose global-dense semantic contrast (GDSC), in which a vector is extracted for global semantic representation, then used to perform element-wise contrast with the dense grid features. To learn instance-aware representation, we propose to combine top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. With the proposed GDSC and TDM, the encoder network learns stronger representation without introducing any parameters and computations during inference. Equipped with a very light decoder, the detector can achieve more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method can outperform or be comparable to the state-of-the-art on both accuracy and speed. Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.
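
To illustrate the shape of a global-dense semantic contrast, the sketch below pools a global text vector from the feature map and contrasts it against every grid feature, pulling text locations toward it and pushing background away. The pooling choice, temperature, and loss form are assumptions for illustration and not necessarily the paper's GDSC formulation.

```python
# Sketch of a global-dense semantic contrast between a pooled global text vector
# and dense grid features.
import torch
import torch.nn.functional as F

def global_dense_semantic_contrast(feats, text_mask, tau=0.1):
    """feats: (B, C, H, W); text_mask: (B, 1, H, W) binary ground-truth text region."""
    # global semantic vector pooled over text pixels (clamp avoids division by zero)
    denom = text_mask.sum(dim=(2, 3)).clamp_min(1.0)                      # (B, 1)
    global_vec = (feats * text_mask).sum(dim=(2, 3)) / denom              # (B, C)
    global_map = global_vec[:, :, None, None].expand_as(feats)
    sim = F.cosine_similarity(feats, global_map, dim=1)                   # (B, H, W)
    # dense element-wise contrast: text pixels should agree with the global vector
    return F.binary_cross_entropy_with_logits(sim / tau, text_mask.squeeze(1))

loss = global_dense_semantic_contrast(torch.randn(2, 64, 32, 32),
                                       (torch.rand(2, 1, 32, 32) > 0.7).float())
```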

Towards Flexible and Universal: A Novel Endpoint-based Framework for Vessel Structural Information Extraction

  • Xiyao Ma
  • Shiqi Liu
  • Xiaoliang Xie
  • Xiaohu Zhou
  • Zengguang Hou
  • Xinkai Qu
  • Wenzheng Han
  • Ming Wang
  • Meng Song
  • Linsen Zhang

In computer-assisted intravascular interventional surgery, extracting detailed information of target vessels from X-ray angiographic images can be meaningful in improving safety and effectiveness. However, large amounts of effort have been dedicated to segmenting the whole blood vessels from the background while ignoring the internal structure, which is limited in clinical application. In this paper, we propose a flexible and universal endpoint-based framework for vessel structural information extraction. The framework first localizes all the endpoints of target vessel segments through a Coarse-to-Fine Keypoint Detection Network (CFKD-Net), in which the designed Multi-branch Feature Aggregation (MFA) module captures both in-patch and cross-patch information to help recognize the points of interest based on global structure. A novel MaskMSELoss is also proposed to disambiguate those irrelevant responses. Then a designed VEssel Segmentation and Analysis (VESA) algorithm will generate the segmentation mask and morphological analysis for each vessel segment simply based on the endpoints. It can also be flexibly applied to analyze variant blood vessels that are not pre-defined. Extensive experiments on two different coronary artery datasets consistently demonstrate that this framework can achieve state-of-the-art detection performance and successfully extract and analyze target vessel segments. Since the framework shows excellent performance on coronary arteries with severe deformation and strong noise, it is highly promising for analyzing other vascular images.

FDCNet: Feature Drift Compensation Network for Class-Incremental Weakly Supervised Object Localization

  • Sejin Park
  • Taehyung Lee
  • Yeejin Lee
  • Byeongkeun Kang

This work addresses the task of class-incremental weakly supervised object localization (CI-WSOL). The goal is to incrementally learn object localization for novel classes using only image-level annotations while retaining the ability to localize previously learned classes. This task is important because annotating bounding boxes for every new incoming data is expensive, although object localization is crucial in various applications. To the best of our knowledge, we are the first to address this task. Thus, we first present a strong baseline method for CI-WSOL by adapting the strategies of class-incremental classifiers to mitigate catastrophic forgetting. These strategies include applying knowledge distillation, maintaining a small data set from previous tasks, and using cosine normalization. We then propose the feature drift compensation network to compensate for the effects of feature drifts on class scores and localization maps. Since updating network parameters to learn new tasks causes feature drifts, compensating for the final outputs is necessary. Finally, we evaluate our proposed method by conducting experiments on two publicly available datasets (ImageNet-100 and CUB-200). The experimental results demonstrate that the proposed method outperforms other baseline methods.

Collaborative Learning of Diverse Experts for Source-free Universal Domain Adaptation

  • Meng Shen
  • Yanzuo Lu
  • Yanxu Hu
  • Andy J. Ma

Source-free universal domain adaptation (SFUniDA) is a challenging yet practical problem that adapts the source model to the target domain in the presence of distribution and category shifts without accessing source domain data. Most existing methods are developed based on a single-expert target model for both known- and unknown-class data training, such that the known- and unknown-class data in the target domain may not be separated well from each other. To address this issue, we propose a novel Collaborative Learning of Diverse Experts (CoDE) method for SFUniDA. In our method, unknown-class compatible source model training is designed to reserve space for the potential target unknown-class data. Two diverse experts are learned to better recognize the target known- and unknown-class data respectively by the specialized entropy discrimination. We improve the transferability of both experts by collaboratively correcting the possible misclassification errors with consistency and diversity learning. The final prediction with high confidence is obtained by gating the diverse experts based on soft neighbor density. Extensive experiments on four publicly available benchmarks demonstrate the superiority of our method compared to the state of the art.

Read Ten Lines at One Glance: Line-Aware Semi-Autoregressive Transformer for Multi-Line Handwritten Mathematical Expression Recognition

  • Wentao Yang
  • Zhe Li
  • Dezhi Peng
  • Lianwen Jin
  • Mengchao He
  • Cong Yao

Handwritten Mathematical Expression Recognition (HMER) plays a critical role in various applications, such as digitized education and scientific research. Although existing methods have achieved promising performance on publicly available datasets, they still struggle to recognize multi-line mathematical expressions (MEs), suffering from complex structures and slow inference speed. To address these issues, we propose a Line-Aware Semi-autoregressive Transformer (LAST) that treats multi-line mathematical expression sequences as two-dimensional dual-end structures. The proposed LAST utilizes a line-wise dual-end decoding strategy to decode multi-line mathematical expressions in parallel and perform dual-end decoding within each line. Specifically, we introduce a line-aware positional encoding module and a line-partitioned dual-end mask to endow LAST with line order awareness and directionality. Additionally, we adopt a shared-task optimization strategy to train LAST in both autoregressive and semi-autoregressive tasks. To evaluate the effectiveness of our approach in real-world scenarios, we have built a new Multi-line Mathematical Expression dataset (M2E), which, to the best of our knowledge, is the first of its kind and boasts the largest character category set, the largest number of character samples, and the longest average sequence length compared to existing ME datasets. Experimental results on both the M2E dataset and publicly available datasets demonstrate the effectiveness of our proposed method. Notably, our semi-autoregressive decoding approach achieves significantly faster decoding speeds while still achieving state-of-the-art performance compared to the existing methods.

Beyond Domain Gap: Exploiting Subjectivity in Sketch-Based Person Retrieval

  • Kejun Lin
  • Zhixiang Wang
  • Zheng Wang
  • Yinqiang Zheng
  • Shin'ichi Satoh

Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by cameras and therefore needs to be retrieved using subjective information (e.g., sketches from witnesses). Previous research defines this case using the sketch as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. Actually, subjectivity is another significant challenge. We model and investigate it by introducing a new dataset with multi-witness descriptions. It features two aspects. 1) Large-scale. It contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style. Our dataset offers multiple sketches for each identity. Witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch styles. We further have two novel designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity. We propose a non-local (NL) fusion module that gathers sketches from different witnesses for the same identity. 2) Introducing objectivity. An AttrAlign module utilizes attributes as an implicit mask to align cross-domain features. To push forward the advance of Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate our leading performance in these benchmarks. Dataset and Codes are publicly available at: https://github.com/Lin-Kayla/subjectivity-sketch-reid

Rethinking Pseudo-Label-Based Unsupervised Person Re-ID with Hierarchical Prototype-based Graph

  • Ben Sha
  • Baopu Li
  • Tao Chen
  • Jiayuan Fan
  • Tao Sheng

Unsupervised person re-identification (Re-ID) aims to match individuals without manual annotations. However, existing methods often struggle with intra-class variations due to differences in person poses and camera styles such as resolution and environment information. Additionally, clustering may produce incorrect pseudo-labels, compounding the issue. To address these challenges, we propose a novel hierarchical prototype-based graph network (HPG-Net) for unsupervised person Re-ID. Our approach uses a hierarchical prototype-based graph structure to describe person images by attributes of poses and camera styles, with each graph node representing the average of image features as a prototype. We then apply a hierarchical contrastive learning module to enhance the feature learning at each level, reducing the impact of intra-class differences caused by extraneous attributes. We also calculate the similarity between samples and each level of prototypes, maintaining prototype-based graph consistency with the mean-teacher network to mitigate the accumulation errors caused by pseudo-labels. Experimental results on three benchmarks show that our method outperforms state-of-the-art (SOTA) works. Moreover, we achieve promising performance on an occluded dataset.

Single Domain Generalization via Unsupervised Diversity Probe

  • Kehua Guo
  • Rui Ding
  • Tian Qiu
  • Xiangyuan Zhu
  • Zheng Wu
  • Liwei Wang
  • Hui Fang

Single domain generalization (SDG) is a realistic yet challenging domain generalization scenario that aims to generalize a model trained on a single domain to multiple unseen domains. Typical SDG methods are essentially supervised data augmentation strategies, which tend to enhance the novelty rather than the diversity of augmented samples. Insufficient diversity may jeopardize the model generalization ability. In this paper, we propose a novel adversarial method, termed Unsupervised Diversity Probe (UDP), to synthesize novel and diverse samples in fully unsupervised settings. More specifically, to ensure that samples are novel, we study SDG from an information-theoretic perspective that minimizes the uncertainty coefficients between synthesized and source samples. Considering that the variation in a single source domain is limited, we introduce a regularization imposed on the auxiliary module that synthesizes variable samples, incorporated with uncertainty coefficients in an adversarial manner to complement the diversity. Subsequently, an available region is utilized to guarantee the samples' safety. For the network architecture, we design a simple probe module that can synthesize samples in several different aspects. UDP is an unsupervised and easy-to-implement method that solves SDG using only synthetic (source) samples, thus reducing the dependence on task models. Extensive experiments on three benchmark datasets show that UDP achieves remarkable results and outperforms existing supervised and unsupervised methods by a large margin in single domain generalization.

PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer

  • Ruijin Liu
  • Ning Lu
  • Dapeng Chen
  • Cheng LI
  • Zejian Yuan
  • Wei Peng

We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, the Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, which can capture a text with a complex shape by varying polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-points-based methods need to utilize a different number of points. 2) It can distinguish adjacent or overlapping texts as they have apparently different curve coefficients, while segmentation-based or points-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. This simple operation can help detect small-scale texts and is compatible with the one-stage DETR framework, where no post-processing such as NMS is required. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces the piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on the arbitrary-shaped text datasets. Code will be made public.
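To make the Polynomial Band idea concrete, the snippet below samples a closed text contour directly from four polynomial coefficient vectors, with no interpolation step. The parameterization (top/bottom as y = f(x), left/right as x = g(y)) and the curve degrees are assumptions chosen for illustration, not PBFormer's exact formulation.

```python
import numpy as np

def sample_polynomial_band(top, bottom, left, right, n=20):
    """Illustrative sketch: turn four polynomial curves into a sampled text contour.

    `top` and `bottom` are coefficient vectors of y = f(x) polynomials (numpy
    polyval order, highest degree first); `left` and `right` are x = g(y)
    polynomials closing the band at its two ends. All conventions here are
    assumptions for illustration only.
    """
    xs = np.linspace(0.0, 1.0, n)                      # normalized horizontal positions
    top_pts = np.stack([xs, np.polyval(top, xs)], 1)   # points along the top side
    bot_pts = np.stack([xs, np.polyval(bottom, xs)], 1)

    ys_l = np.linspace(np.polyval(top, 0.0), np.polyval(bottom, 0.0), n)
    left_pts = np.stack([np.polyval(left, ys_l), ys_l], 1)
    ys_r = np.linspace(np.polyval(top, 1.0), np.polyval(bottom, 1.0), n)
    right_pts = np.stack([np.polyval(right, ys_r), ys_r], 1)

    # Walk the boundary clockwise: top, right, reversed bottom, reversed left.
    return np.concatenate([top_pts, right_pts, bot_pts[::-1], left_pts[::-1]], 0)

# A gently curved band: cubic top/bottom, constant left/right ends.
contour = sample_polynomial_band(
    top=[0.2, -0.3, 0.1, 0.2], bottom=[0.2, -0.3, 0.1, 0.5],
    left=[0.0, 0.0], right=[0.0, 1.0], n=25)
print(contour.shape)  # (100, 2) smooth contour points, no interpolation step
```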

DANet: Multi-scale UAV Target Detection with Dynamic Feature Perception and Scale-aware Knowledge Distillation

  • Houzhang Fang
  • Zikai Liao
  • Lu Wang
  • Qingshan Li
  • Yi Chang
  • Luxin Yan
  • Xuhua Wang

Detecting multi-scale infrared unmanned aerial vehicle (UAV) targets (IRUTs) under dynamic scenarios remains a challenging task due to weak target features, varying shapes and poses, and complex background interference. Current detection methods find it difficult to address the above issues accurately and efficiently. In this paper, we design a dynamic attentive network (DANet) incorporating a scale-adaptive feature enhancement mechanism (SaFEM) and an attention-guided cross-weighting feature aggregator (ACFA). The SaFEM adaptively adjusts the network's receptive fields at hierarchical network levels leveraging separable deformable convolution (SDC), which enhances the network's multi-scale IRUT awareness. The ACFA, modulated by two crossing attention mechanisms, strengthens structural and semantic properties on neighboring levels for the accurate representation of multi-scale IRUT features from different levels. A plug-and-play anti-distractor contrastive regularization (ADCR) is also imposed on our DANet, which enforces similarity on features of targets and distractors from a new uncompressed feature projector (UFP) to increase the network's anti-distractor ability in complex backgrounds. To further increase the multi-scale UAV detection performance of DANet while maintaining its efficiency superiority, we propose a novel scale-specific knowledge distiller (SSKD) based on a divide-and-conquer strategy. For the "divide" stage, we intentionally construct three task-oriented teachers to learn tailored knowledge for small-, medium-, and large-scale IRUTs. For the "conquer" stage, we propose a novel element-wise attentive distillation module (EADM), where we employ a pixel-wise attention mechanism to highlight teacher and student IRUT features, and incorporate IRUT-associated prior knowledge for the collaborative transfer of refined multi-scale IRUT features to our DANet. Extensive experiments on real infrared UAV datasets demonstrate that our DANet is able to detect multi-scale UAVs with a satisfactory balance between accuracy and efficiency.

A Unified Query-based Paradigm for Camouflaged Instance Segmentation

  • Bo Dong
  • Jialun Pei
  • Rongrong Gao
  • Tian-Zhu Xiang
  • Shuo Wang
  • Huan Xiong

Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose UQFormer, a unified query-based multi-task learning framework for camouflaged instance segmentation. It builds a set of mask queries and a set of boundary queries to learn a shared composed query representation, efficiently integrating global camouflaged object region and boundary cues for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximum suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at: https://github.com/dongbo811/UQFormer.

Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation

  • Jialun Pei
  • Zhangjun Zhou
  • Yueming Jin
  • He Tang
  • Pheng-Ann Heng

High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN offers several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information, respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on the overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using a 1024×1024 input, our model enables real-time inference at 65.3 fps with ResNet-18. The source code is available at https://github.com/PJLallen/UDUN.

Exploring High-Correlation Source Domain Information for Multi-Source Domain Adaptation in Semantic Segmentation

  • Yuxiang Cai
  • Meng Xi
  • Yongheng Shang
  • Jianwei Yin

Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to one target domain. Although multi-source domains contain more complementary information than a single source domain, MSDA involves some disturbed source samples, which will degrade the adaptation performance. To solve this problem, we propose a novel MSDA method for semantic segmentation. Specifically, to fully explore the optimal source samples for the target domain, we propose a novel correlation measurement mechanism that weighs domain-level source-target correlation (DSC) and pixel-level source-target correlation (PSC). For each pair of source and target domains, DSC and PSC estimate the source-target correlations via the distances between target class prototypes and source class prototypes, and between target class prototypes and every pixel of source features, respectively. Built upon PSC, we propose a novel mix-up strategy, which pastes high-correlation source pixels onto target images to construct augmented mixed images for adaptation. Then we train the segmentor on the mixed images with pseudo labels and labeled source images, with DSC and PSC to suppress the negative effects of the low-correlation source domains and pixels. Furthermore, an attentive prototype alignment loss, based on DSC, is proposed to align target and multi-source domains, which attaches more importance to high-correlation source domains. The experimental results on the representative benchmark datasets (i.e., GTA5 and SYNTHIA → Cityscapes) highlight that our method substantially outperforms the state-of-the-art single-source domain adaptation and MSDA methods.
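A minimal sketch of how a domain-level correlation score could be computed from class prototypes is shown below, assuming cosine similarity between matching source and target prototypes; the paper's actual DSC/PSC distances and normalization are not reproduced here.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Average feature per class; features: (N, D), labels: (N,) in [0, num_classes)."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(0)
    return protos

def domain_level_correlation(src_protos, tgt_protos):
    """Toy stand-in for a DSC-style score: mean cosine similarity between matching
    class prototypes of one source domain and the target domain."""
    return F.cosine_similarity(src_protos, tgt_protos, dim=1).mean()

# Toy example: 19 classes (Cityscapes-style), 256-d pixel features, random data.
src_feat, src_lab = torch.randn(1000, 256), torch.randint(0, 19, (1000,))
tgt_feat, tgt_lab = torch.randn(1000, 256), torch.randint(0, 19, (1000,))  # pseudo-labels
dsc = domain_level_correlation(
    class_prototypes(src_feat, src_lab, 19),
    class_prototypes(tgt_feat, tgt_lab, 19))
print(float(dsc))
```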

Deep Image Harmonization in Dual Color Spaces

  • Linfeng Tan
  • Jiangtong Li
  • Li Niu
  • Liqing Zhang

Image harmonization is an essential step in image composition that adjusts the appearance of the composite foreground to address the inconsistency between foreground and background. Existing methods primarily operate in the correlated RGB color space, leading to entangled features and limited representation ability. In contrast, a decorrelated color space (e.g., Lab) has decorrelated channels that provide disentangled color and illumination statistics. In this paper, we explore image harmonization in dual color spaces, which supplements entangled RGB features with disentangled L, a, b features to alleviate the workload of the harmonization process. The network comprises an RGB harmonization backbone, an Lab encoding module, and an Lab control module. The backbone is a U-Net that translates the composite image into a harmonized image. Three encoders in the Lab encoding module extract three control codes independently from the L, a, and b channels, which are used to manipulate the decoder features in the harmonization backbone via the Lab control module. Our code and model are available at https://github.com/bcmi/DucoNet-Image-Harmonization.
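The core idea of supplementing RGB features with decorrelated Lab information can be previewed with a few lines of Python. The snippet below is a toy sketch, not DucoNet's learned encoders: it converts an RGB image to Lab with scikit-image and summarizes each channel, with the mean/std summary standing in for learned control codes.

```python
import numpy as np
from skimage import color, data

def lab_channel_codes(rgb_image: np.ndarray) -> dict:
    """Toy stand-in for Lab control codes: convert RGB to the decorrelated Lab
    space and summarize each channel with simple statistics. The real model uses
    learned per-channel encoders; the mean/std summary here only makes the idea
    concrete.
    """
    lab = color.rgb2lab(rgb_image)           # (H, W, 3): L in [0,100], a/b roughly [-128,127]
    codes = {}
    for idx, name in enumerate("Lab"):
        ch = lab[..., idx]
        codes[name] = (float(ch.mean()), float(ch.std()))  # per-channel "control code"
    return codes

# Example on a stock test image, scaled to [0, 1] floats before conversion.
print(lab_channel_codes(data.astronaut() / 255.0))
```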

Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

  • Wenyu Zhang
  • Xin Deng
  • Baojun Jia
  • Xingtong Yu
  • Yifan Chen
  • Jin Ma
  • Qing Ding
  • Xinming Zhang

Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To remedy this, we propose the Pixel Adapter Module (PAM), based on graph attention, to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves a 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss (ℒlca) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing the performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at https://github.com/wenyu1009/RTSRN.
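The sliding-window trick that removes the need for a sparse adjacency matrix can be illustrated with F.unfold: every pixel attends only to its local neighborhood gathered as a dense tensor. This is an assumed, minimal form of window-local pixel attention, not the paper's PAM.

```python
import torch
import torch.nn.functional as F

def sliding_window_pixel_attention(feat: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Minimal sketch of adjacency-free local pixel attention (assumed form):
    every pixel attends only to its window x window neighborhood, gathered with
    F.unfold, so no sparse adjacency matrix is ever built.

    feat: (B, C, H, W) feature map; returns a tensor of the same shape.
    """
    b, c, h, w = feat.shape
    pad = window // 2
    # Neighborhoods for every pixel: (B, C*window*window, H*W) -> (B, H*W, K, C)
    neigh = F.unfold(feat, kernel_size=window, padding=pad)
    neigh = neigh.view(b, c, window * window, h * w).permute(0, 3, 2, 1)
    center = feat.view(b, c, h * w).permute(0, 2, 1).unsqueeze(2)      # (B, HW, 1, C)
    attn = F.softmax((center * neigh).sum(-1) / c ** 0.5, dim=-1)      # (B, HW, K)
    out = (attn.unsqueeze(-1) * neigh).sum(2)                          # (B, HW, C)
    return out.permute(0, 2, 1).view(b, c, h, w)

x = torch.randn(2, 32, 16, 64)                    # e.g. upsampled text-image features
print(sliding_window_pixel_attention(x).shape)    # torch.Size([2, 32, 16, 64])
```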

Where and How: Mitigating Confusion in Neural Radiance Fields from Sparse Inputs

  • Yanqi Bao
  • Yuxin Li
  • Jing Huo
  • Tianyu Ding
  • Xinyue Liang
  • Wenbin Li
  • Yang Gao

Neural Radiance Fields from Sparse inputs (NeRF-S) have shown great potential in synthesizing novel views with a limited number of observed viewpoints. However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as "CONFUSION" during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: "WHERE" and "HOW". To this end, we present a novel learning framework, WaH-NeRF, which effectively mitigates confusion by tackling the following challenges: (i) "WHERE" to sample in NeRF-S: we introduce a Deformable Sampling strategy and a Weight-based Mutual Information Loss to address sample-position confusion arising from the limited number of viewpoints; and (ii) "HOW" to predict in NeRF-S: we propose a Semi-Supervised NeRF learning Paradigm based on pose perturbation and a Pixel-Patch Correspondence Loss to alleviate prediction confusion caused by the disparity between training and testing viewpoints. By integrating our proposed modules and loss functions, WaH-NeRF outperforms previous methods under the NeRF-S setting. Code is available at https://github.com/bbbbby-99/WaH-NeRF.

One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer

  • Hang Guo
  • Tao Dai
  • Mingyan Zhu
  • Guanghao Meng
  • Bin Chen
  • Zhi Wang
  • Shu-Tao Xia

Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-stage recognition. Although this pipeline is straightforward and intuitive, it has to use an additional super-resolution network, which causes inefficiencies during training and testing. Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring knowledge from the high-resolution domain. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution-based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness. Code is available at https://github.com/csguoh/KD-LTR.
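The soft logits term is plausibly a standard temperature-scaled distillation loss; a generic Hinton-style version for per-character logits is sketched below. The paper's exact word-level and sequence-level weighting is not reproduced, and the vocabulary size in the example is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_logits_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Temperature-scaled distillation loss (standard KL-divergence form), shown as
    a plausible shape of the "soft logits loss"; the paper's specific weighting is
    not reproduced here.

    Logits: (batch, seq_len, vocab) per-step predictions of a text recognizer.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, rescaled by T^2 as is conventional for distillation
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

student = torch.randn(8, 25, 97)   # LR-branch recognizer logits (97-char vocab assumed)
teacher = torch.randn(8, 25, 97)   # HR-branch teacher logits
print(soft_logits_loss(student, teacher).item())
```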

Calibration-based Dual Prototypical Contrastive Learning Approach for Domain Generalization Semantic Segmentation

  • Muxin Liao
  • Shishun Tian
  • Yuhang Zhang
  • Guoguang Hua
  • Wenbin Zou
  • Xia Li

Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learned from the source domain by PCL need to be aligned with the prototypes of other domains simultaneously. However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. It contains an uncertainty-guided PCL (UPCL) and a hard-weighted PCL (HPCL). Since the domain discrepancies of the prototypes of different classes may be different, we propose an uncertainty probability matrix to represent the domain discrepancies of the prototypes of all the classes. The UPCL estimates the uncertainty probability matrix to calibrate the weights of the prototypes during the PCL. Moreover, considering that the prototypes of different classes may be similar in some circumstances, which means these prototypes are hard-aligned, the HPCL is proposed to generate a hard-weighted matrix to calibrate the weights of the hard-aligned prototypes during the PCL. Extensive experiments demonstrate that our approach achieves superior performance over current approaches on domain generalization segmentation tasks. The source code will be released at https://github.com/seabearlmx/CDPCL.

Skeleton MixFormer: Multivariate Topology Representation for Skeleton-based Action Recognition

  • Wentian Xin
  • Qiguang Miao
  • Yi Liu
  • Ruyi Liu
  • Chi-Man Pun
  • Cheng Shi

Vision Transformer, which performs well in various vision tasks, encounters a bottleneck in skeleton-based action recognition and falls short of advanced GCN-based methods. The root cause is that current skeleton transformers apply self-attention over the complete channels of the global joints, ignoring the highly discriminative differential correlations within channels, so it is challenging to learn the expression of the multivariate topology dynamically. To tackle this, we present Skeleton MixFormer, an innovative spatio-temporal architecture to effectively represent the physical correlations and temporal interactivity of the compact skeleton data. Two essential components make up the proposed framework: 1) Spatial MixFormer. The channel-grouping and mix-attention are utilized to calculate the dynamic multivariate topological relationships. Compared with the full-channel self-attention method, Spatial MixFormer better highlights the channel groups' discriminative differences and the joint adjacency's interpretable learning. 2) Temporal MixFormer, which consists of Multiscale Convolution, Temporal Transformer and Sequential Holding Module. The multivariate temporal models ensure the richness of global difference expression and realize the discrimination of crucial intervals in the sequence, thereby enabling more effective learning of long- and short-term dependencies in actions. Our Skeleton MixFormer demonstrates state-of-the-art (SOTA) performance across seven different settings on four standard datasets, namely NTU-60, NTU-120, NW-UCLA, and UAV-Human. Related code will be available at https://github.com/ElricXin/Skeleton-MixFormer.

Mask Again: Masked Knowledge Distillation for Masked Video Modeling

  • Xiaojie Li
  • Shaowei He
  • Jianlong Wu
  • Yue Yu
  • Liqiang Nie
  • Min Zhang

Masked video modeling has shown remarkable performance in downstream tasks by predicting masked video tokens from visible ones. However, training models from scratch on large-scale unlabeled data remains computationally challenging and time-consuming. Moreover, the commonly used random-based sampling techniques may lead to the selection of redundant or low-information regions, hindering the model from learning discriminative representations within the limited training epochs. To achieve efficient pre-training, we propose MaskAgain, an efficient feature-based knowledge distillation framework for masked video pre-training that facilitates knowledge transfer from a pre-trained teacher model to a student model. In contrast to previous approaches that align all visible token features with the teacher model at output layers, MaskAgain adopts a selective approach by masking visible tokens again at both the hidden and output layers of the transformer block. Attention mechanisms are utilized for informative feature selection. At the hidden level, attention maps generated by the transformer's multi-head attention structure are utilized to select crucial token information at both temporally-global and temporally-local levels. Additionally, at the output level, an activation-based attention map is generated using token features, enabling us to focus on important tokens while preserving feature similarity and the relationship matrix similarity between patches. Extensive experimental results show that MaskAgain achieves comparable or even better performance than existing methods on benchmark datasets with much fewer training epochs and much less memory, which demonstrates that MaskAgain allows for efficient pre-training of accurate video models, reducing computational resources and training time significantly. Code is released at https://github.com/xiaojieli0903/MaskAgain.

Human-Object-Object Interaction: Towards Human-Centric Complex Interaction Detection

  • Mingxuan Zhang
  • Xiao Wu
  • Zhaoquan Yuan
  • Qi He
  • Xiang Huang

Localizing and recognizing interactive actions in videos is a pivotal yet intricate task that paves the way towards profound video comprehension. Recent advancements in Human-Object Interaction (HOI) detection, which involve detecting and localizing the interactions between human and object pairs, have undeniably marked significant progress. However, the realm of human-object-object interaction, an essential aspect of real-world industrial applications, remains largely uncharted. In this paper, we introduce a novel task referred to as Human-Object-Object Interaction (HOOI) detection and present a cutting-edge method named the Human-Object-Object Interaction Network (H2O-Net). The proposed H2O-Net is comprised of two principal modules: sequential motion feature extraction and HOOI modeling. The former module delves into the gradually evolving visual characteristics of entities throughout the HOOI process, harnessing spatial-temporal features across multiple fine-grained partitions. Conversely, the latter module aspires to encapsulate HOOI actions through intricate interactions between entities. It commences by capturing and amalgamating two sub-interaction features to extract comprehensive HOOI features, subsequently refining them using the interaction cues embedded within the long-term global context. Furthermore, we contribute to the research community by constructing a new video dataset, dubbed the HOOI dataset. The actions encompassed within this dataset pertain to pivotal operational behaviors in industrial manufacturing, imbuing it with substantial application potential and serving as a valuable addition to the existing repertoire of interaction action detection datasets. Experimental evaluations conducted on the proposed HOOI and widely-used AVA datasets demonstrate that our method outperforms existing state-of-the-art techniques by margins of 6.16 mAP and 1.9 mAP, respectively, thus substantiating its effectiveness.

On the Importance of Spatial Relations for Few-shot Action Recognition

  • Yilun Zhang
  • Yuqian Fu
  • Xingjun Ma
  • Lizhe Qi
  • Jingjing Chen
  • Zuxuan Wu
  • Yu-Gang Jiang

Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer knowledge from a source dataset to a novel target dataset with only one or a few labeled videos. However, existing methods mainly focus on modeling the temporal relations between the query and support videos while ignoring the spatial relations. In this paper, we find that spatial misalignment between objects also occurs in videos and is notably more common than temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. Particularly, we contribute a novel Spatial Alignment Cross Transformer (SA-CT) that learns to re-adjust the spatial relations and incorporate the temporal information. Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on three of the four benchmarks. To further incorporate the temporal information, we propose a simple yet effective Temporal Mixer module. The Temporal Mixer enhances the video representation and improves the performance of the full SA-CT model, achieving very competitive results. In this work, we also exploit large-scale pretrained models for few-shot action recognition, providing useful insights for this research direction.

CgT-GAN: CLIP-guided Text GAN for Image Captioning

  • Jiarui Yu
  • Haoran Li
  • Yanbin Hao
  • Bin Zhu
  • Tong Xu
  • Xiangnan He

The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" the real visual modality. Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded with a naturalness reward, calculated from the GAN's discriminator to measure how closely the caption matches human language, and a semantic guidance reward computed by the CLIP-based reward module. In addition to the cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
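A CLIP-cos style reward can be computed as the cosine similarity between CLIP's image and caption embeddings. The sketch below assumes the OpenAI clip package and the ViT-B/32 checkpoint, which are choices of convenience rather than details confirmed by the paper, and it omits the reward baselines and scaling used in actual GAN training.

```python
import torch
import clip                      # assumes the OpenAI CLIP package is installed
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # checkpoint choice is an assumption

@torch.no_grad()
def clip_cos_reward(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and caption embeddings, i.e. a
    CLIP-cos style reward for one image-caption pair."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = clip.tokenize([caption]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return float((img_feat * txt_feat).sum())

# Hypothetical usage (the path is a placeholder, not a file from the paper):
# reward = clip_cos_reward("example.jpg", "a dog playing with a frisbee on the grass")
```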

Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning

  • Xiaojie Li
  • Jianlong Wu
  • Shaowei He
  • Shuo Kang
  • Yue Yu
  • Liqiang Nie
  • Min Zhang

Self-supervised learning methods have shown significant promise in acquiring robust spatiotemporal representations from unlabeled videos. In this work, we address three critical limitations in existing self-supervised video representation learning: 1) insufficient utilization of contextual information and lifelong memory, 2) lack of fine-grained visual concept alignment, and 3) neglect of the feature distribution gap between encoders. To overcome these limitations, we propose a novel memory-enhanced predictor that leverages key-value memory networks with separate memories for the online and target encoders. This design enables the effective storage and retrieval of contextual knowledge, facilitating informed predictions and enhancing overall performance. Additionally, we introduce a visual concept alignment module that ensures fine-grained alignment of shared semantic information across segments of the same video. By employing coupled dictionary learning, we effectively decouple visual concepts, enriching the semantic representation stored in the memory networks. Our proposed approach is extensively evaluated on widely recognized benchmarks for action recognition and retrieval tasks, demonstrating its superiority in learning generalized video representations with significantly improved performance compared to existing state-of-the-art self-supervised learning methods. Code is released at https://github.com/xiaojieli0903/FGKVMemPred_video.
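The memory read at the heart of a key-value predictor can be reduced to a softmax-addressed lookup; the module below is a minimal sketch with an assumed slot count and feature size, not the paper's full predictor with separate online/target memories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemoryRead(nn.Module):
    """Minimal key-value memory read, illustrating memory-enhanced prediction.
    Slot count, feature size, and the single-read design are assumptions."""

    def __init__(self, slots: int = 256, dim: int = 512):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(slots, dim) * 0.02)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (B, D) online-encoder feature; returns retrieved context (B, D)
        addr = F.softmax(query @ self.keys.t() / query.size(-1) ** 0.5, dim=-1)  # (B, slots)
        return addr @ self.values                                                # (B, D)

memory = KeyValueMemoryRead()
feature = torch.randn(4, 512)                       # stand-in online-encoder features
context = memory(feature)                           # retrieved contextual knowledge
prediction_input = torch.cat([feature, context], dim=-1)  # feed feature + context to a predictor
print(prediction_input.shape)                       # torch.Size([4, 1024])
```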

Train One, Generalize to All: Generalizable Semantic Segmentation from Single-Scene to All Adverse Scenes

  • Ziyang Gong
  • Fuhao Li
  • Yupeng Deng
  • Wenjun Shen
  • Xianzheng Ma
  • Zhenming Ji
  • Nan Xia

Unsupervised Domain Adaptation (UDA) for semantic segmentation has received widespread attention for its ability to transfer knowledge from the source to target domains without a high demand for annotations. However, semantic segmentation under adverse conditions still poses significant challenges for autonomous driving, as bad weather observation data may introduce unforeseeable problems. Although previous UDA works are devoted to adverse-scene tasks, their adaptation process is redundant. For instance, unlabeled snow-scene training data is a must for the model to achieve fair segmentation performance in snowy scenarios. We refer to this type of adaptation process as the Single to Single (STS) strategy. Clearly, STS is time-consuming and may show weaknesses in compound scenes, such as a night scene with sleet. Motivated by the concept of Domain Generalization (DG), we propose the Single to All (STA) model. Unlike DG, which trains models on one or multiple source domains without target domains, the STA model is based on UDA and employs one source domain, one target domain, and one introduced domain to achieve generalization to all adverse conditions by training on a single-scene dataset. Specifically, the STA model is advantageous as it learns from the source domain, reserves the style factors via a Reservation domain, and adapts the unified factors by the Randomization module. An Output Space Refusion module is further incorporated to strengthen STA. Our STA achieves state-of-the-art performance on the Foggy Driving benchmark and demonstrates great domain generalizability under all conditions of the ACDC and Foggy Zurich benchmarks.

All-in-one Multi-degradation Image Restoration Network via Hierarchical Degradation Representation

  • Cheng Zhang
  • Yu Zhu
  • Qingsen Yan
  • Jinqiu Sun
  • Yanning Zhang

The aim of image restoration is to recover high-quality images from distorted ones. However, current methods usually focus on a single task (e.g., denoising, deblurring or super-resolution) which cannot address the needs of real-world multi-task processing, especially on mobile devices. Thus, developing an all-in-one method that can restore images from various unknown distortions is a significant challenge. Previous works have employed contrastive learning to learn the degradation representation from observed images, but this often leads to representation drift caused by deficient positive and negative pairs. To address this issue, we propose a novel All-in-one Multi-degradation Image Restoration Network (AMIRNet) that can effectively capture and utilize accurate degradation representation for image restoration. AMIRNet learns a degradation representation for unknown degraded images by progressively constructing a tree structure through clustering, without any prior knowledge of degradation information. This tree-structured representation explicitly reflects the consistency and discrepancy of various distortions, providing a specific clue for image restoration. To further enhance the performance of the image restoration network and overcome domain gaps caused by unknown distortions, we design a feature transform block (FTB) that aligns domains and refines features with the guidance of the degradation representation. We conduct extensive experiments on multiple distorted datasets, demonstrating the effectiveness of our method and its advantages over state-of-the-art restoration methods both qualitatively and quantitatively.

NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos

  • Ziyu Yang
  • Sucheng Ren
  • Zongwei Wu
  • Nanxuan Zhao
  • Junle Wang
  • Jing Qin
  • Shengfeng He

Non-photorealistic videos are in demand with the wave of the metaverse, but lack sufficient research attention. This work aims to take a step forward in understanding how humans perceive non-photorealistic videos with eye fixation (i.e., saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill the gap left by the lack of a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content, and the videos are of high quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating state-of-the-art performance on our task. The results uncover the strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. Our dataset and code can be found at https://github.com/Yangziyu/NPF200

LandmarkGait: Intrinsic Human Parsing for Gait Recognition

  • Zengbin Wang
  • Saihui Hou
  • Man Zhang
  • Xu Liu
  • Chunshui Cao
  • Yongzhen Huang
  • Shibiao Xu

Gait recognition is an emerging biometric technology for identifying pedestrians based on their unique walking patterns. In past gait recognition research, global-based methods have been inadequate to meet the growing demand for accuracy, while commonly used part-based methods provide coarse and inaccurate feature representations for specific body parts. Human parsing appears to be a better option for accurately representing specific and complete body parts in gait recognition. However, its practical application in gait recognition is often hindered by the missing RGB modality, the lack of annotated body parts, and the difficulty of balancing parsing quantity and quality. To address this issue, we propose LandmarkGait, an accessible and alternative parsing-based solution for gait recognition. LandmarkGait introduces an unsupervised landmark discovery network to transform the dense silhouette into a finite set of landmarks with remarkable consistency across various conditions. By grouping landmark subsets corresponding to distinct body part regions, following a reconstruction task and further refinement from high-quality input silhouettes, we can directly obtain fine-grained parsing results from original binary silhouettes in an unsupervised manner. Moreover, we also develop a multi-scale feature extractor that simultaneously captures global and parsing feature representations based on the integrity and flexibility of specific body parts. Extensive experiments demonstrate that our LandmarkGait can extract more stable features and exhibits significant performance improvement under all conditions, especially in various dressing conditions. Code is available at https://github.com/wzb-bupt/LandmarkGait.

Patchmatch Stereo++: Patchmatch Binocular Stereo with Continuous Disparity Optimization

  • Wenjia Ren
  • Qingmin Liao
  • Zhijing Shao
  • Xiangru Lin
  • Xin Yue
  • Yu Zhang
  • Zongqing Lu

Current deep-learning-based stereo matching algorithms achieve remarkably low error rates, but they suffer from the edge ambiguity effect. The primary reason is that they treat disparity estimation as a labeling problem, constructing a cost volume based on uniform discrete pixel-wise labels. This is insufficient to model the continuous disparity probability distribution (DPD), which harms the accuracy of complex regions. Moreover, current cost aggregation strategies cannot process unstructured disparity candidates very well, which is one of the bottlenecks limiting continuous modeling. We propose Patchmatch Stereo++, inspired by the traditional Patchmatch Stereo, to achieve better continuous disparity optimization in deep-learning-based methods. Firstly, to model an accurate continuous DPD, we introduce an adaptive dense sub-pixel sampling strategy to binocular stereo and approximate a continuous unstructured DPD for every pixel. Secondly, we design a convolution-based optimizer that can accept unstructured disparity candidates to parse the above continuous DPD in an adaptive manner and perform updates accordingly. Extensive experiments demonstrate that our method has the best performance among existing stereo matching networks at the edges, both quantitatively and qualitatively. At the time of submission, compared with published works pre-trained on SceneFlow, we rank 1st on the foreground regions of KITTI and 2nd on SceneFlow and ETH3D under various metrics. The source code will be released.

Consistency-aware Feature Learning for Hierarchical Fine-grained Visual Classification

  • Rui Wang
  • Cong Zou
  • Weizhong Zhang
  • Zixuan Zhu
  • Lihua Jing

Hierarchical Fine-Grained Visual Classification (HFGVC) assigns a label sequence (e.g., ["Albatross", "Laysan Albatross"]) with a coarse to fine hierarchy to each object. It remains challenging to achieve high accuracy and consistency due to the small inter-class difference, large intra-class variance, and difficulty in modeling relationships among classification tasks at different granularities. In this paper, we propose an effective Consistency-Aware Feature Learning (CAFL) method for HFGVC to improve prediction consistency and classification accuracy simultaneously. Our key idea is to encode the prediction consistency constraint into a weak supervision mechanism via forward deduction and backward induction over the label hierarchy. Furthermore, we develop a disentanglement and bidirectional reinforcement classification head to extract the features for the classifiers at different granularities. Together with the stop-gradient policy and attention mechanism, they enable each classifier to exploit the features from the ones at other granularities without suffering from their conflicting gradients in training. We evaluate our method on several commonly-used fine-grained public datasets, including CUB-200-2011, FGVC-Aircraft, and Stanford Cars. The results show that our method not only achieves state-of-the-art classification accuracy but also effectively reduces inconsistency errors by 50% under the hierarchical fine-grained classification setting.

FSR-Net: Deep Fourier Network for Shadow Removal

  • Jun Yu
  • Peng He
  • Ziqi Peng

The presence of shadows degrades the performance of various multimedia tasks. Image shadow removal aims at restoring the background of shadow regions, which is generally an open challenge. Unlike most existing deep learning-based methods that focus on restoring such degradations in the spatial domain, we introduce a novel shadow removal method that also exploits frequency domain information. Specifically, we first revisit the frequency characteristics of shadow images via the Fourier transform, where the amplitude components contain most of the lightness information and the phase components are related to structure information. Based on this, we propose a two-stage deep Fourier shadow removal network (FSR-Net) to enhance the brightness of shadow regions and correspondingly improve the shadow removal performance on whole images. Each stage consists of an amplitude recovery network and a phase recovery network to progressively reconstruct the lightness and structure components. To facilitate the learning of these two representations, we introduce frequency and spatial interaction blocks to process the local spatial features and the global frequency information separately. Extensive experiments demonstrate that FSR-Net achieves superior results to other approaches with fewer parameters. For example, our method obtains a 1.05 dB improvement on the ISTD [34] dataset over the previous state-of-the-art method [43] with 0.30M parameters.
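The amplitude/phase observation that motivates FSR-Net is easy to verify with torch.fft: swapping in a brighter image's amplitude while keeping the original phase changes lightness but largely preserves structure. The snippet below only illustrates this decomposition on synthetic tensors, not the network itself.

```python
import torch

def split_amplitude_phase(image: torch.Tensor):
    """Decompose an image tensor (B, C, H, W) into Fourier amplitude and phase."""
    spec = torch.fft.fft2(image, dim=(-2, -1))
    return spec.abs(), spec.angle()

def recombine(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Rebuild the spatial image from (possibly modified) amplitude and phase."""
    spec = torch.polar(amplitude, phase)            # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, dim=(-2, -1)).real

# Toy check: a brighter amplitude with the original phase preserves structure.
shadow = torch.rand(1, 3, 64, 64) * 0.3             # dim synthetic "shadow" image
bright = shadow * 2.5
amp_s, pha_s = split_amplitude_phase(shadow)
amp_b, _ = split_amplitude_phase(bright)
restored = recombine(amp_b, pha_s)                  # brighter lightness, same structure
print(restored.shape, float(shadow.mean()), float(restored.mean()))
```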

Multi-Speed Global Contextual Subspace Matching for Few-Shot Action Recognition

  • Tianwei Yu
  • Peng Chen
  • Yuanjie Dang
  • Ruohong Huan
  • Ronghua Liang

Few-shot action recognition (FSAR) aims to classify unseen query actions into categories represented by a few labeled support videos. Most current FSAR methods adopt the frame-level matching mechanism that requires continuous actions to be represented by a fixed number of frame features. However, this could compromise the completeness of the contextual video information and make it difficult to handle video features of varying frame sampling speeds. In this paper, we propose a multi-speed global contextual subspace matching (MGCSM) method that generates global contextual action subspace representations from videos containing different numbers of frames to preserve contextual semantic information. Specifically, we propose to obtain the scale-agnostic information of embedding video features using a global contextual aggregation (GCA) module and then generate the discriminative action subspace representation with an action subspace generation (ASG) module. Furthermore, we introduce a multi-speed subspace matching (MSM) mechanism that generates a multi-speed classification score by integrating the similarities between query videos and support subspaces of varying sampling speeds. The proposed method is embedding-agnostic and can be combined with most mainstream embedding networks without model re-designs. Comprehensive and reproducible experiments on standard datasets demonstrate our method's superior performance compared to existing state-of-the-art methods.

Lightweight Super-Resolution Head for Human Pose Estimation

  • Haonan Wang
  • Jie Liu
  • Jie Tang
  • Gangshan Wu

Heatmap-based methods have become the mainstream approach for pose estimation due to their superior performance. However, heatmap-based approaches suffer from significant quantization errors with downscaled heatmaps, which results in limited performance and detrimental effects from intermediate supervision. Previous heatmap-based methods relied heavily on additional post-processing to mitigate quantization errors. Some heatmap-based approaches improve the resolution of feature maps by using multiple costly upsampling layers to improve localization precision. To solve the above issues, we creatively view the backbone network as a degradation process and thus reformulate the heatmap prediction as a Super-Resolution (SR) task. We first propose the SR head, which predicts heatmaps with a spatial resolution higher than the input feature maps (or even consistent with the input image) by super-resolution, to effectively reduce the quantization error and the dependence on further post-processing. Besides, we propose SRPose to gradually recover high-resolution (HR) heatmaps from low-resolution (LR) heatmaps and degraded features in a coarse-to-fine manner. To reduce the training difficulty of HR heatmaps, SRPose applies SR heads to supervise the intermediate features in each stage. In addition, the SR head is a lightweight and generic head that applies to top-down and bottom-up methods. Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based approaches.

Exploiting Time-Frequency Conformers for Music Audio Enhancement

  • Yunkee Chae
  • Junghyun Koo
  • Sungho Lee
  • Kyogu Lee

With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work. Audio samples enhanced with our system are available at: https://tinyurl.com/smpls9999

Exploring Dual Representations in Large-Scale Point Clouds: A Simple Weakly Supervised Semantic Segmentation Framework

  • Jiaming Liu
  • Yue Wu
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma
  • Cai Xu

Existing work shows that 3D point clouds suffer only about a 4% drop in semantic segmentation accuracy even with 1% random point annotation, which inspires us to further explore how to achieve better results at lower cost. Scene point clouds provide position and color information, which are often used in tandem as the only input, yet little work has gone into segmentation that fuses information from the two spaces. To optimize point cloud representations, we propose a novel dual representation query network (DRQNet). The proposed framework partitions the input point cloud into position and color spaces, using the separately extracted geometric structure and semantic context to create an internal supervisory mechanism that bridges the dual spaces and fuses their information. Adopting sparsely annotated points as the query set, DRQNet provides guidance and perceptual information for multi-stage point clouds through random sampling. Moreover, to differentiate and enhance the features generated by local neighbourhoods within multiple perceptual fields, we design a representation selection module to identify the contributions made by the position and color of each query point and weight them adaptively according to reliability. The proposed DRQNet is robust for point cloud analysis and eliminates the effects of irregularity and disorder. Our method achieves significant performance gains on three mainstream benchmarks.

Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection

  • Keke Chen
  • Xiangbo Shu
  • Guo-Sen Xie
  • Rui Yan
  • Jinhui Tang

Spatio-temporal Action Detection (SAD) aims to recognize multi-class actions and meanwhile locate their spatio-temporal occurrence in untrimmed videos. Besides relying on the inherent inter-actor interactions, most previous SAD approaches model actor interactions between multi-actors and the whole frames or special parts (e.g., objects/hands). However, such approaches are relatively inelegant in that they 1) roughly treat all actors as interacting equivalently with frames/parts, or 2) resort to multiple costly detectors to acquire the special parts. To solve the above dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed FBI Learning) framework to learn multi-actor features by attentively interacting with the readily available foreground and background frames. Specifically, we first design a new Mask-guided Cross Attention (MCA) mechanism that calculates the masked cross-attentions to capture the compact relations between the actors and foreground/background regions. Next, we present a new Actor-guided Feature Aggregation (AFA) scheme that integrates foreground- and background-interacted actor features with the learnable actor-based weights. Finally, we construct a long-term feature bank that associates temporal context information to facilitate action classification. Extensive experiments are conducted on the commonly available UCF101-24, MultiSports, and AVA v2.1/v2.2 datasets, which illustrate the competitive performance of FBI Learning against the state-of-the-art methods.

TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio

  • Xin Wang
  • Benyuan Meng
  • Hong Chen
  • Yuan Meng
  • Ke Lv
  • Wenwu Zhu

Knowledge graphs serve as a powerful tool to boost model performance for various applications covering computer vision, natural language processing, multimedia data mining, etc. The process of knowledge acquisition for humans is multimodal in essence, covering the text, image, video and audio modalities. However, existing multimodal knowledge graphs fail to cover all four of these elements simultaneously, severely limiting their expressive power for performance improvement on downstream tasks. In this paper, we propose TIVA-KG, a multimodal Knowledge Graph covering Text, Image, Video and Audio, which can benefit various downstream tasks. Our proposed TIVA-KG has two significant advantages over existing knowledge graphs: i) coverage of up to four modalities, including text, image, video, and audio, and ii) the capability of triplet grounding, which grounds multimodal relations to triples instead of entities. We further design a Quadruple Embedding Baseline (QEB) model to validate the necessity and efficacy of considering four modalities in a KG. We conduct extensive experiments to test the proposed TIVA-KG with various knowledge graph representation approaches over the link prediction task, demonstrating the benefits and necessity of introducing multiple modalities and triplet grounding. TIVA-KG is expected to promote further research on mining multimodal knowledge graphs as well as the relevant downstream tasks in the community. TIVA-KG is now available at our website: http://mn.cs.tsinghua.edu.cn/tivakg.

Enhancing Fake News Detection in Social Media via Label Propagation on Cross-modal Tweet Graph

  • Wanqing Zhao
  • Yuta Nakashima
  • Haiyuan Chen
  • Noboru Babaguchi

Fake news detection in social media has become increasingly important due to the rapid proliferation of personal media channels and the consequential dissemination of misleading information. Existing methods, which primarily rely on multimodal features and graph-based techniques, have shown promising performance in detecting fake news. However, they still face a limitation, i.e., sparsity in graph connections, which hinders capturing possible interactions among tweets. This challenge has motivated us to explore a novel method that densifies the graph's connectivity to better capture such interactions. Our method constructs a cross-modal tweet graph using CLIP, which encodes images and text into a unified space, allowing us to extract potential connections based on similarities in text and images. We then design a Feature Contextualization Network with Label Propagation (FCN-LP) to model the interaction among tweets as well as positive or negative correlations between predicted labels of connected tweets. The propagated labels from the graph are weighted and aggregated for the final detection. To enhance the model's generalization ability to unseen events, we introduce a domain generalization loss that ensures consistent features between tweets on seen and unseen events. We use three publicly available fake news datasets, Twitter, PHEME, and Weibo, for evaluation. Our method consistently improves performance over the state-of-the-art methods on all benchmark datasets and effectively demonstrates its aptitude for generalizing fake news detection in social media.
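The propagation step can be pictured as classic label propagation over a similarity graph built from joint embeddings; the function below is an illustrative, non-learned version (the threshold, damping factor, and iteration count are arbitrary assumptions), whereas FCN-LP learns and weights these correlations.

```python
import torch
import torch.nn.functional as F

def label_propagation(embeddings, initial_probs, alpha=0.8, iters=20, threshold=0.5):
    """Minimal label propagation over a similarity graph (an illustration of the
    propagation idea, not the learned FCN-LP module).

    embeddings:    (N, D) joint image-text tweet embeddings (e.g. from CLIP)
    initial_probs: (N, 2) per-tweet fake/real probabilities from a base classifier
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                  # cosine similarity graph
    sim = torch.where(sim > threshold, sim, torch.zeros_like(sim))
    sim.fill_diagonal_(0.0)                          # no self-loops
    deg = sim.sum(1, keepdim=True).clamp(min=1e-8)
    s = sim / deg                                    # row-normalized transition matrix
    y = initial_probs.clone()
    for _ in range(iters):
        y = alpha * (s @ y) + (1 - alpha) * initial_probs
    return y                                         # smoothed fake/real scores

probs = label_propagation(torch.randn(100, 512), torch.rand(100, 2))
print(probs.shape)  # torch.Size([100, 2])
```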

Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation

  • Xingxing Yang
  • Jie Chen
  • Zaifeng Yang

Near-infrared (NIR) image spectrum translation is a challenging problem with many promising applications. Existing methods struggle with the mapping ambiguity between the NIR and the RGB domains, and generalize poorly due to the limitations of models' learning capabilities and the unavailability of sufficient NIR-RGB image pairs for training. To address these challenges, we propose a cooperative learning paradigm that colorizes NIR images in parallel with another proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors), dubbed CoColor. The complementary statistical and semantic spectrum information from these two task domains -- in the forms of pre-trained colorization networks -- is brought in as task domain priors. A bilateral domain translation module is subsequently designed, in which intermittent NIR images are generated from grayscale and colorized in parallel with authentic NIR images; and vice versa for the grayscale images. These intermittent transformations act as latent spectrum context priors for efficient domain knowledge exchange. We progressively fine-tune and fuse these modules with a series of pixel-level and feature-level consistency constraints. Experiments show that our proposed cooperative learning framework produces satisfactory spectrum translation outputs with diverse colors and rich textures, and outperforms state-of-the-art counterparts by 3.95 dB and 4.66 dB in terms of PSNR for the NIR and grayscale colorization tasks, respectively.

ALA: Naturalness-aware Adversarial Lightness Attack

  • Yihao Huang
  • Liangru Sun
  • Qing Guo
  • Felix Juefei-Xu
  • Jiayi Zhu
  • Jincao Feng
  • Yang Liu
  • Geguang Pu

Most researchers have tried to enhance the robustness of deep neural networks (DNNs) by revealing and repairing the vulnerability of DNNs with specialized adversarial examples. Some of these attack examples have imperceptible perturbations restricted by the Lp norm. However, due to their high-frequency property, such adversarial examples can be defended against by denoising methods and are hard to realize in the physical world. To avoid these defects, some works have proposed unrestricted attacks to gain better robustness and practicality. Unfortunately, these examples usually look unnatural and can alert observers. In this paper, we propose Adversarial Lightness Attack (ALA), a white-box unrestricted adversarial attack that focuses on modifying the lightness of the images. The shape and color of the samples, which are crucial to human perception, are barely influenced. To obtain adversarial examples with a high attack success rate, we propose unconstrained enhancement in terms of the light and shade relationship in images. To enhance the naturalness of images, we craft the naturalness-aware regularization according to the range and distribution of light. The effectiveness of ALA is verified on two popular datasets for different tasks (i.e., ImageNet for image classification and Places-365 for scene recognition).

Neural Image Popularity Assessment with Retrieval-augmented Transformer

  • Liya Ji
  • Chan Ho Park
  • Zhefan Rao
  • Qifeng Chen

Since the advent of social media platforms, image selection based on social preference has been a challenging task that all users inherently undertake before sharing images with the public. In our user study for this problem, human choices of images based on perceived social preference are largely inaccurate (58.7% accuracy). The challenge of this task, also known as image popularity assessment, lies in its subjective nature caused by visual and non-visual factors. Especially in the social media setting, social feedback on a particular image differs largely depending on who uploads it. Therefore, a social preference model should be able to account for this user-specific aspect of the task. To address this issue, we present a retrieval-augmented approach that leverages both image features and user-specific statistics for neural image popularity assessment. User-specific statistics are derived by retrieving past images with their statistics from a memory bank. By combining these statistics with image features, our approach achieves 79.5% accuracy, which significantly outperforms human and baseline models on the pairwise ranking of images from the Instagram Influencer Dataset. Our source code will be publicly available.
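
The retrieval of user-specific statistics from a memory bank can be illustrated with a short sketch. It is only an assumption-laden example (random features stand in for encoder outputs, and k-nearest-neighbour averaging is one simple aggregation choice), not the paper's implementation.

import numpy as np

def retrieve_user_statistics(query_feat, memory_feats, memory_stats, k=5):
    """Retrieve popularity statistics of a user's most similar past images.

    query_feat:   (D,) feature of the new image (e.g. from an image encoder).
    memory_feats: (M, D) features of the same user's past images in a memory bank.
    memory_stats: (M, S) their engagement statistics (likes, comments, ...).
    Returns the mean statistics of the k nearest past images, which can be
    concatenated with the image feature before the ranking head.
    """
    sims = memory_feats @ query_feat / (
        np.linalg.norm(memory_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    top = np.argsort(-sims)[:k]
    return memory_stats[top].mean(axis=0)

# Toy usage: 100 past posts with 2 statistics each.
rng = np.random.default_rng(0)
stats = retrieve_user_statistics(rng.normal(size=256), rng.normal(size=(100, 256)), rng.random((100, 2)))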

A Figure Skating Jumping Dataset for Replay-Guided Action Quality Assessment

  • Yanchao Liu
  • Xina Cheng
  • Takeshi Ikenaga

In competitive sports, judges often scrutinize replay videos from multiple views to adjudicate uncertain or contentious actions and ultimately ascertain the definitive score. Most existing action quality assessment methods regress from a single video or from a pairwise exemplar and input video, and are thus limited by the viewpoint and zoom scale of the videos. To this end, we construct a Replay Figure Skating Jumping dataset (RFSJ), containing additional view information provided by the post-match replay videos and fine-grained annotations. We also propose a Replay-Guided approach for action quality assessment, learned by a Triple-Stream Contrastive Transformer and a Temporal Concentration Module. Specifically, besides the pairwise input and exemplar, we contrast the input and its replay with an extra contrastive module. The consistency of scores then guides the model to learn features of the same action under different views and zoom scales. In addition, based on the fact that errors or highlight moments of athletes are crucial factors affecting scoring and that these moments are concentrated in parts of the video rather than uniformly distributed, the proposed temporal concentration module encourages the model to concentrate on these features and cooperates with the contrastive regression module to obtain an effective scoring mechanism. Extensive experiments demonstrate that our method achieves a Spearman's Rank Correlation of 0.9346 on the proposed RFSJ dataset, improving over existing state-of-the-art methods.

Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution

  • Yeying Jin
  • Beibei Lin
  • Wending Yan
  • Yuan Yuan
  • Wei Ye
  • Robby T. Tan

Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from rendered glow pairs. Specifically, a light-source-aware network is proposed to detect the light sources of night images, followed by APSF (Angular Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution to capture edges and textures in hazy scenes. By leveraging the extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, which is then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images demonstrates the effectiveness of our method. Our experiments show that our method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods by 13% on the GTA5 nighttime haze dataset. Our data and code are available at: https://github.com/jinyeying/nighttime_dehaze.
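
One plausible reading of the low-light boosting step described above is sketched below: a learned attention map, high on dark regions, gates a gamma-corrected version of the image back into the output. The gamma value and the crude attention stand-in are assumptions for illustration only.

import torch

def gamma_boost(image, attention, gamma=0.45):
    """Blend a gamma-brightened image back into the input using an attention map.

    image:     (B, 3, H, W) in [0, 1]
    attention: (B, 1, H, W) in [0, 1]; high on low-light regions,
               low on haze/glow regions (predicted by a network in the paper).
    gamma < 1 brightens dark pixels more than bright ones.
    """
    brightened = image.clamp(min=1e-6) ** gamma
    return attention * brightened + (1.0 - attention) * image

# Toy usage: darker pixels get boosted where attention is high.
img = torch.rand(1, 3, 64, 64) * 0.3            # a dark image
att = 1.0 - img.mean(dim=1, keepdim=True)        # crude stand-in for the learned attention
out = gamma_boost(img, att)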

Rethinking Voice-Face Correlation: A Geometry View

  • Xiang Li
  • Yandong Wen
  • Muqiao Yang
  • Jinglu Wang
  • Rita Singh
  • Bhiksha Raj

Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.

Dynamic Grouped Interaction Network for Low-Light Stereo Image Enhancement

  • Baiang Li
  • Huan Zheng
  • Zhao Zhang
  • Yang Zhao
  • Zhongqiu Zhao
  • Haijun Zhang

Low-Light Stereo Image Enhancement (LLSIE) tackles the challenge of improving the illumination and restoring the details in stereo images. However, existing deep learning-based LLSIE methods trained on high-resolution low-light images often exhibit sub-optimal performance when interacting with information from the left and right views. We find that this is because of: (1) the high computational cost arising from quadratic complexity, which hinders the enhancement model's ability to process high-resolution images; and (2) the limitations of conventional fusion strategies in previous work, which inadequately capture cross-view cues, resulting in weak feature representation and compromised detail recovery. To address these limitations, we propose a novel Dynamic Grouped Interaction Network (DGI-Net) to enhance illumination and recover more details while reducing the computational cost. Specifically, DGI-Net employs the U-Net structure, which effectively mitigates noise during the low-light enhancement. Furthermore, we design a Grouped Stereo Interaction Module (GSIM) with a grouping strategy to efficiently discover cross-view cues while minimizing computations. To dynamically fuse stereo information and fully exploit cross-view correlations, we also introduce a Dynamic Embedding Module (DEM) to establish dynamic connections between inter-view cues and intra-view features, which performs dynamic weight processing on cross-view cues to eliminate noise during fusion. For intra-view processing, we present a Diversity Enhanced Block (DEB) to extract multi-scale features, thereby improving diversity and feature representation. This multi-scale feature extraction also addresses low image contrast in dark lighting conditions. Experimental results demonstrate that DGI-Net outperforms current state-of-the-art methods in low-light stereo image enhancement.

PVG: Progressive Vision Graph for Vision Recognition

  • JiaFu Wu
  • Jian Li
  • Jiangning Zhang
  • Boshen Zhang
  • Mingmin Chi
  • Yabiao Wang
  • Chengjie Wang

Convolution-based and Transformer-based vision backbone networks process images into grid or sequence structures, respectively, which are inflexible for capturing irregular objects. Though Vision GNN (ViG) adopts graph-level features for complex images, it has some issues, such as inaccurate neighbor node selection, expensive node information aggregation, and over-smoothing in the deep layers. To address the above problems, we propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. Compared with previous works, PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), which introduces second-order similarity by gradually increasing the channels of the global graph branch and decreasing the channels of the local branch as the layers deepen; 2) a neighbor node information aggregation and update module using Max pooling and mathematical Expectation (MaxE) to aggregate rich neighbor information; 3) a Graph error Linear Unit (GraphLU) to enhance low-value information in a relaxed form, reducing the compression of image detail and alleviating over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the superiority of PVG over state-of-the-art methods; e.g., our PVG-S obtains 83.0% Top-1 accuracy on ImageNet-1K, surpassing the GNN-based ViG-S by +0.9↑ with 18.5% fewer parameters, while the largest PVG-B obtains 84.2%, a +0.5↑ improvement over ViG-B. Furthermore, our PVG-S obtains +1.3↑ box AP and +0.4↑ mask AP gains over ViG-S on the COCO dataset.
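
The MaxE aggregation, read literally as combining max pooling with the expectation (mean) over neighbor features, could look like the sketch below. The dense adjacency formulation and the additive fusion are assumptions; the paper's module operates inside the full PVG architecture.

import torch

def maxe_aggregate(node_feats, adjacency):
    """Aggregate neighbor information with max pooling and expectation (mean).

    node_feats: (N, D) node features.
    adjacency:  (N, N) binary adjacency matrix (no self-loops).
    Returns (N, D) aggregated features combining the two statistics.
    """
    N, D = node_feats.shape
    neg_inf = torch.finfo(node_feats.dtype).min
    expanded = node_feats.unsqueeze(0).expand(N, N, D)            # (N, N, D)
    mask = adjacency.unsqueeze(-1)                                # (N, N, 1)
    max_part = expanded.masked_fill(mask == 0, neg_inf).max(dim=1).values
    has_nbr = adjacency.sum(dim=1, keepdim=True) > 0
    max_part = torch.where(has_nbr, max_part, torch.zeros_like(max_part))
    degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
    mean_part = (expanded * mask).sum(dim=1) / degree
    return max_part + mean_part                                   # simple fusion for illustration

# Toy usage on a 4-node graph.
x = torch.randn(4, 8)
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=torch.float32)
out = maxe_aggregate(x, adj)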

StylePrompter: All Styles Need Is Attention

  • Chenyi Zhuang
  • Pan Gao
  • Aljosa Smolic

GAN inversion aims at inverting given images into corresponding latent codes for Generative Adversarial Networks (GANs), especially StyleGAN, whose disentangled latent space allows attribute-based image manipulation. Whereas most inversion methods build upon Convolutional Neural Networks (CNNs), we innovatively adopt a hierarchical vision Transformer backbone to predict W+ latent codes at the token level. We further apply a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) in ℱ space to refine the intermediate style features of the generator. By treating style features as queries to retrieve lost identity information from the encoder's feature maps, SMART can not only produce high-quality inverted images but also surprisingly adapt to editing tasks. We then show that StylePrompter lies in a more disentangled W+ space and demonstrate the controllability of SMART. Finally, quantitative and qualitative experiments demonstrate that StylePrompter achieves a desirable balance between reconstruction quality and editability, and is "smart" enough to fit most edits, outperforming other ℱ-involved inversion methods. Our code is available at: https://github.com/I2-Multimedia-Lab/StylePrompter.

Improving Federated Person Re-Identification through Feature-Aware Proximity and Aggregation

  • Pengling Zhang
  • Huibin Yan
  • Wenhui Wu
  • Shuoyao Wang

Person re-identification (ReID) is a challenging task that aims to identify individuals across multiple non-overlapping camera views. To enhance the performance and robustness of ReID models, it is crucial to train them over multiple data sources. However, the traditional centralized approach poses a significant challenge to privacy as it requires collecting data from distributed data owners. To overcome this challenge, we employ the federated learning approach, which enables distributed model training without compromising data privacy. In this paper, we propose a novel feature-aware local proximity and global aggregation method for federated ReID to extract robust feature representations. Specifically, we introduce a proximal term and a feature regularization term for local model training to improve local training accuracy while ensuring global aggregation convergence. Furthermore, we use the cosine distance of backbone features to determine the global aggregation weight of each local model. Our proposed method significantly improves the performance and generalization of the global model. Extensive experiments demonstrate the effectiveness of our proposal. Specifically, our method achieves an additional 27.3% Rank-1 average accuracy in federated full supervision and an extra 20.3% mean Average Precision (mAP) on DukeMTMC in federated domain generalization.

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization

  • Xizhe Xue
  • Dongdong Yu
  • Lingqiao Liu
  • Yu Liu
  • Satoshi Tsutsui
  • Ying Li
  • Zehuan Yuan
  • Ping Song
  • Mike Zheng Shou

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. Mainstream approaches use a two-stage segmentation framework, which first locates candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that the end-to-end training process in the single-stage framework is more convenient for directly regularizing the localization of class-agnostic object pixels. Based on the transformer-based instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss alleviates the problem of incomplete instance annotation, a common problem in existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution for semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the AP_100 score by 4.75% in the UVO→UVO setting and by 4.05% in the COCO→UVO setting.
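
A minimal sketch of a cross-task consistency term of the kind described above: the soft union of per-query instance masks is encouraged to agree with the class-agnostic foreground prediction. The L1 formulation and tensor shapes are assumptions; the paper's exact loss may differ.

import torch
import torch.nn.functional as F

def cross_task_consistency_loss(instance_masks, foreground_logits):
    """Encourage agreement between instance segmentation and a foreground branch.

    instance_masks:    (Q, H, W) per-query soft instance masks in [0, 1].
    foreground_logits: (H, W) logits from the class-agnostic foreground head.
    The union of instance masks should match the predicted foreground region.
    """
    union = instance_masks.max(dim=0).values            # soft union over instances
    foreground = torch.sigmoid(foreground_logits)
    return F.l1_loss(union, foreground)

# Toy usage.
masks = torch.rand(10, 64, 64)
fg_logits = torch.randn(64, 64)
loss = cross_task_consistency_loss(masks, fg_logits)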

Cross-Illumination Video Anomaly Detection Benchmark

  • Dongliang Zhu
  • Ruimin Hu
  • Shengli Song
  • Xiang Guo
  • Xixi Li
  • Zheng Wang

Video anomaly detection is a critical problem with widespread applications in domains such as security surveillance. Most existing methods focus on video anomaly detection under uniform illumination conditions. However, in the real world, the situation is much more complicated. Video anomalies occur across time periods and under different illumination conditions, which can lead the detector to incorrectly report high anomaly scores. To address this challenge, we design a benchmark framework for the cross-illumination video anomaly detection task. The framework restores videos under different illumination scales to the same illumination scale, reducing domain differences between uniformly illuminated training videos and differently illuminated test videos. Additionally, to demonstrate the illumination change problem and evaluate our model, we construct three large-scale datasets with a wide range of illumination variations. We experimentally validate our approach on these three cross-illumination video anomaly detection datasets. Experimental results show that our method outperforms existing methods in detection accuracy and is more robust.

Practical Edge Detection via Robust Collaborative Learning

  • Yuanbin Fu
  • Xiaojie Guo

Edge detection, as a core component of a wide range of vision-oriented tasks, aims to identify object boundaries and prominent edges in natural images. An edge detector should be both efficient and accurate for practical use. To achieve this goal, two key issues should be addressed: 1) how to liberate deep edge models from the inefficient pre-trained backbones leveraged by most existing deep learning methods, to save computational cost and cut the model size; and 2) how to mitigate the negative influence of noisy or even wrong labels in training data, which widely exist in edge detection due to the subjectivity and ambiguity of annotators, for better robustness and accuracy. In this paper, we attempt to address both problems simultaneously by developing a collaborative learning based model, termed PEdger. The principle behind PEdger is that the information learned from different training moments and heterogeneous (recurrent and non-recurrent in this work) architectures can be assembled to explore knowledge that is robust against noisy annotations, even without pre-training on extra data. Extensive ablation studies together with quantitative and qualitative experimental comparisons on the BSDS500 and NYUD datasets verify the effectiveness of our design and demonstrate its superiority over other competitors in terms of accuracy, speed, and model size.

MSECNet: Accurate and Robust Normal Estimation for 3D Point Clouds by Multi-Scale Edge Conditioning

  • Haoyi Xiu
  • Xin Liu
  • Weimin Wang
  • Kyoung-Sook Kim
  • Masashi Matsuoka

Estimating surface normals from 3D point clouds is critical for various applications, including surface reconstruction and rendering. While existing methods for normal estimation perform well in regions where normals change slowly, they tend to fail where normals vary rapidly. To address this issue, we propose a novel approach called MSECNet, which improves estimation in normal varying regions by treating normal variation modeling as an edge detection problem. MSECNet consists of a backbone network and a multi-scale edge conditioning (MSEC) stream. The MSEC stream achieves robust edge detection through multi-scale feature fusion and adaptive edge detection. The detected edges are then combined with the output of the backbone network using the edge conditioning module to produce edge-aware representations. Extensive experiments show that MSECNet outperforms existing methods on both synthetic (PCPNet) and real-world (SceneNN) datasets while running significantly faster. We also conduct various analyses to investigate the contribution of each component in the MSEC stream. Finally, we demonstrate the effectiveness of our approach in surface reconstruction.

Efficient Parallel Multi-Scale Detail and Semantic Encoding Network for Lightweight Semantic Segmentation

  • Xiao Liu
  • Xiuya Shi
  • Lufei Chen
  • Linbo Qing
  • Chao Ren

In this work, we propose PMSDSEN, a parallel multi-scale encoder-decoder network architecture for semantic segmentation, inspired by the human visual perception system's ability to aggregate contextual information in various contexts and at various scales. Our approach introduces the efficient Parallel Multi-Scale Detail and Semantic Encoding (PMSDSE) unit to extract detailed local information and coarse long-range relationships in parallel, enabling the recognition of object boundaries and object-level areas. By stacking multiple PMSDSEs, our network learns fine-grained details and textures along with abstract category and semantic information, effectively utilizing a larger range of surrounding context for robust segmentation. To further enlarge the network's receptive field without increasing computational complexity, the Multi-Scale Semantic Extractor (MSSE) at the end of the encoder is utilized for multi-scale semantic context extraction and detailed information encoding. Additionally, the Dynamic Weighted Feature Fusion (DWFF) strategy is employed to integrate shallow-layer detail information and deep-layer semantic information during the decoding stage. Our method can obtain multi-scale context from local to global, efficiently progressing from low-level feature extraction to high-level semantic interpretation at different scales and in different contexts. Without bells and whistles, PMSDSEN obtains a better trade-off between accuracy and complexity on popular benchmarks, including Cityscapes and CamVid. Specifically, PMSDSEN attains 73.2% mIoU with only 0.9M parameters on the Cityscapes test set. Codes and supplementary materials: https://github.com/liux520/PMSDSEN.

Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes

  • Jiquan Zhong
  • Xiaolin Huang
  • Xiao Yu

Monocular depth estimation is a fundamental task in computer vision and multimedia. The self-supervised learning pipeline makes it possible to train a monocular depth network without the need for depth labels. In this paper, a multi-frame depth model with multi-scale feature fusion is proposed to strengthen texture features and spatial-temporal features, which improves the robustness of depth estimation between frames with large camera ego-motion. A novel dynamic object detection method with geometric explainability is proposed. The detected dynamic objects are excluded during training, which preserves the static-environment assumption and relieves the accuracy degradation of multi-frame depth estimation. Robust knowledge distillation with a consistent teacher network and a reliability guarantee is proposed, which improves multi-frame depth estimation without increasing computational complexity at test time. The experiments show that our proposed methods achieve substantial performance improvements in multi-frame depth estimation.

Peering into The Sketch: Ultra-Low Bitrate Face Compression for Joint Human and Machine Perception

  • Yudong Mao
  • Peilin Chen
  • Shurun Wang
  • Shiqi Wang
  • Dapeng Wu

We propose a novel face compression framework that leverages the external priors for joint human and machine perception under ultra-low bitrate scenarios. The proposed framework leverages the semantic richness of face images by representing the faces into sketches and thumbnails, resulting in improved bitrate utility for both human and machine vision. At the decoder side, the framework introduces a two-stage generative reconstruction, which faithfully enhances the reconstructed image via semi-parametric modeling and retrieved guidance from the external database. In particular, this coarse-to-fine strategy also results in improved identity consistency and analysis performance of the reconstructed image. Extensive evaluations of the proposed method have been conducted on the public face dataset by comparing it with end-to-end image compression techniques as well as traditional image compression standards. The experimental results demonstrate the effectiveness of the proposed method via superior perceptual and analytical performance under ultra-low bitrate conditions.

MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization

  • Xiaodong Jin
  • Taiping Zhang

Temporal Action Localization (TAL) aims to predict the categories and temporal segments of all action instances in untrimmed videos, which is a critical and challenging task in the video understanding field. The performances of existing TAL methods remain unsatisfactory, due to the lack of highly effective temporal modeling and refined action proposal decoding. In this paper, we propose Multiscale Temporal Similarity Network (MTSN), a novel one-stage method for TAL, which mainly benefits from dynamic complementary modeling and temporal similarity decoding. Specifically, we first design Dynamic Complementary Context Aggregation (DCCA), a Transformer-based encoder. DCCA performs both long-range and short-range temporal modeling through different interaction range types of attention heads at each feature pyramid level, while higher-level semantic representations are effectively complemented with more short-range detail information in a dynamic fashion. Moreover, Temporal Similarity Mask (TSM) is designed to generate masks through an optimized globally-aware decoding process, including similarity cross-modeling, region-aware optimization and multiscale aggregated residual, which leads to high-quality action proposals. We conduct extensive experiments on two major TAL benchmarks: THUMOS14 and ActivityNet-1.3, where our method establishes a new state-of-the-art and significantly outperforms the previous best methods. Without bells and whistles, on THUMOS14, MTSN achieves an average mAP of 72.1% (+5.3%). On ActivityNet-1.3, MTSN reaches an average mAP of 40.7% (+3.1%), which crosses the 40% average mAP for the first time.

Disentangling Multi-view Representations Beyond Inductive Bias

  • Guanzhou Ke
  • Yang Yu
  • Guoqing Chao
  • Xiaoli Wang
  • Chenyang Xu
  • Shengfeng He

Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both interpretability and generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also found that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations drive us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically-consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing the upper bound of mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at https://github.com/Guanzhou-Ke/DMRIB.

Implicit Decouple Network for Efficient Pose Estimation

  • Lei Zhao
  • Le Han
  • Min Yao
  • Nenggan Zheng

In the field of pose estimation, keypoint representations can take the form of Gaussian heatmaps, classification vectors, or direct coordinates. However, the current networks suffer from a lack of consistency with these keypoint representations. They only accommodate these representations in the final layer, resulting in suboptimal efficiency and requiring a high number of parameters or computational resources. In this paper, we propose a simple yet efficient plug-and-play module, named the Implicit Decouple Module (IDM), which decouples features into two parts along the x-y axes and aggregates features in a direction-aware manner. This approach implicitly fuses direction-specific coordinate information, improving the consistency with the keypoint representations, especially in vector form. Furthermore, we introduce a fully convolutional backbone network, named the Implicit Decouple Network (IDN), which incorporates IDM without the need to maintain high-resolution features, dense multi-level feature fusion, or lots of repeated stages, while still achieving high performance. In experiments on the COCO dataset, our basic IDN without pre-training can outperform HRNet (28.5M) by 2.4 AP with 18.2M parameters, and even surpass some transformer-based methods. In the lightweight model scenario, our model outstrips Lite-HRNet by 3.9 AP with only 2.5M parameters. We also evaluate our model on the person instance segmentation task and other datasets, demonstrating its generality and effectiveness. http(s)://znk.ink/su/mm23idn.
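
One way to read the "decouple along the x-y axes and aggregate in a direction-aware manner" idea is sketched below, where features are pooled separately along each axis and broadcast back as direction-aware gates. This is an illustrative sketch only; the actual IDM design (projections, fusion, and how it ties into keypoint representations) is not reproduced here.

import torch
import torch.nn as nn

class AxisDecoupledMixing(nn.Module):
    """Illustrative module: aggregate features separately along the x and y axes.

    A sketch of the 'decouple along x-y axes' idea, not the IDM implementation.
    """
    def __init__(self, channels):
        super().__init__()
        self.proj_x = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_y = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):                   # feats: (B, C, H, W)
        col = feats.mean(dim=3, keepdim=True)   # (B, C, H, 1): aggregate along x
        row = feats.mean(dim=2, keepdim=True)   # (B, C, 1, W): aggregate along y
        # Broadcast direction-aware context back onto the feature map.
        return feats + torch.sigmoid(self.proj_x(col)) * torch.sigmoid(self.proj_y(row)) * feats

# Toy usage.
x = torch.randn(2, 32, 48, 64)
y = AxisDecoupledMixing(32)(x)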

Occluded Skeleton-Based Human Action Recognition with Dual Inhibition Training

  • Zhenjie Chen
  • Hongsong Wang
  • Jie Gui

Recently, skeleton-based human action recognition has received widespread attention in the computer vision community. However, most existing research focuses on improving recognition accuracy on complete skeleton data, while ignoring performance on incomplete skeleton data with occlusion or noise. This paper addresses occluded and noise-robust skeleton-based action recognition and presents a novel Dual Inhibition Training strategy. Specifically, we propose the Part-aware and Dual-inhibition Graph Convolutional Network (PDGCN), which comprises three parts: Input Skeleton Inhibition (ISI), Part-Aware Representation Learning (PARL), and Predicted Score Inhibition (PSI). ISI and PSI are plug-and-play modules that encourage the model to learn discriminative features from diversified body joints by effectively simulating key body-part occlusions and random occlusions. The PARL module learns both global and local representations from the whole body and body parts, respectively, and progressively fuses them during representation learning to enhance the model's robustness under occlusions. Finally, we design different settings for occluded skeleton-based human action recognition to study this problem in depth and better evaluate different approaches. Our approach achieves state-of-the-art results on different benchmarks and dramatically outperforms recent skeleton-based action recognition approaches, especially under large-scale temporal occlusion.
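
The Input Skeleton Inhibition idea, simulating key-part and random occlusions on the input skeleton, can be sketched as follows. The joint indices, probabilities, and the zero-filling convention are assumptions for illustration; in the paper this is a training-time component of PDGCN.

import torch

def input_skeleton_inhibition(skeleton, part_joints=None, p_random=0.1, p_part=0.3):
    """Simulate occlusions on skeleton input (a sketch of the ISI idea).

    skeleton:    (T, J, C) sequence of J joints with C coordinates.
    part_joints: optional list of joint indices forming a body part that is
                 dropped together with probability p_part.
    p_random:    probability of dropping each joint independently.
    """
    out = skeleton.clone()
    T, J, C = out.shape
    random_mask = torch.rand(T, J, 1) < p_random
    out = out.masked_fill(random_mask, 0.0)
    if part_joints is not None and torch.rand(1).item() < p_part:
        out[:, part_joints, :] = 0.0          # occlude an entire key body part
    return out

# Toy usage: 50 frames, 25 joints, 3D coordinates; occasionally occlude a "left arm" part.
seq = torch.randn(50, 25, 3)
occluded = input_skeleton_inhibition(seq, part_joints=[5, 6, 7, 21, 22])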

P2I-NET: Mapping Camera Pose to Image via Adversarial Learning for New View Synthesis in Real Indoor Environments

  • Xujie Kang
  • Kanglin Liu
  • Jiang Duan
  • Yuanhao Gong
  • Guoping Qiu

Given a new 6DoF camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive, while those based on learning have predominantly focused on isolated views of object categories with regular geometric structure. Differing from the traditional render-inpaint approach to new view synthesis in real indoor environments, we propose a conditional generative adversarial neural network (P2I-NET) to directly predict the new view from the given pose. P2I-NET learns the conditional distribution of the images of the environment to establish the correspondence between the camera pose and its view of the environment, and achieves this through a number of innovative designs in its architecture and training loss function. Two auxiliary discriminator constraints are introduced to enforce consistency between the pose of the generated image and that of the corresponding real-world image in both the latent feature space and the real-world pose space. Additionally, a deep convolutional neural network (CNN) is introduced to further reinforce this consistency in pixel space. We have performed extensive new view synthesis experiments on real indoor datasets. Results show that P2I-NET has superior performance against a number of NeRF-based strong baseline models. In particular, we show that P2I-NET is 40 to 100 times faster than these competitor techniques while synthesising images of similar quality. Furthermore, we contribute a new publicly available indoor environment dataset containing 22 high-resolution RGBD videos where each frame also has accurate camera pose parameters.

IRCasTRF: Inverse Rendering by Optimizing Cascaded Tensorial Radiance Fields, Lighting, and Materials From Multi-view Images

  • Wenpeng Xing
  • Jie Chen
  • Ka Chun Cheung
  • Simon See

We propose an inverse rendering pipeline that simultaneously reconstructs scene geometry, lighting, and spatially-varying material from a set of multi-view images. Specifically, the proposed pipeline involves volume and physics-based rendering, which are performed separately in two steps: exploration and exploitation. During the exploration step, our method utilizes the compactness of neural radiance fields and a flexible differentiable volume rendering technique to learn an initial volumetric field. Here, we introduce a novel cascaded tensorial radiance field method on top of the Canonical Polyadic (CP) decomposition to boost model compactness beyond conventional methods. In the exploitation step, a shading pass that incorporates a differentiable physics-based shading method is applied to jointly optimize the scene's geometry, spatially-varying materials, and lighting, using image reconstruction loss. Experimental results demonstrate that our proposed inverse rendering pipeline, IRCasTRF, outperforms prior works in inverse rendering quality. The final output is highly compatible with downstream applications like scene editing and advanced simulations. Further details are available on the project page: https://ircasrf.github.io/.

Noise-Robust Continual Test-Time Domain Adaptation

  • Zhiqi Yu
  • Jingjing Li
  • Zhekai Du
  • Fengling Li
  • Lei Zhu
  • Yang Yang

Continual test-time domain adaptation (TTA) is a challenging topic in the field of source-free domain adaptation, which focuses on addressing cross-domain multimedia information during inference with a continuously changing data distribution. Previous methods have been found to lack noise robustness, leading to a significant increase in errors under strong noise. In this paper, we address the noise-robustness problem in continual TTA by offering three effective recipes to mitigate it. At the category level, we employ the Taylor cross-entropy loss to alleviate the low confidence category bias commonly associated with cross-entropy. At the sample level, we reweight the target samples based on uncertainty to prevent the model from overfitting on noisy samples. Finally, to reduce pseudo-label noise, we propose a soft ensemble negative learning mechanism to guide the model optimization using ensemble complementary pseudo labels. Our method achieves state-of-the-art performance on three widely used continual TTA datasets, particularly in the strong noise setting that we introduced.
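
The Taylor cross-entropy recipe mentioned at the category level can be sketched as below: the -log(p) term of cross-entropy is replaced by a truncated Taylor series, which tempers the loss on low-confidence (potentially noisy) predictions. The expansion order and the pseudo-label usage are assumptions for the example.

import torch

def taylor_cross_entropy(logits, targets, order=2):
    """Taylor-series approximation of cross-entropy (a noise-robust variant).

    -log(p) is replaced by sum_{k=1..order} (1 - p)^k / k, which down-weights
    the very large gradients that plain CE assigns to low-confidence samples.
    """
    probs = torch.softmax(logits, dim=1)
    p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)     # probability of the target class
    loss = sum(((1.0 - p_t) ** k) / k for k in range(1, order + 1))
    return loss.mean()

# Toy usage with pseudo-labels, as in test-time adaptation.
logits = torch.randn(8, 10)
pseudo = logits.argmax(dim=1)
loss = taylor_cross_entropy(logits, pseudo)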

TIRDet: Mono-Modality Thermal InfraRed Object Detection Based on Prior Thermal-To-Visible Translation

  • Zeyu Wang
  • Fabien Colonnier
  • Jinghong Zheng
  • Jyotibdha Acharya
  • Wenyu Jiang
  • Kejie Huang

Cross-modality images that combine visible and infrared spectra can provide complementary information for object detection. In particular, they are well-suited for autonomous vehicle applications in dark environments with limited illumination. However, it is time-consuming to acquire a large number of pixel-aligned visible-thermal image pairs, and real-time alignment is challenging in practical driving systems. Furthermore, the quality of visible-spectrum images can be adversely affected by complex environmental conditions. In this paper, we propose a novel neural network called TIRDet, which only utilizes Thermal InfraRed (TIR) images for mono-modality object detection. To compensate for the missing visible-band information, we adopt a prior Thermal-To-Visible (T2V) translation model to obtain the translated visible images and the latent T2V codes. In addition, we introduce a novel attention-based Cross-Modality Aggregation (CMA) module, which augments the modality-translation awareness of TIRDet by preserving the T2V semantic information. Extensive experiments on the FLIR and LLVIP datasets demonstrate that TIRDet significantly outperforms all mono-modality detection methods based on thermal images, and it even surpasses most state-of-the-art (SOTA) multispectral methods that use visible-thermal image pairs. Code is available at https://github.com/zeyuwang-zju/TIRDet.

HARP: Let Object Detector Undergo Hyperplasia to Counter Adversarial Patches

  • Junzhe Cai
  • Shuiyan Chen
  • Heng Li
  • Beihao Xia
  • Zimin Mao
  • Wei Yuan

Adversarial patches can mislead object detectors into producing erroneous predictions. To defend against adversarial patches, one can take two types of protection on the model side: modifying the detector itself (e.g., adversarial training) or attaching a new model in front of the detector. However, the former often deteriorates the clean performance of detectors, and the latter may incur high deployment costs caused by too many training parameters. Inspired by the phenomenon of "bone hyperplasia" in human bodies, we present a novel model-side adversarial patch defense, called HARP (Hyperplasia based Adversarial Patch defense). Just as bone hyperplasia can enhance bone strength and skeletal stability, the hyperplasia of detectors can also help to resist adversarial patches. Following this idea, HARP improves adversarial robustness by "growing" lightweight CNN modules (i.e., hyperplasia modules) on pre-trained object detectors. We conduct extensive experiments on the PASCAL VOC and COCO datasets to compare HARP with the data-side defense JPEG and the model-side defenses adversarial training, SAC, and FNC. Experimental results show that HARP provides excellent defense against adversarial patches while maintaining clean performance, outperforming the compared defense methods. Under PGD-based adaptive attacks, HARP surpasses the recently proposed defense method SAC by 12.5% in mean average precision (mAP) on PASCAL VOC and by 13.2% on COCO. In addition, experiments confirm that the increase in model inference time caused by HARP is almost negligible.

Scale-space Tokenization for Improving the Robustness of Vision Transformers

  • Lei Xu
  • Rei Kawakami
  • Nakamasa Inoue

The performance of the Vision Transformer (ViT) model and its variants has surpassed traditional Convolutional Neural Networks (CNNs) in terms of in-distribution accuracy on most vision tasks. However, ViTs still have significant room for improvement in their robustness to input perturbations, and robustness is a critical aspect to consider when deploying ViTs in real-world scenarios. Despite this, some variants of ViT improve in-distribution accuracy and computational performance at the cost of sacrificing robustness and generalization. In this study, inspired by prior findings on the potential effectiveness of shape bias for robustness improvement and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, to improve the robustness of ViT while maintaining in-distribution accuracy. Based on this method, we build the Scale-space-based Robust Vision Transformer (SRVT) model. Our method consists of scale-space patch embedding and scale-space positional encoding. The scale-space patch embedding builds a sequence of variable-scale images and increases the model's shape bias to enhance its robustness. The scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR10/100 and ImageNet-1k) from the perspectives of in-distribution accuracy and adversarial and out-of-distribution robustness. The experimental results demonstrate our method's effectiveness in improving robustness without compromising in-distribution accuracy. In particular, our approach achieves strong adversarial robustness on the ImageNet-1k benchmark compared with state-of-the-art robust ViTs.
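
The scale-space patch embedding can be approximated with the sketch below, where bilinear downsampling stands in for a proper Gaussian scale space and each scale is patchified into tokens, with a scale index retained for scale-aware positional encoding. Shapes, scales, and the downsampling choice are assumptions; the SRVT implementation is more elaborate.

import torch
import torch.nn.functional as F

def scale_space_patches(images, scales=(1.0, 0.5, 0.25), patch_size=16):
    """Build a token sequence from an approximate scale space (a sketch).

    Each scale is patchified and the patch sequences are concatenated; a scale
    index is kept so that positional encoding can be made scale-aware.
    """
    tokens, scale_ids = [], []
    for s_idx, s in enumerate(scales):
        x = images if s == 1.0 else F.interpolate(images, scale_factor=s, mode="bilinear", align_corners=False)
        patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)   # (B, C*P*P, N_s)
        tokens.append(patches.transpose(1, 2))                             # (B, N_s, C*P*P)
        scale_ids.append(torch.full((patches.shape[-1],), s_idx))
    return torch.cat(tokens, dim=1), torch.cat(scale_ids)

toks, sid = scale_space_patches(torch.rand(2, 3, 224, 224))
print(toks.shape, sid.shape)   # torch.Size([2, 254, 768]) torch.Size([254])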

Margin MCC: Chance-Robust Metric for Video Boundary Detection with Allowed Margin

  • Kosuke Mizufune
  • Shunsuke Tanaka
  • Toshihide Yukitake
  • Tatsushi Matsubayashi

Video boundary detection is the task of dividing a video into several segments based on event changes such as scenes or actions. The most common evaluation protocol judges whether the distance between predicted boundaries and ground-truth boundaries is lower than an allowed margin and then computes the F1 score. However, we found that evaluation by F1 alone can lead to wrong conclusions, since even a completely random model can achieve an inflated F1 when the number of predictions is large. To design a metric that is robust against chance, we propose the Margin Matthews Correlation Coefficient (MMCC) as an extension of the Matthews Correlation Coefficient (MCC) to video boundary detection with an allowed margin. Although MCC is robust against chance in the standard setting, it is not obvious that the same holds in video boundary detection because of the allowed margin; in particular, some definitions of MCC do not remain constant with respect to the number of predicted boundaries. Therefore, we design MMCC, based on mathematical analysis, so that its expected value for random guessing is zero. We empirically confirm that MMCC is robust against completely random guessing and over-/under-segmentation, while F1 is not.
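
The margin-based matching that underlies both F1 and the proposed metric can be sketched as follows: predicted boundaries are greedily matched to ground-truth boundaries within the allowed margin, and an MCC-style score is computed from the resulting confusion counts. Note that this is a generic illustration; the paper's MMCC is constructed specifically so that its expected value under random guessing is zero, which this naive version does not guarantee.

import math

def margin_confusion(pred, gt, num_frames, margin=2):
    """Match predicted and ground-truth boundaries within an allowed margin.

    pred, gt: sorted lists of boundary frame indices.
    Returns (tp, fp, fn, tn) with frames as the unit of negatives.
    """
    used = set()
    tp = 0
    for p in pred:
        match = next((g for g in gt if g not in used and abs(p - g) <= margin), None)
        if match is not None:
            used.add(match)
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    tn = num_frames - tp - fp - fn
    return tp, fp, fn, tn

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

tp, fp, fn, tn = margin_confusion([10, 31, 55], [12, 30, 80], num_frames=100)
print(mcc(tp, fp, fn, tn))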

Exploring the Knowledge Transferred by Response-Based Teacher-Student Distillation

  • Liangchen Song
  • Xuan Gong
  • Helong Zhou
  • Jiajie Chen
  • Qian Zhang
  • David Doermann
  • Junsong Yuan

Response-based knowledge distillation refers to the technique of supervising the student network with the teacher network's predictions. The method is motivated by the observation that the predicted probabilities reflect relations among labels, which constitute the knowledge to be transferred. This paper explores the transferred knowledge from a novel perspective: comparing the knowledge transferred through different teachers. Two intriguing properties are observed. First, higher confidence scores in teachers' predictions lead to better distillation results, and second, teachers' incorrectly predicted training samples should be kept for distillation. We then analyze these phenomena by studying teachers' decision boundaries, some of which can help the student generalize while others may not. Based on these observations, we further propose an embarrassingly simple distillation framework named Efficient Distillation, which is effective on ImageNet with different teacher-student pairs: when using ResNet34 as the teacher, the student ResNet18 trained from scratch reaches 74.07% Top-1 accuracy within 98 GPU hours (RTX 3090), outperforming the current state-of-the-art result (73.19%) by a large margin. Our code is available at https://github.com/lsongx/EffDstl.
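
For reference, the standard response-based distillation objective that this analysis builds on is shown below: the student is supervised with the teacher's temperature-softened predictions via a KL term. The Efficient Distillation framework adds further ingredients (e.g., how teacher predictions are selected and used), which this sketch does not attempt to reproduce.

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard response-based distillation: match softened teacher probabilities.

    The KL term is scaled by T^2 so that gradients keep a comparable magnitude
    across temperatures (the usual convention in the distillation literature).
    """
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=1)
    log_student = F.log_softmax(student_logits / t, dim=1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage.
student = torch.randn(16, 1000, requires_grad=True)
teacher = torch.randn(16, 1000)
loss = response_distillation_loss(student, teacher)
loss.backward()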

Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection

  • Feng Gao
  • Jiaxu Leng
  • Ji Gan
  • Xinbo Gao

DEtection TRansformer (DETR) and its variants (DETRs) have achieved impressive performance in general object detection. However, in crowded pedestrian detection, the performance of DETRs is still unsatisfactory due to an inappropriate sample selection method that results in more false positives. To address this issue, we propose a simple but effective sample selection method for DETRs, Sample Selection for Crowded Pedestrians (SSCP), which consists of a constraint-guided label assignment scheme (CGLA) and a utilizability-aware focal loss (UAFL). Our core idea is to select learnable samples for DETRs and adaptively regulate the loss weights of samples based on their utilizability. Specifically, in CGLA, we propose a new cost function that ensures only learnable positive training samples are retained while the rest are treated as negative training samples. Further, considering the utilizability of samples, we design UAFL to adaptively assign different loss weights to learnable positive samples depending on their gradient ratio and IoU. Experimental results show that the proposed SSCP effectively improves the baselines without introducing any overhead in inference. In particular, Iter Deformable DETR is improved to 39.7(-2.0)% MR on CrowdHuman and 31.8(-0.4)% MR on CityPersons.

Data-Efficient Masked Video Modeling for Self-supervised Action Recognition

  • Qiankun Li
  • Xiaolong Huang
  • Zhifan Wan
  • Lanqing Hu
  • Shuzhe Wu
  • Jie Zhang
  • Shiguang Shan
  • Zengfu Wang

Recently, self-supervised video representation learning based on Masked Video Modeling (MVM) has demonstrated promising results for action recognition. However, existing methods face two significant challenges: (1) video actions involve a crucial temporal dimension, yet current masking strategies adopt inefficient random approaches that undermine the low-density dynamic motion clues in videos; and (2) pre-training requires large-scale datasets and significant computing resources (including large batch sizes and enormous numbers of iterations). To address these issues, we propose a novel method named Data-Efficient Masked Video Modeling (DEMVM) for self-supervised action recognition. Specifically, a novel masking strategy named Flow-Guided Dense Masking (FGDM) is proposed to facilitate efficient learning by focusing more on action-related temporal clues: it applies dense masking to dynamic regions based on optical flow priors, and sparse masking to background regions. Furthermore, DEMVM introduces a 3D video tokenizer to enhance the modeling of temporal clues. Finally, Progressive Masking Ratio (PMR) and 2D initialization strategies are presented to enable the model to adapt to the characteristics of the MVM paradigm during different training stages. Extensive experiments on multiple benchmarks, UCF101, HMDB51, and Mimetics, demonstrate that our method achieves state-of-the-art performance on the downstream action recognition task with both high data efficiency and low computational cost. More interestingly, the few-shot experiment on the Mimetics dataset shows that DEMVM can accurately recognize actions even in the presence of context bias.
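
The Flow-Guided Dense Masking strategy can be sketched as sampling mask positions with probability proportional to per-patch flow magnitude, so that dynamic regions are masked densely and background sparsely. The patch size, masking ratio, and the mixing with a uniform term are assumptions; the actual FGDM sampling may differ.

import torch

def flow_guided_mask(flow, mask_ratio=0.75, patch_size=16, dynamic_bias=0.9):
    """Sample a patch mask that is denser on high-motion regions (a sketch of the FGDM idea).

    flow: (T, 2, H, W) optical flow; patches with larger flow magnitude are
    masked with higher probability.
    """
    mag = flow.norm(dim=1)                                       # (T, H, W) flow magnitude
    # Average flow magnitude per spatio-temporal patch.
    patches = mag.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    score = patches.mean(dim=(-1, -2)).flatten()                 # one score per patch
    prob = dynamic_bias * score / score.sum().clamp(min=1e-8)    # bias sampling toward motion
    prob = prob + (1 - dynamic_bias) / prob.numel()              # keep some mass on static patches
    n_mask = int(mask_ratio * prob.numel())
    idx = torch.multinomial(prob, n_mask, replacement=False)
    mask = torch.zeros(prob.numel(), dtype=torch.bool)
    mask[idx] = True
    return mask                                                   # True = masked patch

# Toy usage: 8 frames of 64x64 flow fields.
mask = flow_guided_mask(torch.randn(8, 2, 64, 64))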

DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

  • Teng Fu
  • Xiaocong Wang
  • Haiyang Yu
  • Ke Niu
  • Bin Li
  • Xiangyang Xue

Multiple object tracking (MOT) tends to become more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios when occlusions occur. Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture, so that our model can exhibit strong robustness and perform well under crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder to prevent the mutual suppression between neighboring trajectories under crowded scenes. Notably, the proposed method requires no additional modules like matching strategy and motion state estimation in inference. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.

Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion

  • Peiran Xu
  • Yadong Mu

Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. There are two factors closely related to the success of this task, namely consensus extraction, and the dispersion of consensus to each image. Most previous works represent the group consensus using local features, while we instead utilize a hierarchical Transformer module for extracting semantic-level consensus. Therefore, it can obtain a more comprehensive representation of the common object category, and exclude interference from other objects that share local similarities with the target object. In addition, we propose a Transformer-based dispersion module that takes into account the variation of the co-salient object in different scenes. It distributes the consensus to the image feature maps in an image-specific way while making full use of interactions within the group. These two modules are integrated with a ViT encoder and an FPN-like decoder to form an end-to-end trainable network, without additional branch and auxiliary loss. The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance.

BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data

  • Xuenan Xu
  • Zhiling Zhang
  • Zelin Zhou
  • Pingyue Zhang
  • Zeyu Xie
  • Mengyue Wu
  • Kenny Q. Zhu

Compared with ample visual-text pre-training research, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods incorporate the visual modality as a pivot for audio-text pre-training, which inevitably induces data noise. In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality, so that potential noise from modality mismatch is eliminated. Furthermore, we propose caption generation under the guidance of AudioSet tags, leading to more accurate captions. With these two improvements, we curate high-quality, large-scale parallel audio-text data, on which we perform audio-text pre-training. We comprehensively demonstrate the performance of the pre-trained model on a series of downstream audio-related tasks, including single-modality tasks like audio classification and tagging, as well as cross-modal tasks consisting of audio-text retrieval and audio-based text generation. Experimental results indicate that our approach achieves state-of-the-art zero-shot classification performance on most datasets, suggesting the effectiveness of our synthetic data. The audio encoder also serves as an efficient pattern recognition model when fine-tuned on audio-related tasks. The synthetic data, pre-trained models, code, and checkpoints are available at https://github.com/wsntxxn/BLAT and https://zenodo.org/record/8218696/.

A Simple Baseline for Open-World Tracking via Self-training

  • Bingyang Wang
  • Tanlin Li
  • Jiannan Wu
  • Yi Jiang
  • Huchuan Lu
  • You He

Open-World Tracking (OWT) presents a challenging yet emerging problem, aiming to track every object of any category. Different from traditional Multi-Object Tracking (MOT), OWT needs to additionally track targets beyond predefined categories in the training set. To address the problem, we propose a simple baseline, SimOWT. We simplify the recently proposed OWT algorithm by streamlining the association module and accelerating the inference speed. By leveraging the self-training paradigm, SimOWT can distinguish unknown-class targets from the background, fully unleashing the potential of TAO-OW dataset. Furthermore, we enhance SimOWT from the perspectives of Pseudo Boxes Merging and Re-Weighting, thereby discovering more targets belonging to unknown classes and reducing the sensitivity of the model to low-quality pseudo-labels. Benefiting from the proposed approaches, SimOWT demonstrates a significant improvement in tracking performance on unknown classes. Moreover, the comprehensive experiments on the TAO-OW benchmark demonstrate that our model outperforms the state-of-the-art OWT method, OWTB, with an absolute gain of 11.2% OWTA and 16.4% detection recall respectively on unknown classes. The code is released at https://github.com/22109095/SimOWT.

VTLayout: A Multi-Modal Approach for Video Text Layout

  • Yuxuan Zhao
  • Jin Ma
  • Zhongang Qi
  • Zehua Xie
  • Yu Luo
  • Qiusheng Kang
  • Ying Shan

The rapid explosion of video distribution is accompanied by a massive amount of video text, which encompasses rich information about the video content. While previous research has primarily focused on text extraction from videos, such as text detection, tracking, recognition, and end-to-end spotting, the layout of video text has received limited attention. As different text categories convey distinct meanings, video text layout is critical for video understanding tasks such as video summarization and shooting environment comprehension. To bridge the gap between video OCR and understanding, we explore the study of video text layout in this work. We first optimize the layout annotation of BOVText, a bilingual, open-world video text dataset, by expanding the text categories and defining five clear categories: scene, subtitle, title, logo, and other. Additionally, we rectify the original unreasonable layout annotations based on these definitions. We also propose a Video-level Text Layout model (VTLayout) to address the layout problem, which fuses textual, visual, and spatial-temporal embeddings of video text trajectories. To the best of our knowledge, this is the first method to tackle text layout at the video level. Our method outperforms image-level layout methods across all text categories and exhibits faster inference speed. This study underscores the significance of video text layout in video understanding and offers an effective solution to this challenge. Our annotation is available at https://github.com/TencentARC/VTLayout.

SEAR: Semantically-grounded Audio Representations

  • Rajat Hebbar
  • Digbalay Bose
  • Shrikanth Narayanan

Audio supports visual storytelling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, as well as the background context. Visual captions provide a condensed view of an image, giving a natural language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, to the order of 9.6M captions, and learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally, we show that the learned model can be transferred in a zero-shot manner to both movie understanding tasks and general action recognition.

DocDiff: Document Enhancement via Residual Diffusion Models

  • Zongyuan Yang
  • Baolin Liu
  • Yongping Xxiong
  • Lan Yi
  • Guibin Wu
  • Xiaojun Tang
  • Ziqi Liu
  • Junjie Zhou
  • Xing Zhang

Removing degradation from document images not only improves their visual quality and readability but also enhances the performance of numerous automated document analysis and recognition tasks. However, existing regression-based methods optimized for pixel-level distortion reduction tend to suffer from significant loss of high-frequency information, leading to distorted and blurred text edges. To compensate for this major deficiency, we propose DocDiff, the first diffusion-based framework specifically designed for diverse and challenging document enhancement problems, including document deblurring, denoising, and removal of watermarks and seals. DocDiff consists of two modules: the Coarse Predictor (CP), which is responsible for recovering the primary low-frequency content, and the High-Frequency Residual Refinement (HRR) module, which adopts diffusion models to predict the residual (high-frequency information, including text edges) between the ground truth and the CP-predicted image. DocDiff is a compact and computationally efficient model that benefits from a well-designed network architecture, an optimized training loss objective, and a deterministic sampling process with short time steps. Extensive experiments demonstrate that DocDiff achieves state-of-the-art (SOTA) performance on multiple benchmark datasets and can significantly enhance the readability and recognizability of degraded document images. Furthermore, the HRR module in pre-trained DocDiff is plug-and-play and ready-to-use, with only 4.17M parameters; it greatly sharpens the text edges generated by SOTA deblurring methods without additional joint training. Code is available at: https://github.com/Royalvice/DocDiff.

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World

  • Boshen Xu
  • Sipeng Zheng
  • Qin Jin

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representations from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representations. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization.

GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction

  • Sihan Ma
  • Qiong Cao
  • Hongwei Yi
  • Jing Zhang
  • Dacheng Tao

Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and distance change towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual-prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates good generalization and discriminating abilities in challenging cases including complex and ambiguous human-ground interactions. The code will be available at https://github.com/xymsh/GraMMaR.

SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody

  • Hui Lu
  • Xixin Wu
  • Zhiyong Wu
  • Helen Meng

Disentangled speech representation learning aims to separate different factors of variation from speech into disjoint representations. This paper focuses on disentangling speech into representations for three factors: spoken content, speaker timbre, and speech prosody. Many previous methods for speech disentanglement have focused on separating spoken content and speaker timbre. However, the lack of explicit modeling of prosodic information leads to degraded speech generation performance and uncontrollable prosody leakage into content and/or speaker representations. While some recent methods have utilized explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, there are no end-to-end methods to simultaneously disentangle three factors using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method to disentangle speech into representations for content, timbre, and prosody. Based on a VAE, SpeechTripleNet restricts the structures of the latent variables and the amount of information captured in them to induce disentanglement. It is a purely unsupervised/self-supervised learning method that only requires speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet is effective in achieving triple-factor speech disentanglement, as well as controllable speech editing concerning different factors.

Generating Explanations for Embodied Action Decision from Visual Observation

  • Xiaohan Wang
  • Yuehu Liu
  • Xinhang Song
  • Beibei Wang
  • Shuqiang Jiang

Getting trust is crucial for embodied agents (such as robots and autonomous vehicles) to collaborate with human beings, especially non-experts. The most direct way to achieve mutual understanding is through natural language explanation. Existing research considers generating visual explanations for object recognition, while explaining embodied decisions remains largely unexplored. In this paper, we study generating action decisions and explanations based on visual observation. Distinct from explanations for recognition, justifying an action requires showing why it is better than other actions. Besides, an understanding of scene structure is required since the agent needs to interact with the environment (e.g. navigation, moving objects). We introduce a new dataset THOR-EAE (Embodied Action Explanation) collected based on the AI2-THOR simulator. The dataset consists of over 840,000 egocentric images of indoor embodied observation which are annotated with optimal action labels and explanation sentences. An explainable decision-making criterion is developed considering scene layout and action attributes for efficient annotation. We propose a graph action justification model, exploiting graph neural networks for obstacle-surroundings relation representation and justifying the actions under the guidance of decision results. Experimental results on the THOR-EAE dataset showcase its challenge and the effectiveness of the proposed method.

Scene-aware Human Pose Generation using Transformer

  • Jieteng Yao
  • Junjie Chen
  • Li Niu
  • Bin Sheng

Affordance learning considers the interaction opportunities for an actor in the scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods can be divided into two categories depending on whether pose templates are used. Our proposed method belongs to the template-based category, which benefits from the representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and the scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on the Sitcom dataset demonstrate the effectiveness of our method.
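
As an illustration of the query-template association, the sketch below (all shapes and module choices are assumptions rather than the paper's code) lets one learnable query per pose template attend to a flattened scene feature map and regress a per-template scale and joint offsets.

```python
import torch
import torch.nn as nn

num_templates, d_model, n_joints = 16, 128, 17
queries = nn.Parameter(torch.randn(num_templates, d_model))    # one query embedding per pose template
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
scale_head = nn.Linear(d_model, 1)
offset_head = nn.Linear(d_model, 2 * n_joints)

scene_feat = torch.randn(1, 64, d_model)       # flattened H*W scene tokens (assumed shape)
q = queries.unsqueeze(0)                       # (1, num_templates, d_model)
out, _ = attn(q, scene_feat, scene_feat)       # interaction between queries and the scene feature map
scale = scale_head(out).sigmoid()              # per-template scale in (0, 1)
offsets = offset_head(out).view(1, num_templates, n_joints, 2)  # per-joint 2D offsets
print(scale.shape, offsets.shape)
```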

Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

  • Wanying Zhang
  • Shen Zhao
  • Fanyang Meng
  • Songtao Wu
  • Mengyuan Liu

With potential applications in fields including intelligent surveillance and human-robot interaction, the human motion prediction task has become a hot research topic and has achieved notable success, especially with the recent Graph Convolutional Network (GCN). The current human motion prediction task usually focuses on predicting human motions for atomic actions. Observing that atomic actions can happen at the same time and thus form composite actions, we propose the composite human motion prediction task. To handle this task, we first present a Composite Action Generation (CAG) module to generate synthetic composite actions for training, thus avoiding the laborious work of collecting composite action samples. Moreover, we alleviate the need for a more complicated model to handle composite actions by presenting a Dynamic Compositional Graph Convolutional Network (DC-GCN). Extensive experiments on the Human3.6M dataset and our newly collected CHAMP dataset consistently verify the efficiency of our DC-GCN method, which achieves state-of-the-art motion prediction accuracies while incurring little extra computational cost compared to traditional GCN-based human motion prediction methods.

Diffusion-Augmented Depth Prediction with Sparse Annotations

  • Jiaqi Li
  • Yiran Wang
  • Zihao Huang
  • Jinghong Zheng
  • Ke Xian
  • Zhiguo Cao
  • Jianming Zhang

Depth estimation aims to predict dense depth maps. In autonomous driving scenes, the sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information: they overfit to valid pixels and fail to restore spatial structures. Self-supervised methods have been proposed to address this problem, but their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of the diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by incorporating object information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation

  • Chunwei Wu
  • Guitao Cao
  • Yan Li
  • Xidong Xi
  • Wenming Cao
  • Hong Wang

Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on an adaptive threshold of the learning state, customizing the learning strategy to best fit the data properties. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties. The low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground-truth labels. By employing different learning strategies to propagate labels from inner to outlier instances, CtO clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain the local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates from inner to outlier but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the art on three public benchmarks: Office-31, Office-Home, and VisDA.

Beware of Overcorrection: Scene-induced Commonsense Graph for Scene Graph Generation

  • Lianggangxu Chen
  • Jiale Lu
  • Youqi Song
  • Changbo Wang
  • Gaoqi He

Scene graph generation is largely restricted by class imbalance. Previous methods have alleviated the class imbalance problem by incorporating commonsense information into the classification, enabling the prediction model to rectify an incorrect head class into the correct tail class. However, the results of commonsense-based models are typically overcorrected, e.g., a visually correct head class is forcibly modified into a wrong tail class. We argue that there are two principal reasons for this phenomenon. First, existing models ignore the semantic gap between commonsense knowledge and real scenes. Second, current commonsense fusion strategies propagate the neighbors in the visual-linguistic contexts without long-range correlation. To alleviate overcorrection, we formulate the commonsense-based scene graph generation task as two sub-problems: scene-induced commonsense graph generation (SI-CGG) and commonsense-inspired scene graph generation (CI-SGG). In the SI-CGG module, unlike conventional methods using a fixed commonsense graph, we adaptively adjust the node embeddings in the commonsense graph according to their visual appearance and configure new reasoning edges under a specific visual context. The CI-SGG module is proposed to propagate the information from the scene-induced commonsense graph back to the scene graph. It updates the representation of each node in the scene graph by aggregating neighbourhood information at different scales. Through maximum likelihood optimisation of the logarithmic Gaussian process, the scene graph automatically adapts to the different neighbors in the visual-linguistic contexts. Systematic experiments on the Visual Genome dataset show that our full method achieves state-of-the-art performance.

Scene Text Segmentation with Text-Focused Transformers

  • Haiyang Yu
  • Xiaocong Wang
  • Ke Niu
  • Bin Li
  • Xiangyang Xue

Text segmentation is a crucial aspect of various text-related tasks, including text erasing, text editing, and font style transfer. In recent years, multiple text segmentation datasets, such as TextSeg focusing on Latin text segmentation and BTS on bilingual text segmentation, have been proposed. However, existing methods either disregard the annotations of text location or directly use pre-trained text detectors. In general, these methods cannot fully utilize the annotations of text location in the datasets. To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework, where text detection and segmentation are jointly optimized. In the proposed framework, we first extract multi-level global visual features through residual convolution blocks and then predict the mask of text areas using a text detection head. Subsequently, we develop a text-focused module that compels the model to pay more attention to text areas. Specifically, we introduce two types of attention masks to extract corresponding features: text-aware and instance-aware features. Finally, we employ hierarchical Transformer encoders to fuse multi-level features and predict the text mask with a text segmentation head. To evaluate the effectiveness of our method, we conduct experiments on six text segmentation benchmarks. The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art (SOTA) methods by a clear margin in most cases. The code and supplementary materials are available at https://github.com/FudanVI/FudanOCR/tree/main/text-focused-Transformers.

MIEP: Channel Pruning with Multi-granular Importance Estimation for Object Detection

  • Liangwei Jiang
  • Jiaxin Chen
  • Di Huang
  • Yunhong Wang

This paper investigates compressing a pre-trained deep object detector to a lightweight one by channel pruning, which has proved effective and flexible in promoting efficiency. However, the majority of existing works trim channels based on a monotonous criterion for general purposes, i.e., the importance to the task-specific loss. They are prone to overly prune intermediate layers and simultaneously leave large intra-layer redundancy, severely deteriorating the detection accuracy. To address the issues above, we propose a novel channel pruning approach with multi-granular importance estimation (MIEP), consisting of the Feature-level Object-sensitive Importance (FOI) and the Intra-layer Redundancy-aware Importance (IRI). The former puts large weights on channels that are critical for object representation through the guidance of object features from the pre-trained model, and mitigates over-pruning when combined with the task-specific loss. The latter groups highly correlated channels based on clustering, which are subsequently pruned with priority to decrease redundancy. Extensive experiments on the COCO and VOC benchmarks demonstrate that MIEP remarkably outperforms the state-of-the-art channel pruning approaches, achieves a better balance between accuracy and efficiency compared to lightweight object detectors, and generalizes well to various detection frameworks (e.g., Faster-RCNN and FSAF) and tasks (e.g., classification).

SESSION: Poster Session II: Understanding Multimedia Content -- Multimodal Fusion and Embedding

Disentangled Representation Learning with Causality for Unsupervised Domain Adaptation

  • Shanshan Wang
  • Yiyang Chen
  • Zhenwei He
  • Xun Yang
  • Mengzhu Wang
  • Quanzeng You
  • Xingyi Zhang

Most efforts in unsupervised domain adaptation (UDA) focus on learning the domain-invariant representations between the two domains. However, such representations may still confuse two patterns due to the domain gap. Considering that semantic information is useful for the final task and domain information always indicates the discrepancy between two domains, to address this issue, we propose to decouple the representations of semantic features from domain features to reduce domain bias. Different from traditional methods, we adopt a simple but effective module with only one domain discriminator to decouple the representations, offering two benefits. Firstly, it eliminates the need for labeled sample pairs, making it more suitable for UDA. Secondly, without adversarial learning, our model can achieve a more stable training phase. Moreover, to further enhance the task-specific features, we employ a causal mechanism to separate semantic features related to causal factors from the overall feature representations. Specifically, we utilize a dual-classifier strategy, where each classifier is fed with the entire features and the semantic features, respectively. By minimizing the discrepancy between the outputs of the two classifiers, the causal influence of the semantic features is accentuated. Experiments on several public datasets demonstrate the proposed model can outperform the state-of-the-art methods. Our code is available at: https://github.com/qzxRtY37/DRLC.
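
A minimal sketch of the dual-classifier strategy described above, under assumed dimensions and a simple L1 discrepancy; splitting the feature vector into a "semantic" part is purely illustrative.

```python
import torch
import torch.nn as nn

feat_dim, sem_dim, num_classes = 256, 128, 10
clf_full = nn.Linear(feat_dim, num_classes)   # classifier fed with the entire features
clf_sem = nn.Linear(sem_dim, num_classes)     # classifier fed with the semantic features only

features = torch.randn(8, feat_dim)           # full representation of a mini-batch
semantic = features[:, :sem_dim]              # assumed semantic subset of the features

p_full = clf_full(features).softmax(dim=1)
p_sem = clf_sem(semantic).softmax(dim=1)
discrepancy = (p_full - p_sem).abs().mean()   # discrepancy between the two classifiers' outputs
discrepancy.backward()                        # minimizing it accentuates the semantic/causal features
```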

Localized and Balanced Efficient Incomplete Multi-view Clustering

  • Jie Wen
  • Gehui Xu
  • Chengliang Liu
  • Lunke Fei
  • Chao Huang
  • Wei Wang
  • Yong Xu

In recent years, many incomplete multi-view clustering methods have been proposed to address the challenging unsupervised clustering problem on multi-view data with missing views. However, most of the existing works are inapplicable to large-scale clustering tasks and their clustering results are unstable, since these methods have high computational complexities and their results are produced by k-means rather than their designed learning models. In this paper, we propose a new one-step incomplete multi-view clustering model, called Localized and Balanced Incomplete Multi-view Clustering (LBIMVC), to address these issues. Specifically, LBIMVC develops a new graph regularized incomplete multi-matrix-factorization model to obtain the unique clustering result by learning a consensus probability representation, where each element of the consensus representation directly reflects the probability of the corresponding sample belonging to the class. In addition, the proposed graph regularized model integrates geometric preserving and consensus representation learning into one term without introducing any extra constraint terms and parameters to explore the structure of the data. Moreover, to prevent samples from being excessively assigned to only a few clusters, a balanced constraint is introduced into the model. Experimental results on four databases demonstrate that our method not only obtains competitive clustering performance, but also runs faster than some state-of-the-art methods.

Interpolation Normalization for Contrast Domain Generalization

  • Mengzhu Wang
  • Junyang Chen
  • Huan Wang
  • Huisi Wu
  • Zhidan Liu
  • Qin Zhang

Domain generalization refers to the challenge of training a model from various source domains that can generalize well to unseen target domains. Contrastive learning is a promising solution that aims to learn domain-invariant representations by utilizing rich semantic relations among sample pairs from different domains. One simple approach is to bring positive sample pairs from different domains closer, while pushing negative pairs further apart. However, in this paper, we find that directly applying contrastive-based methods is not effective in domain generalization. To overcome this limitation, we propose to leverage a novel contrastive learning approach that promotes class-discriminative and class-balanced features from source domains. Essentially, sample representations from the same category are encouraged to cluster together, while those from different categories are spread out, thus enhancing the model's generalization capability. Furthermore, most existing contrastive learning methods use batch normalization, which may prevent the model from learning domain-invariant features. Inspired by recent research on universal representations for neural networks, we propose a simple emulation of this mechanism by utilizing batch normalization layers to distinguish visual classes and formulating a way to combine them for domain generalization tasks. Our experiments demonstrate a significant improvement in classification accuracy over state-of-the-art techniques on popular domain generalization benchmarks, including Digits-DG, PACS, Office-Home and DomainNet.

Multi-teacher Self-training for Semi-supervised Node Classification with Noisy Labels

  • Yujing Liu
  • Zongqian Wu
  • Zhengyu Lu
  • Guoqiu Wen
  • Junbo Ma
  • Guangquan Lu
  • Xiaofeng Zhu

Graph neural networks (GNNs) have achieved promising results for semi-supervised learning tasks on graph-structured data. However, most existing methods assume that the training data carry correct labels, whereas in the real world graph-structured data often carry noisy labels, which reduces the effectiveness of GNNs. To address this issue, this paper proposes a new label correction method, called multi-teacher self-training (MTS-GNN for short), to conduct semi-supervised node classification with noisy labels. Specifically, we first save the parameters of the model trained in earlier iterations as teacher models, and then use them to guide the processes, including model training, noisy label removal, and pseudo-label selection, in the later iterations of the training process of semi-supervised node classification. As a result, guided by the teacher models, the proposed method improves model effectiveness by mitigating the over-fitting issue, and improves the accuracy of noisy label removal and the quality of pseudo-label selection. Extensive experimental results on real datasets show that our method achieves the best effectiveness compared to state-of-the-art methods.

Long Short-Term Graph Memory Against Class-imbalanced Over-smoothing

  • Liang Yang
  • Jiayi Wang
  • Tingting Zhang
  • Dongxiao He
  • Chuan Wang
  • Yuanfang Guo
  • Xiaochun Cao
  • Bingxin Niu
  • Zhen Wang

Most Graph Neural Networks (GNNs) follow the message-passing scheme. Residual connection is an effective strategy to tackle GNNs' over-smoothing issue and performance reduction issue on non-homophilic networks. Unfortunately, the coarse-grained residual connection still suffers from the class-imbalanced over-smoothing issue, due to the fixed and linear combination of topology and attribute in node representation learning. To make the combination flexible enough to capture complicated relationships, this paper reveals that the residual connection needs to be node-dependent, layer-dependent, and related to both topology and attribute. To alleviate the difficulty in specifying complicated relationships, this paper presents a novel perspective on GNNs, i.e., the representations of one node in different layers can be seen as a sequence of states. From this perspective, existing residual connections are not flexible enough for sequence modeling. Therefore, a novel node-dependent residual connection, i.e., the Long Short-Term Graph Memory Network (LSTGM), is proposed, employing Long Short-Term Memory (LSTM) to model the sequence of node representations. To fully exploit the graph topology, LSTGM innovatively enhances the updated memory and three gates with graph topology. A speedup version is also proposed for efficient training. Experimental evaluations on real-world datasets demonstrate its effectiveness in preventing the over-smoothing issue and handling networks with heterophily.
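
To illustrate the sequence view of layer-wise node representations, the sketch below (shapes and wiring are assumptions, and the topology-enhanced memory and gates are omitted) feeds each node's per-layer states into a standard LSTM that plays the role of a learnable, node-dependent residual connection.

```python
import torch
import torch.nn as nn

num_nodes, hidden, num_layers = 100, 64, 4
# per-layer node representations, e.g. the outputs of 4 message-passing layers
layer_outputs = [torch.randn(num_nodes, hidden) for _ in range(num_layers)]

seq = torch.stack(layer_outputs, dim=1)        # (num_nodes, num_layers, hidden): one sequence per node
lstm = nn.LSTM(hidden, hidden, batch_first=True)
out, (h_n, _) = lstm(seq)                      # the gates decide how much of each layer to keep per node
final_node_repr = h_n[-1]                      # (num_nodes, hidden), fed to a classifier downstream
print(final_node_repr.shape)
```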

Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning

  • Zitan Chen
  • Zhuang Qi
  • Xiao Cao
  • Xiangxian Li
  • Xiangxu Meng
  • Lei Meng

Representation learning for images has been advanced by recent progress in more complex neural models such as the Vision Transformers and new learning theories such as the structural causal models. However, these models mainly rely on the classification loss to implicitly regularize the class-level data distributions, and they may face difficulties when handling classes with diverse visual patterns. We argue that the incorporation of the structural information between data samples may improve this situation. To achieve this goal, this paper presents a framework termed Class-level Structural Relation Modeling and Smoothing for Visual Representation Learning (CSRMS), which includes the Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning modules to model a relational graph of the entire dataset and perform class-aware smoothing and regularization operations to alleviate the issue of intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in the feature space and identify three types of class-level sample relations for the training set; the Class-aware Graph Sampling module extends the typical training batch construction process with three strategies to sample dataset-level sub-graphs; and the Relational Graph-Guided Representation Learning module employs a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments demonstrate the effectiveness of structured knowledge modelling for enhanced representation learning and show that CSRMS can be incorporated with any state-of-the-art visual representation learning models for performance gains. The source codes and demos have been released at https://github.com/czt117/CSRMS.

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

  • Shengkai Sun
  • Daizong Liu
  • Jianfeng Dong
  • Xiaoye Qu
  • Junyu Gao
  • Xun Yang
  • Xun Wang
  • Meng Wang

Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion), then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from the complex yet redundant multi-stream model designs, each of which is also limited to the fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features for reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, possessing approximately the same complexity as uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at https://github.com/HuiGuanLab/UmURL.

Little Strokes Fell Great Oaks: Boosting the Hierarchical Features for Multi-exposure Image Fusion

  • Pan Mu
  • Zhiying Du
  • Jinyuan Liu
  • Cong Bai

In recent years, deep learning networks have made remarkable strides in the domain of multi-exposure image fusion. Nonetheless, prevailing approaches often involve directly feeding over-exposed and under-exposed images into the network, which leads to the under-utilization of inherent information present in the source images. Additionally, unsupervised techniques predominantly employ rudimentary weighted summation for color channel processing, culminating in an overall desaturated final image tone. To partially mitigate these issues, this study proposes a gamma correction module specifically designed to fully leverage latent information embedded within source images. Furthermore, a modified transformer block, embracing self-attention mechanisms, is introduced to optimize the fusion process. Ultimately, a novel color enhancement algorithm is presented to augment image saturation while preserving intricate details. The source code is available at https://github.com/ZhiyingDu/BHFMEF.
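
As a toy illustration of the gamma correction idea, the snippet below applies a single learnable gamma to an under-exposed input; this parameterization is an assumption and is far simpler than the module proposed in the paper.

```python
import torch
import torch.nn as nn

class GammaCorrection(nn.Module):
    """Learnable gamma, initialized to 1.0, applied to intensities in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))   # gamma = exp(log_gamma)
    def forward(self, x):
        return x.clamp(min=1e-6) ** self.log_gamma.exp()

under = torch.rand(1, 3, 32, 32) * 0.3   # toy under-exposed image
corrected = GammaCorrection()(under)     # a trained gamma < 1 would brighten it
print(under.mean().item(), corrected.mean().item())
```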

Triple-Granularity Contrastive Learning for Deep Multi-View Subspace Clustering

  • Jing Wang
  • Songhe Feng
  • Gengyu Lyu
  • Zhibin Gu

Multi-view subspace clustering (MVSC), which leverages comprehensive information from multiple views to effectively reveal the intrinsic relationships among instances, has garnered significant research interest. However, previous MVSC research focuses on exploring the cross-view consistent information only in the instance representation hierarchy or affinity relationship hierarchy, which prevents a joint investigation of the multi-view consistency in multiple hierarchies. To this end, we propose a Triple-gRanularity contrastive learning framework for deep mUlti-view Subspace clusTering (TRUST), which benefits from the comprehensive discovery of valuable information from three hierarchies, including the instance, specific-affinity relationship, and consensus-affinity relationship. Specifically, we first use multiple view-specific autoencoders to extract noise-robust instance representations, which are then respectively input into the MLP model and self-representation model to obtain high-level instance representations and view-specific affinity matrices. Then, the instance and specific-affinity relationship contrastive regularization terms are separately imposed on the high-level instance representations and view-specific affinity matrices, ensuring the cross-view consistency can be found from the instance representations to the view-specific affinity matrices. Furthermore, multiple view-specific affinity matrices are fused into a consensus one associated with the consensus-affinity relationship contrastive constraint, which embeds the local structural relationship of high-level instance representations into the consensus affinity matrix. Extensive experiments on various datasets demonstrate that our method is more effective when compared with other state-of-the-art methods.

CTCP: Cross Transformer and CNN for Pansharpening

  • Zhao Su
  • Yong Yang
  • Shuying Huang
  • Weiguo Wan
  • Wei Tu
  • Hangyuan Lu
  • Changjie Chen

Pansharpening fuses a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to obtain an enhanced LRMS image with high spectral and spatial resolution. The current Transformer-based pansharpening methods neglect the interaction between the extracted long- and short-range features, resulting in spectral and spatial distortion in the fusion results. To address this issue, a novel cross Transformer and convolutional neural network (CNN) for pansharpening (CTCP) is proposed to achieve better fusion results by designing a cross mechanism, which can enhance the interaction between long- and short-range features. First, a dual branch feature extraction module (DBFEM) is constructed to extract the features from the LRMS and PAN images, respectively, reducing the aliasing of the two image features. In the DBFEM, to improve the feature representation ability of the network, a cross long-short-range feature module (CLSFM) is designed by combining the feature learning capabilities of Transformer and CNN via the cross mechanism, which achieves the integration of long-short-range features. Then, to improve the ability of spectral feature representation, a spectral feature enhancement fusion module (SFEFM) based on a frequency channel attention is constructed to realize feature fusion. Finally, the shallow features from the PAN image are reused to provide detail features, which are integrated with the fused features to obtain the final pansharpened results. To the best of our knowledge, this is the first attempt to introduce the cross mechanism between Transformer and CNN in the pansharpening field. Numerous experiments show that our CTCP outperforms some state-of-the-art (SOTA) approaches both subjectively and objectively. The source code will be released at https://github.com/zhsu99/CTCP.

Chain of Propagation Prompting for Node Classification

  • Yonghua Zhu
  • Zhenyun Deng
  • Yang Chen
  • Robert Amor
  • Michael Witbrock

Graph Neural Networks (GNN) are an effective technique for node classification, but their performance is easily affected by the quality of the primitive graph and the limited receptive field of message-passing. In this paper, we propose a new self-attention method, namely Chain of Propagation Prompting (CPP), to address the above issues as well as reduce dependence on label information when employing self-attention for node classification. To do this, we apply the self-attention framework to reduce the impact of a low-quality graph and to obtain a maximal receptive field for the message-passing. We also design a simple pattern of message-passing as the prompt to make self-attention capture complex patterns and reduce the dependence on label information. Comprehensive experimental results on real graph datasets demonstrate that CPP outperforms all relevant comparison methods.

Efficient Multi-View Graph Clustering with Local and Global Structure Preservation

  • Yi Wen
  • Suyuan Liu
  • Xinhang Wan
  • Siwei Wang
  • Ke Liang
  • Xinwang Liu
  • Xihong Yang
  • Pei Zhang

Anchor-based multi-view graph clustering (AMVGC) has received abundant attention owing to its high efficiency and the capability to capture complementary structural information across multiple views. Intuitively, a high-quality anchor graph plays an essential role in the success of AMVGC. However, the existing AMVGC methods only consider single-structure information, i.e., local or global structure, which provides insufficient information for the learning task. To be specific, the over-scattered global structure leads to learned anchors failing to depict the cluster partition well. In contrast, the local structure with an improper similarity measure results in potentially inaccurate anchor assignment, ultimately leading to sub-optimal clustering performance. To tackle the issue, we propose a novel anchor-based multi-view graph clustering framework termed Efficient Multi-View Graph Clustering with Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified framework with a theoretical guarantee is designed to capture local and global information. Besides, EMVGC-LG jointly optimizes anchor construction and graph learning to enhance the clustering quality. In addition, EMVGC-LG inherits the linear complexity of existing AMVGC methods respecting the sample number, which is time-economical and scales well with the data size. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method.

Scalable Incomplete Multi-View Clustering with Structure Alignment

  • Yi Wen
  • Siwei Wang
  • Ke Liang
  • Weixuan Liang
  • Xinhang Wan
  • Xinwang Liu
  • Suyuan Liu
  • Jiyuan Liu
  • En Zhu

The success of existing multi-view clustering (MVC) relies on the assumption that all views are complete. However, samples are usually partially available due to data corruption or sensor malfunction, which motivates research on incomplete multi-view clustering (IMVC). Although several anchor-based IMVC methods have been proposed to process large-scale incomplete data, they still suffer from the following drawbacks: i) Most existing approaches neglect the inter-view discrepancy and enforce cross-view representation to be consistent, which would corrupt the representation capability of the model; ii) Due to the sample disparity between different views, the learned anchors might be misaligned, which we refer to as the Anchor-Unaligned Problem for Incomplete data (AUP-ID). The AUP-ID causes inaccurate graph fusion and degrades clustering performance. To tackle these issues, we propose a novel incomplete anchor graph learning framework termed Scalable Incomplete Multi-View Clustering with Structure Alignment (SIMVC-SA). Specifically, we construct the view-specific anchor graph to capture the complementary information from different views. In order to solve the AUP-ID, we propose a novel structure alignment module to refine the cross-view anchor correspondence. Meanwhile, the anchor graph construction and alignment are jointly optimized in our unified framework to enhance clustering quality. Through anchor graph construction instead of full graphs, the time and space complexity of the proposed SIMVC-SA is proven to be linearly correlated with the number of samples. Extensive experiments on seven incomplete benchmark datasets demonstrate the effectiveness and efficiency of our proposed method. Our code is publicly available at https://github.com/wy1019/SIMVC-SA.

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

  • Yi Bin
  • Haoxuan Li
  • Yahui Xu
  • Xing Xu
  • Yang Yang
  • Heng Tao Shen

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30k respectively. The code is available at https://github.com/LuminosityX/HAT.

Unbalanced Multi-view Deep Learning

  • Cai Xu
  • Zehui Li
  • Ziyu Guan
  • Wei Zhao
  • Xiangyu Song
  • Yue Wu
  • Jianxin Li

Most existing multi-view learning methods assume that the dimensions of different views are similar. In real-world applications, it is often the case that the dimension of a view may be extremely small compared with those of other views, resulting in an unbalanced multi-view learning problem. Previous methods for this problem have at least one of the following drawbacks: (1) neglecting the information of low-dimensional views; (2) constructing balanced view-specific inter-instance similarity graphs or employing decision-level fusion, which cannot well learn multi-level inter-view correlations and is limited to category-related tasks such as clustering. To eliminate all these drawbacks, we present an Unbalanced Multi-view Deep Learning (UMDL) method. Considering that a low-dimensional view usually contains multiple patterns, we construct an overcomplete dictionary with its atoms exceeding the dimension of the original data. We transform the original data into a combination of atoms and obtain a higher-dimensional representation. We propose a sparse multi-view fusion paradigm to explicitly capture the complementarity of multi-view data in a flexible manner. Moreover, we construct positive and negative examples via balanced similarity graphs and employ contrastive learning to train UMDL in a self-supervised manner. Experiments conducted on a toy example and 7 balanced/unbalanced datasets show that UMDL outperforms baseline methods and can be well applied to downstream classification and segmentation tasks. The code is released at https://github.com/xdmvteam/UMDL.
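
The lifting of a low-dimensional view through an overcomplete dictionary can be sketched with standard scikit-learn dictionary learning, as below; this only illustrates obtaining a higher-dimensional sparse code, not the authors' training procedure, and the data and sizes are made up.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

low_dim_view = np.random.rand(200, 8)           # 200 samples from an extremely low-dimensional view
dico = DictionaryLearning(n_components=32,      # more atoms than input dimensions: overcomplete
                          transform_algorithm="lasso_lars",
                          transform_alpha=0.1,
                          random_state=0)
codes = dico.fit_transform(low_dim_view)        # sparse 32-d representation of each sample
print(codes.shape)                              # (200, 32)
```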

Incomplete Multi-View Clustering with Regularized Hierarchical Graph

  • Shuping Zhao
  • Lunke Fei
  • Jie Wen
  • Bob Zhang
  • Pengyang Zhao

In this article, we propose a novel and effective incomplete multi-view clustering (IMVC) framework, referred to as incomplete multi-view clustering with regularized hierarchical graph (IMVC_RHG). Different from the existing graph learning-based IMVC methods, IMVC_RHG introduces a novel heterogeneous-graph learning and embedding strategy, which adopts the high-order structures between four tuples for each view, rather than a simple paired-sample intrinsic structure. Besides this, with the aid of the learned heterogeneous graphs, a between-view preserving strategy is designed to recover the incomplete graph for each view. Finally, a consensus representation for each sample is obtained with a co-regularization term for final clustering. As a result of integrating these three learning strategies, IMVC_RHG can be flexibly applied to different types of IMVC tasks. Compared with other state-of-the-art methods, the proposed IMVC_RHG achieves the best performance on real-world incomplete multi-view databases.

On Regularizing Multiple Clusterings for Ensemble Clustering by Graph Tensor Learning

  • Man-Sheng Chen
  • Jia-Qi Lin
  • Chang-Dong Wang
  • Wu-Dong Xi
  • Dong Huang

Ensemble clustering has shown its promising ability in fusing multiple base clusterings into a probably better and more robust clustering result. Typically, the co-association matrix based ensemble clustering methods attempt to integrate multiple connective matrices from base clusterings by weighted fusion to acquire a common graph representation. However, few of them are aware of the potential noise or corruption from the common representation by direct integration of different connective matrices with distinct cluster structures, and further consider the mutual information propagation between the input observations. In this paper, we propose a Graph Tensor Learning based Ensemble Clustering (GTLEC) method to refine multiple connective matrices by the substantial rank recovery and graph tensor learning. Within this framework, each input connective matrix is dexterously refined to approximate a graph structure by obeying the theoretical rank constraint with an adaptive weight coefficient. Further, we stack multiple refined connective matrices into a three-order tensor to extract their higher-order similarities via graph tensor learning, where the mutual information propagation across different graph matrices will also be promoted. Extensive experiments on several challenging datasets have confirmed the superiority of GTLEC compared with the state-of-the-art.

Event-guided Frame Interpolation and Dynamic Range Expansion of Single Rolling Shutter Image

  • Guixu Lin
  • Jin Han
  • Mingdeng Cao
  • Zhihang Zhong
  • Yinqiang Zheng

In the presence of abrupt motion, the pushbroom scanning mechanism of a rolling shutter (RS) camera tends to bring undesirable distortion, which is recently shown to be beneficial for high-speed frame interpolation. Although promising results have been reported by using multiple consecutive RS frames, to interpolate intermediate distortion-free frames from a single RS image is still an open question, due to the existence of multiple motions that can account for the recorded distortion. Another limitation of RS cameras in complex dynamic scenarios lies in the dynamic range, since traditional ways of multiple exposure for high dynamic range (HDR) imaging will fail due to alignment issues. To deal with these two challenges simultaneously, we propose to use an event camera for assistance, which has much faster temporal response and wider dynamic range. Since no learning data exists for this brand-new imaging setup, we first build a quad-axis imaging system to capture a realistic dataset called REG-HDR, with pairs of fully aligned RS image and its associated events, as well as their corresponding high-speed HDR GS images. We also propose a flow-based network for frame interpolation, compounded with an attention-based fusion network for dynamic range expansion. Experimental results have verified the effectiveness of our proposed algorithm and the superiority of using realistic data for this challenging dual-purpose enhancement task.

Learnable Graph Filter for Multi-view Clustering

  • Peng Zhou
  • Liang Du

Multi-view clustering is an important machine learning task for multimedia data. Recently, graph filter based multi-view clustering achieves promising performance and attracts much attention. However, the conventional graph filter based methods only use a pre-defined graph filter for each view and the used graph filters ignore the rich information among all views. Different from the conventional methods, in this paper, we aim to tackle a new problem, i.e., instead of using the pre-defined graph filters, how to construct an appropriate consensus graph filter by considering the information in all views. To achieve this, we propose a novel multi-view clustering method with graph filter learning. In our method, we learn an appropriate consensus graph filter from all views of data with multiple graph learning rather than directly pre-defining it. Then, we provide an iterative algorithm to obtain the consensus graph filter and analyze why it can lead to better clustering results. The extensive experiments on benchmark datasets demonstrate the effectiveness and superiority of the proposed method. The codes of this article are released at http://Doctor-Nobody.github.io/codes/MCLGF.zip.
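
For reference, the snippet below applies the kind of pre-defined low-pass graph filter, (I - 0.5L)^k, that conventional methods fix per view and that this work proposes to replace with a consensus filter learned from all views; the toy graph and filter order are arbitrary.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)      # toy adjacency matrix
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt    # symmetric normalized Laplacian

X = np.random.rand(4, 3)                       # node features for one view
H = np.eye(4) - 0.5 * L                        # a pre-defined low-pass graph filter
X_filtered = np.linalg.matrix_power(H, 2) @ X  # k = 2 rounds of smoothing
print(X_filtered.shape)
```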

Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data

  • Zhuang Qi
  • Lei Meng
  • Zitan Chen
  • Han Hu
  • Hui Lin
  • Xiangxu Meng

Federated Learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner, by leveraging the local models from different clients. Existing solutions focus on either regularizing the objective functions among clients or improving the aggregation mechanism for the improved model generalization capability. However, their performance is typically limited by the dataset biases, such as the heterogeneous data distributions and the missing classes. To address this issue, this paper presents a cross-silo prototypical calibration method (FedCSPC), which takes additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first employs the Data Prototypical Modeling (DPM) module to learn data patterns via clustering to aid calibration. Subsequently, the cross-silo prototypical calibration (CSPC) module develops an augmented contrastive learning method to improve the robustness of the calibration, which can effectively project cross-source features into a consistent space while maintaining clear decision boundaries. Moreover, the CSPC module's ease of implementation and plug-and-play characteristics make it even more remarkable. Experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study, and the results verified that FedCSPC is capable of learning the consistent features across different data sources of the same class under the guidance of calibrated model, which leads to better performance than the state-of-the-art methods. The source codes have been released at https://github.com/qizhuang-qz/FedCSPC.
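
A hedged sketch of the prototype exchange that such calibration builds on: each client computes class-mean prototypes and the server aggregates them per class. The clustering-based data modeling and the contrastive calibration of FedCSPC are omitted; all names and shapes are illustrative.

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Mean feature vector for every class present on a client."""
    protos = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return protos

# two toy clients with non-IID class coverage
f1, y1 = torch.randn(20, 16), torch.randint(0, 3, (20,))
f2, y2 = torch.randn(20, 16), torch.randint(2, 5, (20,))
client_protos = [class_prototypes(f1, y1, 5), class_prototypes(f2, y2, 5)]

# server side: average prototypes of the same class across clients
global_protos = {}
for protos in client_protos:
    for c, p in protos.items():
        global_protos.setdefault(c, []).append(p)
global_protos = {c: torch.stack(ps).mean(dim=0) for c, ps in global_protos.items()}
print(sorted(global_protos.keys()))
```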

CALM: An Enhanced Encoding and Confidence Evaluating Framework for Trustworthy Multi-view Learning

  • Hai Zhou
  • Zhe Xue
  • Ying Liu
  • Boang Li
  • Junping Du
  • Meiyu Liang
  • Yuankai Qi

Multi-view learning aims to leverage data acquired from multiple sources to achieve better performance compared to using a single view. However, the performance of multi-view learning can be negatively impacted by noisy or corrupted views in certain real-world situations. As a result, it is crucial to assess the confidence of predictions and obtain reliable learning outcomes. In this paper, we introduce CALM, an enhanced encoding and confidence evaluation framework for trustworthy multi-view classification. Our method comprises enhanced multi-view encoding, multi-view confidence-aware fusion, and multi-view classification regularization, enabling the simultaneous evaluation of prediction confidence and the production of trustworthy classifications. Enhanced multi-view encoding takes advantage of cross-view consistency and class diversity to improve the efficacy of the learned latent representation, facilitating more reliable classification results. Multi-view confidence-aware fusion utilizes a confidence-aware estimator to evaluate the confidence scores of classification outcomes. The final multi-view classification results are then derived through confidence-aware fusion. To achieve reliable and accurate confidence scores, multivariate Gaussian distributions are employed to model the prediction distribution. The advantage of CALM lies in its ability to evaluate the quality of each view, reducing the influence of low-quality views on the multi-view fusion process and ultimately leading to improved classification performance and confidence evaluation. Comprehensive experimental results demonstrate that our method outperforms other trusted multi-view learning methods in terms of effectiveness, reliability, and robustness.

Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding

  • Houlun Chen
  • Xin Wang
  • Xiaohan Lan
  • Hong Chen
  • Xuguang Duan
  • Jia Jia
  • Wenwu Zhu

Temporal Sentence Grounding aims to retrieve a video moment given a natural language query. Most existing literature merely focuses on visual information in videos without considering the naturally accompanied audio which may contain rich semantics. The few works considering audio simply regard it as an additional modality, overlooking that: i) it's non-trivial to explore consistency and complementarity between audio and visual; ii) such exploration requires handling different levels of information densities and noises in the two modalities. To tackle these challenges, we propose Adaptive Dual-branch Promoted Network (ADPN) to exploit such consistency and complementarity: i) we introduce a dual-branch pipeline capable of jointly training visual-only and audio-visual branches to simultaneously eliminate inter-modal interference; ii) we design Text-Guided Clues Miner (TGCM) to discover crucial locating clues via considering both consistency and complementarity during audio-visual interaction guided by text semantics; iii) we propose a novel curriculum-based denoising optimization strategy, where we adaptively evaluate sample difficulty as a measure of noise intensity in a self-aware fashion. Extensive experiments show the state-of-the-art performance of our method.

Quality-Aware RGBT Tracking via Supervised Reliability Learning and Weighted Residual Guidance

  • Lei Liu
  • Chenglong Li
  • Yun Xiao
  • Jin Tang

RGB and thermal infrared (TIR) data have different visual properties, which make their fusion essential for effective object tracking in diverse environments and scenes. Existing RGBT tracking methods commonly use attention mechanisms to generate reliability weights for multi-modal feature fusion. However, without explicit supervision, these weights may be unreliably estimated, especially in complex scenarios. To address this problem, we propose a novel Quality-Aware RGBT Tracker (QAT) for robust RGBT tracking. QAT learns reliable weights for each modality in a supervised manner and performs weighted residual guidance to extract and leverage useful features from both modalities. We address the issue of the lack of labels for reliability learning by designing an efficient three-branch network that generates reliable pseudo labels, and a simple binary classification scheme that estimates high-accuracy reliability weights, mitigating the effect of noisy pseudo labels. To propagate useful features between modalities while reducing the influence of noisy modal features on the migrated information, we design a weighted residual guidance module based on the estimated weights and residual connections. We evaluate our proposed QAT on five benchmark datasets, including GTOT, RGBT210, RGBT234, LasHeR, and VTUAV, and demonstrate its excellent performance compared to state-of-the-art methods. Experimental results show that QAT outperforms existing RGBT tracking methods in various challenging scenarios, demonstrating its efficacy in improving the reliability and accuracy of RGBT tracking.
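
The weighted residual guidance idea can be sketched as below, where estimated reliability weights scale a residual projection of the other modality before fusion; the shared projection layer and this exact weighting scheme are simplifying assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

d = 64
reliability_head = nn.Sequential(nn.Linear(2 * d, 2), nn.Softmax(dim=1))  # per-sample modality weights
guidance = nn.Linear(d, d)                                                # residual projection (shared here)

f_rgb, f_tir = torch.randn(4, d), torch.randn(4, d)
w = reliability_head(torch.cat([f_rgb, f_tir], dim=1))   # w[:, 0] for RGB, w[:, 1] for TIR
f_rgb_guided = f_rgb + w[:, 1:2] * guidance(f_tir)       # TIR guides RGB, scaled by TIR reliability
f_tir_guided = f_tir + w[:, 0:1] * guidance(f_rgb)       # and vice versa
fused = w[:, 0:1] * f_rgb_guided + w[:, 1:2] * f_tir_guided
print(fused.shape)
```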

Event-Enhanced Multi-Modal Spiking Neural Network for Dynamic Obstacle Avoidance

  • Yang Wang
  • Bo Dong
  • Yuji Zhang
  • Yunduo Zhou
  • Haiyang Mei
  • Ziqi Wei
  • Xin Yang

Autonomous obstacle avoidance is of vital importance for an intelligent agent such as a mobile robot to navigate in its environment. Existing state-of-the-art methods train a spiking neural network (SNN) with deep reinforcement learning (DRL) to achieve energy-efficient and fast inference speed in complex/unknown scenes. These methods typically assume that the environment is static while the obstacles in real-world scenes are often dynamic. The movement of obstacles increases the complexity of the environment and poses a great challenge to the existing methods. In this work, we approach robust dynamic obstacle avoidance twofold. First, we introduce the neuromorphic vision sensor (i.e., event camera) to provide motion cues complementary to the traditional Laser depth data for handling dynamic obstacles. Second, we develop a DRL-based event-enhanced multimodal spiking actor network (EEM-SAN) that extracts information from motion events data via unsupervised representation learning and fuses Laser and event camera data with learnable thresholding. Experiments demonstrate that our EEM-SAN outperforms state-of-the-art obstacle avoidance methods by a significant margin, especially for dynamic obstacle avoidance.

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

  • Yujun Ma
  • Benjia Zhou
  • Ruili Wang
  • Pichao Wang

RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.

M3R: Masked Token Mixup and Cross-Modal Reconstruction for Zero-Shot Learning

  • Peng Zhao
  • Qiangchang Wang
  • Yilong Yin

In zero-shot learning (ZSL), learned representation spaces are often biased toward seen classes, thus limiting the ability to predict previously unseen classes. In this paper, we propose Masked token Mixup and cross-Modal Reconstruction for zero-shot learning, termed M3R, which can significantly alleviate the bias toward seen classes. M3R mainly consists of Random Token Mixup (RTM), Unseen Class Detection (UCD), and Hard Cross-modal Reconstruction (HCR). Firstly, mappings without proper adaptation to unseen classes would cause the bias toward seen classes. To address this issue, the RTM is introduced to generate diverse unseen class agents, thereby broadening the representation space to cover unknown classes. It is applied at a randomly selected layer in the Vision Transformer, producing smooth low- and high-level representation space boundaries to cover rich attributes. Secondly, it should be noted that unseen class agents generated by the RTM may be mixed with seen class samples. To overcome this challenge, the UCD is designed to generate greater entropy values for unseen classes, thereby distinguishing seen classes from unseen classes. Thirdly, to further mitigate the bias toward seen classes and explore associations between semantics and visual images, the HCR is proposed, which can reconstruct masked pixels based on a few discriminative tokens and attribute embeddings. This approach enables models to develop a deep understanding of image contents and build powerful connections between semantic attributes and visual information. Both qualitative and quantitative results demonstrate the effectiveness and usefulness of our proposed M3R model.
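
The Random Token Mixup idea, mixing tokens between samples at one randomly chosen Transformer layer, can be sketched generically as below. This is an illustrative approximation under assumed shapes (batch, tokens, dim) and a hypothetical mixing ratio, not the paper's exact procedure.

```python
import torch

def random_token_mixup(tokens, lam=0.5, ratio=0.3):
    """Mix a random subset of token positions with the corresponding tokens
    of a shuffled partner sample in the batch. tokens: (B, N, D)."""
    perm = torch.randperm(tokens.size(0))            # partner sample per item
    n_mix = int(ratio * tokens.size(1))
    idx = torch.randperm(tokens.size(1))[:n_mix]     # token positions to mix
    mixed = tokens.clone()
    mixed[:, idx] = lam * tokens[:, idx] + (1 - lam) * tokens[perm][:, idx]
    return mixed

# Inside a ViT forward pass, this would typically be applied after one
# randomly selected block during training, e.g.:
#   if training and i == torch.randint(num_blocks, (1,)).item():
#       x = random_token_mixup(x)
```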

Redundancy-aware Transformer for Video Question Answering

  • Yicong Li
  • Xun Yang
  • An Zhang
  • Chun Feng
  • Xiang Wang
  • Tat-Seng Chua

This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Subsequently, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, thus making a pernicious impact on the answering. To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes object-level change in neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling strategy, which explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. Upon these advancements, we find that this Redundancy-aware transformer (RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
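
The out-of-neighboring message-passing scheme, attending only to frames beyond a temporal neighborhood, can be expressed as an attention mask. The sketch below is a generic illustration under an assumed neighborhood radius, not the RaFormer implementation.

```python
import torch

def out_of_neighborhood_mask(num_frames, radius=2):
    """Boolean mask (True = blocked) that forbids attention to frames within
    `radius` steps, so each frame attends only to itself and distant frames."""
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs()
    return (dist <= radius) & (dist > 0)

# Typical use before the softmax of a temporal self-attention layer:
#   scores = scores.masked_fill(out_of_neighborhood_mask(T), float("-inf"))
```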

Frequency-based Zero-Shot Learning with Phase Augmentation

  • Wanting Yin
  • Hongtao Xie
  • Lei Zhang
  • Jiannan Ge
  • Pandeng Li
  • Chuanbin Liu
  • Yongdong Zhang

Zero-Shot Learning (ZSL) aims to recognize images from seen and unseen classes by aligning visual and semantic knowledge (e.g., attribute descriptions). However, the fine-grained attributes in the RGB domain can be easily affected by background noise (e.g., the grey bird tail blending with the ground), making it difficult to effectively distinguish them. Analyzing the features in the frequency domain assists in better distinguishing the attributes since their patterns remain consistent across different images, unlike noise which may be more variable. Nevertheless, existing ZSL methods typically learn visual features directly from the RGB domain, which can impede the recognition of certain attributes. To overcome this limitation, we propose a novel ZSL method named Frequency-based Phase Augmentation (FPA) network, which learns an effective representation of the attributes in the frequency domain. Specifically, we introduce a Hybrid Phase Augmentation (HPA) module to transform visual features into the frequency domain and augment the phase component for better retention of semantic information of the attributes. The use of phase-augmented features enables FPA to capture more semantic knowledge that can be challenging to distinguish in the RGB domain, suppress noise, and highlight significant attributes. Our extensive experiments show that FPA achieves state-of-the-art performance across four standard datasets.
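
The core frequency-domain operation, keeping the amplitude while perturbing the phase, can be sketched with torch.fft. The noise model and its strength below are assumptions for illustration; the paper's HPA module is more elaborate.

```python
import torch

def phase_augment(feat, noise_std=0.1):
    """Transform features to the frequency domain, perturb the phase while
    keeping the amplitude, and transform back. feat: real, shape (..., H, W)."""
    spec = torch.fft.fft2(feat)
    amplitude, phase = spec.abs(), spec.angle()
    phase = phase + noise_std * torch.randn_like(phase)   # phase perturbation
    return torch.fft.ifft2(torch.polar(amplitude, phase)).real
```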

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

  • Shiyuan Yang
  • Xiaodong Chen
  • Jing Liao

Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods. Code is available at https://github.com/ysy31415/unipaint.

UniNeXt: Exploring A Unified Architecture for Vision Recognition

  • Fangjian Lin
  • Jianlong Yuan
  • Sitong Wu
  • Fan Wang
  • Zhibin Wang

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which they were first proposed, our UniNeXt architecture can steadily boost the performance of all the spatial token mixers and narrow the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further shows the importance of studying the general architecture of the vision backbone. Code is available at UniNeXt.

MCG-MNER: A Multi-Granularity Cross-Modality Generative Framework for Multimodal NER with Instruction

  • Junjie Wu
  • Chen Gong
  • Ziqiang Cao
  • Guohong Fu

Multimodal named entity recognition (MNER) is an essential vision-and-language task, which aims to locate named entities and classify them into predefined categories using visual scenarios. However, existing MNER studies often suffer from bias issues with fine-grained visual cue fusion, which may produce noisy coarse-grained visual cues for MNER. To accurately capture text-image relations and better refine multimodal representations, we propose a novel instruction-based Multi-granularity Cross-modality Generative framework for MNER, namely MCG-MNER. Concretely, we introduce multi-granularity relation propagation to infer visual clues relevant to the text. Then, we propose a method to inject multi-granularity visual information into cross-modality interaction and fusion to learn a unified representation. Finally, we integrate task-specific instructions and answers for MCG-MNER. Comprehensive experimental results on three benchmark datasets, namely Twitter2015, Twitter2017 and WikiDiverse, demonstrate the superiority of our proposed method over several state-of-the-art MNER methods. We will publicly release our codes for future studies.

U2Net: A General Framework with Spatial-Spectral-Integrated Double U-Net for Image Fusion

  • Siran Peng
  • Chenhao Guo
  • Xiao Wu
  • Liang-Jian Deng

In image fusion tasks, images obtained from different sources exhibit distinct properties. Consequently, treating them uniformly with a single-branch network can lead to inadequate feature extraction. Additionally, numerous works have demonstrated that multi-scale networks capture information more effectively than single-scale models in pixel-level computer vision problems. Considering these factors, we propose U2Net, a spatial-spectral-integrated double U-shape network for image fusion. The U2Net utilizes a spatial U-Net and a spectral U-Net to extract spatial details and spectral characteristics, which allows for the discriminative and hierarchical learning of features from diverse images. In contrast to most previous works that merely employ concatenation to merge spatial and spectral information, this paper introduces a novel spatial-spectral integration structure called S2Block, which combines feature maps from different sources in a logical and effective way. We conduct a series of experiments on two image fusion tasks, including remote sensing pansharpening and hyperspectral image super-resolution (HISR). The U2Net outperforms representative state-of-the-art (SOTA) approaches in both quantitative and qualitative evaluations, demonstrating the superiority of our method. The code is available at https://github.com/PSRben/U2Net.

Modal-aware Visual Prompting for Incomplete Multi-modal Brain Tumor Segmentation

  • Yansheng Qiu
  • Ziyuan Zhao
  • Hongdou Yao
  • Delin Chen
  • Zheng Wang

In the realm of medical imaging, distinct magnetic resonance imaging (MRI) modalities can provide complementary medical insights. However, it is not uncommon for one or more modalities to be absent due to image corruption, artifacts, acquisition protocols, allergies to contrast agents, or cost constraints, posing a significant challenge for perceiving the modality-absent state in incomplete modality segmentation. In this work, we introduce a novel incomplete multi-modal segmentation framework called Modal-aware Visual Prompting (MAVP), which draws inspiration from the widely used pre-training and prompt adjustment protocol employed in natural language processing (NLP). In contrast to previous prompts that typically use textual network embeddings, we utilize embeddings generated by a modality state classifier, which focuses on the missing modality states, as the prompts. Additionally, we integrate modality state prompts into both the extraction stage of each modality and the modality fusion stage to facilitate intra/inter-modal adaptation. Our approach achieves state-of-the-art performance in various modality-incomplete scenarios compared to incomplete modality-specific solutions.

Where to Find Fascinating Inter-Graph Supervision: Imbalanced Graph Classification with Kernel Information Bottleneck

  • Hui Tang
  • Xun Liang

Imbalanced graph classification is ubiquitous yet challenging in many real-world applications. Existing methods typically follow the same convention of treating graph instances as discrete individuals and exploit graph neural networks (GNNs) to predict graph labels. Despite their success, they only propagate intra-graph information within a single graph while disregarding extra supervision globally derived from other graphs. In fact, inter-graph learning plays a vital role in providing more supervision for minority graphs. However, it is difficult to derive reliable inter-graph supervision because redundant information from majority graphs obscures the representations of minority graphs during the propagation process. To tackle this issue, we propose a novel method that integrates the restricted random walk kernel with the global graph information bottleneck (GIB) to improve imbalanced graph classification. Specifically, the restricted random walk kernel is proposed to perform the inter-graph learning with learnable graph filters and produce kernel outputs. To ensure that the redundant information of majority graphs does not plague the kernel outputs, we model the entire kernel learning as a Markov decision process and employ the global GIB to optimize it. Extensive experiments on real-world graph benchmark datasets verify the competitive performance of the proposed method.

pmBQA: Projection-based Blind Point Cloud Quality Assessment via Multimodal Learning

  • Wuyuan Xie
  • Kaimin Wang
  • Yakun Ju
  • Miaohui Wang

With the increasing communication and storage of point cloud data, there is an urgent need for an effective objective method to measure the quality before and after processing. To address this difficulty, we propose a projection-based blind quality indicator via multimodal learning for point cloud data, which can perceive both geometric distortion and texture distortion by using four homogeneous modalities (i.e., texture, normal, depth and roughness). To fully exploit the multimodal information, we further develop a deformable convolution-based alignment module and a graph-based feature fusion module, and investigate a graph node attention-based evaluation method to forecast the quality score. Extensive experimental results on three benchmark databases show that our method achieves more accurate evaluation performance in comparison with 12 competitive methods.

Dropping Pathways Towards Deep Multi-View Graph Subspace Clustering Networks

  • Zihao Zhang
  • Qianqian Wang
  • Zhiqiang Tao
  • Quanxue Gao
  • Wei Feng

Multi-view graph clustering aims to leverage different views to obtain consistent information and improve clustering performance by sharing the graph structure. Existing multi-view graph clustering algorithms generally adopt a single-pathway network reconstruction and consistent feature extraction, building on top of auto-encoders and graph convolutional networks (GCN). Despite their promising results, these single-pathway methods may ignore the significant complementary information between different layers and the rich multi-level context inside. On the other hand, GCN usually employs a shallow network structure (2-3 layers) due to the over-smoothing with the increase of network depth, while few multi-view graph clustering methods explore the performance of deep networks. In this work, we propose a novel Dropping Pathways strategy toward building a deep Multi-view Graph Subspace Clustering network, namely DPMGSC, to fully exploit the deep and multi-level graph network representations. The proposed method implements a multi-pathway self-expressive network to capture pairwise affinities of graph nodes among multiple views. Moreover, we empirically study the impact of a series of dropping methods on deep multi-pathway networks. Extensive experiments demonstrate the effectiveness of the proposed DPMGSC compared with its deep counterpart and state-of-the-art methods.

Multi-view Graph Clustering via Efficient Global-Local Spectral Embedding Fusion

  • Penglei Wang
  • Danyang Wu
  • Rong Wang
  • Feiping Nie

With the proliferation of multimedia applications, data is frequently derived from multiple sources, leading to the accelerated advancement of multi-view clustering (MVC) methods. In this paper, we propose a novel MVC method, termed GLSEF, to handle the inconsistency existing in multiple spectral embeddings. To this end, GLSEF contains a two-level learning mechanism. Specifically, on the global level, GLSEF considers the diversity of features and selectively assigns smooth weights to the more discriminative features that are conducive to clustering. On the local level, GLSEF resorts to the Grassmann manifold to maintain spatial and topological information and local structure in each view, thereby enhancing its suitability and accuracy for clustering. Moreover, unlike most previous methods that learn a low-dimensional embedding and perform the k-means algorithm to obtain the final cluster labels, GLSEF directly acquires the discrete indicator matrix to prevent potential information loss during post-processing. To address the optimization involved in GLSEF, we present an efficient alternating optimization algorithm accompanied by convergence and time complexity analyses. Extensive empirical results on nine real-world datasets demonstrate the effectiveness and efficiency of GLSEF compared to existing state-of-the-art MVC methods.

Debunking Free Fusion Myth: Online Multi-view Anomaly Detection with Disentangled Product-of-Experts Modeling

  • Hao Wang
  • Zhi-Qi Cheng
  • Jingdong Sun
  • Xin Yang
  • Xiao Wu
  • Hongyang Chen
  • Yan Yang

Multi-view or even multi-modal data is appealing yet challenging for real-world applications. Detecting anomalies in multi-view data is a prominent recent research topic. However, most of the existing methods 1) are only suitable for two views or type-specific anomalies, 2) suffer from the issue of fusion disentanglement, and 3) do not support online detection after model deployment. To address these challenges, our main ideas in this paper are three-fold: multi-view learning, disentangled representation learning, and generative modeling. To this end, we propose dPoE, a novel multi-view variational autoencoder model that involves (1) a Product-of-Experts (PoE) layer in tackling multi-view data, (2) a Total Correlation (TC) discriminator in disentangling view-common and view-specific representations, and (3) a joint loss function in wrapping up all components. In addition, we devise theoretical information bounds to control both view-common and view-specific representations. Extensive experiments on six real-world datasets demonstrate that the proposed dPoE outperforms baselines markedly.
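
A Product-of-Experts layer over per-view Gaussian posteriors has a standard closed form: the joint precision is the sum of per-view precisions and the joint mean is their precision-weighted average. A minimal sketch follows (shapes assumed; many PoE variants also include a standard-normal prior expert, omitted here).

```python
import torch

def product_of_experts(mus, logvars):
    """Combine per-view Gaussians N(mu_v, var_v) into one joint Gaussian.
    mus, logvars: tensors of shape (num_views, batch, latent_dim)."""
    precision = torch.exp(-logvars)                  # 1 / var_v per view
    joint_var = 1.0 / precision.sum(dim=0)           # joint variance
    joint_mu = joint_var * (precision * mus).sum(dim=0)
    return joint_mu, torch.log(joint_var)
```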

Domain-irrelevant Feature Learning for Generalizable Pan-sharpening

  • Yunlong Lin
  • Zhenqi Fu
  • Ge Meng
  • Yingying Wang
  • Yuhang Dong
  • Linyu Fan
  • Hedeng Yu
  • Xinghao Ding

Pan-sharpening aims to spatially enhance the low-resolution multispectral image (LRMS) by transferring high-frequency details from a panchromatic image (PAN) while preserving the spectral characteristics of LRMS. Previous works mainly focus on how to learn a high-resolution multispectral image (HRMS) under the i.i.d. assumption. However, the distribution of training and testing data often encounters significant shifts across different satellites. To this end, this paper proposes a generalizable pan-sharpening network via domain-irrelevant feature learning. On the one hand, a structural preservation module (STP) is designed to fuse high-frequency information of PAN and LRMS. Our STP operates in the gradient domain because it consists of structure and texture details that generalize well across different satellites. On the other hand, to avoid spectral distortion while promoting the generalization ability, a spectral preservation module (SPP) is developed. The key design of SPP is to learn a phase fusion network for PAN and LRMS. The amplitude of LRMS, which contains 'satellite style' information, is directly injected at different fusion stages. Extensive experiments have demonstrated the effectiveness of our method against state-of-the-art methods in both single-satellite and cross-satellite scenarios. Code is available at: https://github.com/LYL1015/DIRFL.

Depth-aided Camouflaged Object Detection

  • Qingwei Wang
  • Jinyu Yang
  • Xiaosheng Yu
  • Fangyi Wang
  • Peng Chen
  • Feng Zheng

Camouflaged Object Detection (COD) aims to identify and segment objects that blend into their surroundings. Since the color and texture of camouflaged objects are extremely similar to the surrounding environment, it is highly challenging for vision models to precisely detect them. Inspired by research on biology and evolution, we introduce depth information as an additional cue to help break camouflage, as it can provide spatial information and texture-free separation of foreground and background. To mine clues of camouflaged objects in both RGB and depth modalities, we propose Depth-aided Camouflaged Object Detection (DaCOD), which involves two key components. We first propose the Multi-modal Collaborative Learning (MCL) module, which aims to collaboratively learn deep features from both RGB and depth channels via a hybrid backbone. Then, we propose a novel Cross-modal Asymmetric Fusion (CAF) strategy, which asymmetrically fuses RGB and depth information for complementary depth feature enhancement to produce accurate predictions. We conduct extensive experiments with the proposed DaCOD on three widely used, challenging COD benchmark datasets, where DaCOD outperforms the current state-of-the-art methods by a large margin. All resources are available at https://github.com/qingwei-wang/DaCOD.

SemanticRT: A Large-Scale Dataset and Method for Robust Semantic Segmentation in Multispectral Images

  • Wei Ji
  • Jingjing Li
  • Cheng Bian
  • Zhicheng Zhang
  • Li Cheng

Growing interests in multispectral semantic segmentation (MSS) have been witnessed in recent years, thanks to the unique advantages of combining RGB and thermal infrared images to tackle challenging scenarios with adverse conditions. However, unlike traditional RGB-only semantic segmentation, the lack of a large-scale MSS dataset has become a hindrance to the progress of this field. To address this issue, we introduce a SemanticRT dataset - the largest MSS dataset to date, comprising 11,371 high-quality, pixel-level annotated RGB-thermal image pairs. It is 7 times larger than the existing MFNet dataset, and covers a wide variety of challenging scenarios in adverse lighting conditions such as low-light and pitch black. Further, a novel Explicit Complement Modeling (ECM) framework is developed to extract modality-specific information, which is propagated through a robust cross-modal feature encoding and fusion process. Extensive experiments demonstrate the advantages of our approach and dataset over the existing counterparts. Our new dataset may also facilitate further development and evaluation of existing and new MSS algorithms.

MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid

  • Zhuo Chen
  • Jiaoyan Chen
  • Wen Zhang
  • Lingbing Guo
  • Yin Fang
  • Yufeng Huang
  • Yichi Zhang
  • Yuxia Geng
  • Jeff Z. Pan
  • Wenting Song
  • Huajun Chen

Multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs (KGs) whose entities are associated with relevant images. However, current MMEA algorithms rely on KG-level modality fusion strategies for multi-modal entity representation, which ignore the variations in modality preferences of different entities, thus compromising robustness against noise in modalities such as blurry images and relations. This paper introduces MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, which dynamically predicts the mutual correlation coefficients among modalities for more fine-grained entity-level modality fusion and alignment. Experimental results demonstrate that our model not only achieves SOTA performance in multiple training scenarios, including supervised, unsupervised, iterative, and low-resource settings, but also has a limited number of parameters, efficient runtime, and interpretability. Our code is available at https://github.com/zjukg/MEAformer.

Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing

  • Jiayi Zhang
  • Weixin Li

The weakly supervised audio-visual video parsing (AVVP) task aims to parse a video into a set of modality-wise events (i.e., audible, visible, or both), recognize categories of these events, and localize their temporal boundaries. Given the prevalence of audio-visual synchronous and asynchronous contents in multi-modal videos, it is crucial to capture and integrate the contextual events occurring at different moments and temporal scales. Although some researchers have made preliminary attempts at modeling event semantics with various temporal lengths, they mostly only perform a late fusion of multi-scale features across modalities. A comprehensive cross-modal and multi-scale temporal fusion strategy remains largely unexplored in the literature. To address this gap, we propose a novel framework named Audio-Visual Fusion Architecture Search (AVFAS) that can automatically find the optimal multi-scale temporal fusion strategy within and between modalities. Our framework generates a set of audio and visual features with distinct temporal scales and employs three modality-wise modules to search multi-scale feature selection and fusion strategies, jointly modeling modality-specific discriminative information. Furthermore, to enhance the alignment of audio-visual asynchrony, we introduce a Position- and Length-Adaptive Temporal Attention (PLATA) mechanism for cross-modal feature fusion. Extensive quantitative and qualitative experimental results demonstrate the effectiveness and efficiency of our framework.

Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning

  • Jiaqi Li
  • Guilin Qi
  • Chuanyi Zhang
  • Yongrui Chen
  • Yiming Tan
  • Chenlong Xia
  • Ye Tian

Multimodal movie genre classification has always been regarded as a demanding multi-label classification task due to the diversity of multimodal data such as posters, plot summaries, trailers and metadata. Although existing works have made great progress in modeling and combining each modality, they still face three issues: 1) unutilized group relations in metadata, 2) unreliable attention allocation, and 3) indiscriminative fused features. Given that the knowledge graph has been proven to contain rich information, we present a novel framework that exploits the knowledge graph from various perspectives to address the above problems. As a preparation, the metadata is processed into a domain knowledge graph. A translation-based model for knowledge graph embedding is adopted to capture the relations between entities. Firstly, we retrieve the relevant embedding from the knowledge graph by utilizing group relations in metadata and then integrate it with other modalities. Next, we introduce an Attention Teacher module for reliable attention allocation based on self-supervised learning. It learns the distribution of the knowledge graph and produces rational attention weights. Finally, a Genre-Centroid Anchored Contrastive Learning module is proposed to strengthen the discriminative ability of fused features. The embedding space of anchors is initialized from the genre entities in the knowledge graph. To verify the effectiveness of our framework, we collect a larger and more challenging dataset named MM-IMDb 2.0 compared with the MM-IMDb dataset. The experimental results on two datasets demonstrate that our model is superior to the state-of-the-art methods. Our code and dataset are available at https://github.com/aoluming/IDKG.git.

Multi-scale Spatial-Spectral Attention Guided Fusion Network for Pansharpening

  • Yong Yang
  • Mengzhen Li
  • Shuying Huang
  • Hangyuan Lu
  • Wei Tu
  • Weiguo Wan

Pansharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multispectral (LR-MS) images to generate high-resolution multispectral (HR-MS) images. Most deep learning-based pansharpening methods do not consider the inconsistency between the PAN and LR-MS images and use simple concatenation to fuse the source images, which may cause spectral and spatial distortion in the fused results. To address this problem, a multi-scale spatial-spectral attention guided fusion network for pansharpening is proposed. First, the spatial features from the PAN image and spectral features from the LR-MS image are independently extracted to obtain the shallow features. Then, a spatial-spectral attention feature fusion module (SAFFM) is constructed to guide the reconstruction of spatial-spectral features by generating a guidance map, achieving the fusion of reconstructed features at different scales. In SAFFM, the guidance map is designed to ensure the spatial-spectral consistency of the reconstructed features. Finally, considering the differences between features at multiple scales, a multi-level feature integration scheme is proposed to progressively fuse the multi-scale features from different SAFFMs. Extensive experiments validate the effectiveness of the proposed network against other state-of-the-art (SOTA) pansharpening methods in both quantitative and qualitative assessments. The source code will be released at https://github.com/MELiMZ/ssaff.

Modality Profile - A New Critical Aspect to be Considered When Generating RGB-D Salient Object Detection Training Set

  • Xuehao Wang
  • Shuai Li
  • Chenglizhao Chen
  • Aimin Hao
  • Hong Qin

It is widely acknowledged that selecting appropriate training data is crucial for obtaining good results in real-world testing, more so than utilizing complex network architectures. However, in the field of RGB-D SOD research, researchers have primarily focused on enhancing network architectures and have given less consideration to the choice of training and testing datasets, which may not translate well in practical applications. This paper aims to address an existing issue - how can we automatically generate a data-driven RGB-D SOD training dataset? We propose that in addition to scene similarity, the concept of "modality profile'' should be taken into account. The term "modality profile'' refers to the complementary status of modalities within a given dataset. A training dataset with a modality profile similar to the test dataset can significantly improve performance. To address this, we present a viable solution for automatically generating a training dataset with any desired modality profile in a weakly supervised manner. Our method also provides high-quality pseudo-GTs for all RGB-D images obtained from the web, making it suitable for training RGB-D SOD models. Extensive quantitative evaluations demonstrate the significance of the proposed "modality profile'' and confirm the superiority of the newly constructed training set guided by our "modality profile''. All codes, datasets, and results are available at this link.

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

  • Meng Liu
  • Ke Liang
  • Dayu Hu
  • Hao Yu
  • Yue Liu
  • Lingyuan Meng
  • Wenxuan Tu
  • Sihang Zhou
  • Xinwang Liu

Audiovisual data is everywhere in this digital age, which raises higher requirements for the deep learning models developed on it. Handling the information in multi-modal data well is the key to a better audiovisual model. We observe that such audiovisual data naturally has temporal attributes, such as the time information for each frame in the video. More concretely, the data is inherently multi-modal according to both audio and visual cues, which proceed in a strict chronological order. This indicates that temporal information is important in multi-modal acoustic event modeling, both intra- and inter-modally. However, existing methods deal with each modality's features independently and simply fuse them together, which neglects the mining of temporal relations and thus leads to sub-optimal performance. With this motivation, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, which models such temporal information via graph learning techniques. In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments. Each segment can be considered as a node, and the temporal relationships between nodes can be considered as timestamps on their edges. In this way, we can smoothly capture the dynamic information within and across modalities. Several experiments are conducted to demonstrate that TMac outperforms other SOTA models. Our code is available at https://github.com/MGitHubL/TMac.

FOLT: Fast Multiple Object Tracking from UAV-captured Videos Based on Optical Flow

  • Mufeng Yao
  • Jiaqi Wang
  • Jinlong Peng
  • Mingmin Chi
  • Chao Liu

Multiple object tracking (MOT) has been successfully investigated in computer vision. However, MOT for the videos captured by unmanned aerial vehicles (UAV) is still challenging due to small object size, blurred object appearance, and very large and/or irregular motion in both ground objects and UAV platforms. In this paper, we propose FOLT to mitigate these problems and reach fast and accurate MOT in UAV view. Aiming at speed-accuracy trade-off, FOLT adopts a modern detector and light-weight optical flow extractor to extract object detection features and motion features at a minimum cost. Given the extracted flow, the flow-guided feature augmentation is designed to augment the object detection feature based on its optical flow, which improves the detection of small objects. Then the flow-guided motion prediction is also proposed to predict the object's position in the next frame, which improves the tracking performance of objects with very large displacements between adjacent frames. Finally, the tracker matches the detected objects and predicted objects using a spatially matching scheme to generate tracks for every object. Experiments on Visdrone and UAVDT datasets show that our proposed model can successfully track small objects with large and irregular motion and outperform existing state-of-the-art methods in UAV-MOT tasks.
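
The flow-guided motion prediction idea, shifting a detection by the average optical flow inside its box, can be illustrated with the simplified sketch below. FOLT's actual prediction head is learned; the box and flow formats here are assumptions for the example.

```python
import numpy as np

def flow_guided_box_prediction(box, flow):
    """Predict a box's position in the next frame by shifting it with the mean
    optical flow inside the box. box = (x1, y1, x2, y2); flow: (H, W, 2)."""
    x1, y1, x2, y2 = map(int, box)
    region = flow[y1:y2, x1:x2]                      # flow vectors inside the box
    dx, dy = region[..., 0].mean(), region[..., 1].mean()
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
```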

ScribbleVC: Scribble-supervised Medical Image Segmentation with Vision-Class Embedding

  • Zihan Li
  • Yuan Zheng
  • Xiangde Luo
  • Dandan Shan
  • Qingqi Hong

Medical image segmentation plays a critical role in clinical decision-making, treatment planning, and disease monitoring. However, accurate segmentation of medical images is challenging due to several factors, such as the lack of high-quality annotation, imaging noise, and anatomical differences across patients. In addition, there is still a considerable gap in performance between the existing label-efficient methods and fully-supervised methods. To address the above challenges, we propose ScribbleVC, a novel framework for scribble-supervised medical image segmentation that leverages vision and class embeddings via the multimodal information enhancement mechanism. In addition, ScribbleVC uniformly utilizes the CNN features and Transformer features to achieve better visual feature extraction. The proposed method combines a scribble-based approach with a segmentation network and a class-embedding module to produce accurate segmentation masks. We evaluate ScribbleVC on three benchmark datasets and compare it with state-of-the-art methods. The experimental results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency. The datasets and code are released on GitHub.

Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation

  • Jiaqing Fan
  • Tiankang Su
  • Kaihua Zhang
  • Bo Liu
  • Qingshan Liu

Spatial-temporal structural details of targets in video (e.g., edges and textures varying over time) are essential to accurate Unsupervised Video Object Segmentation (UVOS). The vanilla multi-head self-attention in Transformer-based UVOS methods usually concentrates on learning general low-frequency information (e.g., illumination, color) while neglecting high-frequency texture details, leading to unsatisfying segmentation results. To address this issue, this paper presents a Temporally efficient Gabor Transformer (TGFormer) for UVOS. The TGFormer jointly models the spatial dependencies and the temporal coherence within and across frames, which can fully capture the rich structural details for accurate UVOS. Concretely, we first propose an effective learnable Gabor filtering Transformer to mine the structural texture details of the object for accurate UVOS. Then, to handle redundant neighboring historical information adaptively, we present an efficient dynamic neighboring-frame selection module that automatically chooses the useful temporal information, which simultaneously alleviates the impact of blurry frames and reduces the computation burden. Finally, we make the UVOS model a fully Transformer-based architecture, aggregating information from the spatial, Gabor, and temporal domains and yielding a strong representation with rich structural details. Extensive experiments on five mainstream UVOS benchmarks (DAVIS2016, FBMS, DAVSOD, ViSal, and MCL) demonstrate the superiority of the presented solution over state-of-the-art methods.
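
A Gabor filtering Transformer builds on standard 2D Gabor kernels, a sinusoidal carrier under a Gaussian envelope. The sketch below constructs one such kernel; in a learnable variant along the lines the abstract describes, the parameters would be trained end to end and the kernels applied with F.conv2d. All parameter values here are illustrative.

```python
import math
import torch

def gabor_kernel(size=7, sigma=2.0, theta=0.0, lambd=4.0, psi=0.0, gamma=0.5):
    """Build a single 2D Gabor kernel of shape (size, size)."""
    half = size // 2
    y, x = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                          torch.arange(-half, half + 1, dtype=torch.float32),
                          indexing="ij")
    # Rotate coordinates by the orientation theta.
    x_t = x * math.cos(theta) + y * math.sin(theta)
    y_t = -x * math.sin(theta) + y * math.cos(theta)
    envelope = torch.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * x_t / lambd + psi)
    return envelope * carrier
```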

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

  • Haowei Wang
  • Jiji Tang
  • Jiayi Ji
  • Xiaoshuai Sun
  • Rongsheng Zhang
  • Yiwei Ma
  • Minda Zhao
  • Lincheng Li
  • Zeng Zhao
  • Tangjie Lv
  • Rongrong Ji

In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and achieves an improvement of up to 6.5% accuracy on PointNet++ in top-1 accuracy for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/JM3D.

Hierarchical Visual Attribute Learning in the Wild

  • Kongming Liang
  • Xinran Wang
  • Haiwen Zhang
  • Zhanyu Ma
  • Jun Guo

Observing objects' attributes at different levels of detail is a fundamental aspect of how humans perceive and understand the world around them. Existing studies have focused on attribute prediction in a flat manner, overlooking the underlying attribute hierarchy, e.g., navy blue is a subcategory of blue. In recent years, large language models (LLMs), e.g., ChatGPT, have emerged with the ability to perform an extensive range of natural language processing tasks such as text generation and classification. The factual knowledge learned by LLMs can help us build hierarchical relations of visual attributes in the wild. Based on that, we propose a model called the object-specific attribute relation net, which takes advantage of three types of relations among attributes - positive, negative, and hierarchical - to better facilitate attribute recognition in images. Guided by the extracted hierarchical relations, our model can predict attributes from coarse to fine. Additionally, we introduce several evaluation metrics for the attribute hierarchy to comprehensively assess the model's ability to comprehend hierarchical relations. Our extensive experiments demonstrate that the proposed hierarchical annotation improves the model's understanding of hierarchical relations among attributes, and that the object-specific attribute relation net recognizes visual attributes more accurately.

Hierarchical Semantic Enhancement Network for Multimodal Fake News Detection

  • Qiang Zhang
  • Jiawei Liu
  • Fanrui Zhang
  • Jingyi Xie
  • Zheng-Jun Zha

The explosion of multimodal fake news content on social media has sparked widespread concern. Existing multimodal fake news detection methods have made significant contributions to the development of this field, but fail to adequately exploit the potential semantic information of images and ignore the noise embedded in news entities, which severely limits the performance of the models. In this paper, we propose a novel Hierarchical Semantic Enhancement Network (HSEN) for multimodal fake news detection by learning text-related image semantic and precise news high-order knowledge semantic information. Specifically, to complement the image semantic information, HSEN utilizes textual entities as the prompt subject vocabulary and applies reinforcement learning to discover the optimal prompt format for generating image captions specific to the corresponding textual entities, which contain multi-level cross-modal correlation information. Moreover, HSEN extracts visual and textual entities from image and text, and identifies additional visual entities from image captions to extend image semantic knowledge. Based on that, HSEN exploits an adaptive hard attention mechanism to automatically select strongly related news entities and remove irrelevant noise entities to obtain precise high-order knowledge semantic information, while generating attention mask for guiding cross-modal knowledge interaction. Extensive experiments show that our method outperforms state-of-the-art methods.

Towards Balanced Active Learning for Multimodal Classification

  • Meng Shen
  • Yizheng Huang
  • Jianxiong Yin
  • Heqing Zou
  • Deepu Rajan
  • Simon See

Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, a novel approach is proposed to achieve more fair data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification.

Learning Event-Specific Localization Preferences for Audio-Visual Event Localization

  • Shiping Ge
  • Zhiwei Jiang
  • Yafeng Yin
  • Cong Wang
  • Zifeng Cheng
  • Qing Gu

Audio-Visual Event Localization (AVEL) aims to locate events that are both visible and audible in a video. Existing AVEL methods primarily focus on learning generic localization patterns that are applicable to all events. However, events often exhibit modality biases, such as visual-dominated, audio-dominated, or modality-balanced, which can lead to different localization preferences. These preferences may be overlooked by existing methods, resulting in unsatisfactory localization performance. To address this issue, this paper proposes a novel event-aware localization paradigm, which first identifies the event category and then leverages localization preferences specific to that event for improved event localization. To achieve this, we introduce a memory-assisted metric learning framework, which utilizes historic segments as anchors to adjust the unified representation space for both event classification and event localization. To provide sufficient information for this metric learning, we design a spatial-temporal audio-visual fusion encoder to capture the spatial and temporal interaction between audio and visual modalities. Extensive experiments on the public AVE dataset in both fully-supervised and weakly-supervised settings demonstrate the effectiveness of our approach. Code will be released at https://github.com/ShipingGe/AVEL.

Object Segmentation by Mining Cross-Modal Semantics

  • Zongwei Wu
  • Jingjing Wang
  • Zhuyun Zhou
  • Zhaochong An
  • Qiuping Jiang
  • Cédric Demonceaux
  • Guolei Sun
  • Radu Timofte

Multi-sensor clues have shown promise for object segmentation, but inherent noise in each sensor, as well as the calibration error in practice, may bias the segmentation accuracy. In this paper, we propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features, with the aim of controlling the modal contribution based on relative entropy. We explore semantics among the multimodal inputs in two aspects: the modality-shared consistency and the modality-specific variation. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision. On the one hand, the AF block explicitly dissociates the shared and specific representation and learns to weight the modal contribution by adjusting the proportion, region, and pattern, depending upon the quality. On the other hand, our CFD initially decodes the shared feature and then refines the output through specificity-aware querying. Further, we enforce semantic consistency across the decoding layers to enable interaction across network hierarchies, improving feature discriminability. Exhaustive comparison on eleven datasets with depth or thermal clues, and on two challenging tasks, namely salient and camouflage object segmentation, validate our effectiveness in terms of both performance and robustness. The source code is publicly available at https://github.com/Zongwei97/XMSNet.

PSNEA: Pseudo-Siamese Network for Entity Alignment between Multi-modal Knowledge Graphs

  • Wenxin Ni
  • Qianqian Xu
  • Yangbangyan Jiang
  • Zongsheng Cao
  • Xiaochun Cao
  • Qingming Huang

Multi-modal entity alignment aims to identify entities that refer to the same real-world concept across a plethora of multi-modal knowledge graphs (MMKGs). Most existing methods focus on reducing the embedding differences between multiple modalities while neglecting the following challenges: 1) they cannot handle the heterogeneity across graphs, and 2) they suffer from the scarcity of pre-aligned data (a.k.a. initial seeds). To tackle these issues, we propose a Pseudo-Siamese Network for multi-modal Entity Alignment (PSNEA). It consists of two modules to extract various information and generate holistic embeddings. Specifically, the first module, PSN, is designed with two parallel branches to learn the representations for different MMKGs, thus effectively bridging the graph heterogeneity. On top of this, we introduce an Incremental Alignment Pool (IAP) to alleviate the scarcity of initial seeds by labeling likely alignments. IAP avoids error-prone labeling through data swapping and sample re-weighting strategies. To the best of our knowledge, PSNEA is the first model that tackles graph heterogeneity and the scarcity of initial seeds in one unified framework. Extensive experiments demonstrate that our model achieves the best performance on both cross-lingual and cross-graph datasets. The source code is available at https://github.com/idrfer/psn4ea.

Federated Deep Multi-View Clustering with Global Self-Supervision

  • Xinyue Chen
  • Jie Xu
  • Yazhou Ren
  • Xiaorong Pu
  • Ce Zhu
  • Xiaofeng Zhu
  • Zhifeng Hao
  • Lifang He

Federated multi-view clustering has the potential to learn a global clustering model from data distributed across multiple devices. In this setting, label information is unknown and data privacy must be preserved, leading to two major challenges. First, views on different clients often have feature heterogeneity, and mining their complementary cluster information is not trivial. Second, the storage and usage of data from multiple clients in a distributed environment can lead to incompleteness of multi-view data. To address these challenges, we propose a novel federated deep multi-view clustering method that can mine complementary cluster structures from multiple clients, while dealing with data incompleteness and privacy concerns. Specifically, in the server environment, we propose sample alignment and data extension techniques to explore the complementary cluster structures of multiple views. The server then distributes global prototypes and global pseudo-labels to each client as global self-supervised information. In the client environment, multiple clients use the global self-supervised information and deep autoencoders to learn view-specific cluster assignments and embedded features, which are then uploaded to the server for refining the global self-supervised information. Finally, the results of our extensive experiments demonstrate that our proposed method exhibits superior performance in addressing the challenges of incomplete multi-view data in distributed environments.

Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

  • Sung Jin Um
  • Dongjin Kim
  • Jung Uk Kim

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio in an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL.

Hypergraph-Enhanced Hashing for Unsupervised Cross-Modal Retrieval via Robust Similarity Guidance

  • Fangming Zhong
  • Chenglong Chu
  • Zijie Zhu
  • Zhikui Chen

Unsupervised cross-modal hashing retrieval across the image and text modalities is a challenging task because of the suboptimality of similarity guidance, i.e., the joint similarity matrix constructed by existing methods does not provide sufficiently clear guidance. How to construct a more robust similarity matrix is the key to solving this problem. Graph-based unsupervised cross-modal retrieval methods perform well in mining the semantic information of input samples, but graph hashing based on traditional affinity graphs cannot effectively capture the high-order semantic information of input samples. To overcome the aforementioned limitations, this paper presents a novel hypergraph-based approach for unsupervised cross-modal retrieval that differs from previous works in two significant ways. Firstly, to address the redundant information that is ubiquitous in current methods, this paper introduces a robust similarity matrix construction method. Secondly, we propose a novel hypergraph-enhanced module that produces embedding vectors for the input data via hypergraph convolution and an attention mechanism, capturing important high-order semantics. Our approach is evaluated on the NUS-WIDE and MIRFlickr datasets, and yields state-of-the-art performance for unsupervised cross-modal retrieval.
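
Hypergraph convolution, referenced in the hypergraph-enhanced module, has a widely used dense formulation X' = sigma(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta). A small sketch of that general operator follows (dense matrices and shapes assumed for clarity; this is not the paper's specific module).

```python
import torch

def hypergraph_convolution(X, H, w_e, Theta):
    """One dense hypergraph convolution layer.
    X: (N, d) node features, H: (N, E) incidence matrix,
    w_e: (E,) hyperedge weights, Theta: (d, d_out) learnable projection."""
    Dv = (H * w_e).sum(dim=1)                  # node degrees, shape (N,)
    De = H.sum(dim=0)                          # hyperedge degrees, shape (E,)
    Dv_inv_sqrt = torch.diag(Dv.clamp(min=1e-12) ** -0.5)
    De_inv = torch.diag(1.0 / De.clamp(min=1e-12))
    W = torch.diag(w_e)
    A = Dv_inv_sqrt @ H @ W @ De_inv @ H.t() @ Dv_inv_sqrt
    return torch.relu(A @ X @ Theta)
```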

Reinforcement Graph Clustering with Unknown Cluster Number

  • Yue Liu
  • Ke Liang
  • Jun Xia
  • Xihong Yang
  • Sihang Zhou
  • Meng Liu
  • Xinwang Liu
  • Stan Z. Li

Deep graph clustering, which aims to group nodes into disjoint clusters by neural networks in an unsupervised manner, has attracted great attention in recent years. Although performance has been largely improved, the excellent results of existing methods heavily rely on an accurately predefined cluster number, which is not always available in real-world scenarios. To enable deep graph clustering algorithms to work without the guidance of a predefined cluster number, we propose a new deep graph clustering method termed Reinforcement Graph Clustering (RGC). In our proposed method, cluster number determination and unsupervised representation learning are unified into a uniform framework by the reinforcement learning mechanism. Concretely, discriminative node representations are first learned with a contrastive pretext task. Then, to capture the clustering state accurately with both local and global information in the graph, both node and cluster states are considered. Subsequently, at each state, the qualities of different cluster numbers are evaluated by the quality network, and the greedy action is executed to determine the cluster number. To provide feedback for the actions, a clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate different clusters. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method. The source code of RGC is shared at https://github.com/yueliu1999/RGC and a collection (papers, codes, and datasets) of deep graph clustering is shared at https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering on GitHub.

Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset

  • Jingyu Wu
  • Shi Chen
  • Shuyu Gan
  • Weijun Li
  • Changyuan Yang
  • Lingyun Sun

Co-speech gesture generation is essential for multimodal chatbots and agents. Previous research extensively studies the relationship between text, audio, and gesture. Meanwhile, to enhance cross-culture communication, culture-specific gestures are crucial for chatbots to learn cultural differences and incorporate cultural cues. However, culture-specific gesture generation faces two challenges: lack of large-scale, high-quality gesture datasets that include diverse cultural groups, and lack of generalization across different cultures. Therefore, in this paper, we first introduce a Multiple Culture Gesture Dataset (MCGD), the largest freely available gesture dataset to date. It consists of ten different cultures, over 200 speakers, and 10,000 segmented sequences. We further propose a Cultural Self-adaptive Gesture Generation Network (CSGN) that takes multimodal relationships into consideration while generating gestures using a cascade architecture and learnable dynamic weight. The CSGN adaptively generates gestures with different cultural characteristics without the need to retrain a new network. It extracts cultural features from the multimodal inputs or a cultural style embedding space with a designated culture. We broadly evaluate our method across four large-scale benchmark datasets. Empirical results show that our method achieves multiple cultural gesture generation and improves comprehensiveness of multimodal inputs. Our method improves the state-of-the-art average FGD from 53.7 to 48.0 and culture deception rate (CDR) from 33.63% to 39.87%.

DPNET: Dynamic Poly-attention Network for Trustworthy Multi-modal Classification

  • Xin Zou
  • Chang Tang
  • Xiao Zheng
  • Zhenglai Li
  • Xiao He
  • Shan An
  • Xinwang Liu

With advances in sensing technology, multi-modal data collected from different sources are increasingly available. Multi-modal classification aims to integrate complementary information from multi-modal data to improve model classification performance. However, existing multi-modal classification methods are basically weak in integrating global structural information and providing trustworthy multi-modal fusion, especially in safety-sensitive practical applications (e.g., medical diagnosis). In this paper, we propose a novel Dynamic Poly-attention Network (DPNET) for trustworthy multi-modal classification. Specifically, DPNET has four merits: (i) To capture the intrinsic modality-specific structural information, we design a structure-aware feature aggregation module to learn the corresponding structure-preserved global compact feature representation. (ii) A transparent fusion strategy based on the modality confidence estimation strategy is induced to track information variation within different modalities for dynamical fusion. (iii) To facilitate more effective and efficient multi-modal fusion, we introduce a cross-modal low-rank fusion module to reduce the complexity of tensor-based fusion and activate the implication of different rank-wise features via a rank attention mechanism. (iv) A label confidence estimation module is devised to drive the network to generate more credible confidence. An intra-class attention loss is introduced to supervise the network training. Extensive experiments on four real-world multi-modal biomedical datasets demonstrate that the proposed method achieves competitive performance compared to other state-of-the-art ones.
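
As one way to picture the cross-modal low-rank fusion with rank attention described above, here is a hypothetical PyTorch sketch; the factor shapes and the single-linear rank-attention head are assumptions, not the released architecture.

```python
# Illustrative low-rank multimodal fusion with a rank-wise attention mechanism.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims, d_out, rank=4):
        super().__init__()
        # one low-rank factor per modality: (rank, d_m + 1, d_out); the +1 appends a bias term
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, d_out) * 0.02) for d in dims])
        self.rank_attn = nn.Linear(d_out, 1)   # scores each rank-wise fused feature

    def forward(self, feats):                  # feats: list of (B, d_m) tensors
        ones = torch.ones(feats[0].size(0), 1, device=feats[0].device)
        fused = None
        for x, w in zip(feats, self.factors):
            x1 = torch.cat([x, ones], dim=1)                 # (B, d_m + 1)
            proj = torch.einsum('bd,rdo->bro', x1, w)        # (B, rank, d_out)
            fused = proj if fused is None else fused * proj  # elementwise across modalities
        attn = torch.softmax(self.rank_attn(fused), dim=1)   # (B, rank, 1), rank attention
        return (attn * fused).sum(dim=1)                     # (B, d_out)
```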

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

  • Zhiahao Zhang
  • Yiwei Chen
  • Weizhan Zhang
  • Caixia Yan
  • Qinghua Zheng
  • Qi Wang
  • Wangdu Chen

Viewport prediction is a crucial aspect of tile-based 360° video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information from different modality inputs, leading to the error accumulation problem. In this paper, we propose a tile classification based viewport prediction method with a Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract the long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two categories, user interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Compared with predicting head trajectories, choosing the future viewport based on tiles' binary classification results exhibits better robustness and interpretability. To evaluate our proposed MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, while presenting competitive computational efficiency.
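
A minimal sketch of the tile-classification-to-viewport step, assuming a rectangular tile grid of per-tile probabilities and a fixed viewport size; horizontal wrap-around of the 360° frame is ignored for brevity:

```python
# Illustrative sketch: pick the viewport as the tile window covering the most
# tiles classified as "user interested".
import numpy as np

def select_viewport(tile_probs, vp_h=4, vp_w=6, thresh=0.5):
    """tile_probs: (rows, cols) array of per-tile 'interested' probabilities."""
    interested = (tile_probs >= thresh).astype(int)
    rows, cols = interested.shape
    best_count, best_pos = -1, (0, 0)
    for r in range(rows - vp_h + 1):
        for c in range(cols - vp_w + 1):
            count = interested[r:r + vp_h, c:c + vp_w].sum()
            if count > best_count:
                best_count, best_pos = count, (r, c)
    return best_pos, best_count            # top-left tile of the chosen viewport
```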

Semantic-based Selection, Synthesis, and Supervision for Few-shot Learning

  • Jinda Lu
  • Shuo Wang
  • Xinyu Zhang
  • Yanbin Hao
  • Xiangnan He

Few-shot learning (FSL) is designed to explore the distribution of novel categories from a few samples. It is a challenging task since the classifier is usually susceptible to over-fitting when learning from limited training samples. To alleviate this phenomenon, a common solution is to obtain more training samples using a generic generation strategy in visual space. However, there are some limitations to this solution. This is because a feature extractor trained on base samples (known knowledge) tends to focus on the textures and structures of the objects it learns, which is inadequate for describing novel samples. To solve these issues, we introduce semantics and propose a Semantic-based Selection, Synthesis, and Supervision (4S) method, where semantics provide more diverse and informative supervision for recognizing novel objects. Specifically, we first utilize semantic knowledge to explore the correlation of categories in the textual space and select base categories related to the given novel category. This process can improve the efficiency of subsequent operations (synthesis and supervision). Then, we analyze the semantic knowledge to hallucinate the training samples by selectively synthesizing the contents from base and support samples. This operation not only increases the number of training samples but also takes advantage of the contents of the base categories to enhance the description of support samples. Finally, we also employ semantic knowledge as both soft and hard supervision to enrich the supervision for the fine-tuning procedure. Empirical studies on four FSL benchmarks demonstrate the effectiveness of 4S.
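
A minimal sketch of the semantic selection step, assuming class-name embeddings from some text encoder; the cosine criterion and `top_k` are illustrative choices, not the paper's exact procedure:

```python
# Illustrative selection of base categories semantically related to a novel category.
import numpy as np

def select_base_categories(novel_emb, base_embs, top_k=5):
    """novel_emb: (d,); base_embs: (num_base, d). Returns indices of related base classes."""
    novel = novel_emb / np.linalg.norm(novel_emb)
    bases = base_embs / np.linalg.norm(base_embs, axis=1, keepdims=True)
    sims = bases @ novel                      # cosine similarity in the textual space
    return np.argsort(-sims)[:top_k]          # most semantically related base categories
```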

Exploring Universal Principles for Graph Contrastive Learning: A Statistical Perspective

  • Jinyong Wen
  • Shiming Xiang
  • Chunhong Pan

Although recent advances have prompted the prosperity of graph contrastive learning, research on universal principles for model design and on desirable properties of latent representations is still inadequate. From a statistical perspective, this paper proposes two principles for guidance and constructs a general self-supervised framework for negative-free graph contrastive learning. Reformulating data augmentation as a mixture process, the first one, termed the consistency principle, lays stress on exploring and mapping cross-view common information to consistent and essence-revealing representations. For the purpose of instantiation, four statistical indicators are employed to estimate and maximize the correlation between representations from various views, whose accordant variation trend during training implies the extraction of common content. Since the consistency principle alone is insufficient, suffering from degenerate and coupled solutions, a decorrelation principle is put forward to encourage diverse and informative representations. Accordingly, two specific strategies, performing in representation space and eigen spectral space, respectively, are propounded to decouple the representation channels. Under the two principles, various combinations of concrete implementations derive a family of methods. Comparison experiments with current state-of-the-art methods demonstrate the effectiveness and sufficiency of the two principles for high-quality graph representations. Furthermore, visual studies reveal how certain principles affect the learned representations.
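
For intuition, one possible instantiation of the two principles on the representations of two views could look like the sketch below, with a Barlow-Twins-style cross-correlation objective standing in for the paper's four statistical indicators and channel decorrelation; the weight `lam` is illustrative.

```python
# Illustrative consistency + decorrelation objective over two views' node representations.
import torch

def consistency_decorrelation_loss(z1, z2, lam=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # standardize each channel
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                           # (d, d) cross-view correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()        # consistency: diagonal -> 1
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()  # decorrelation: off-diagonal -> 0
    return on_diag + lam * off_diag
```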

Text-to-Audio Generation using Instruction Guided Latent Diffusion Model

  • Deepanway Ghosal
  • Navonil Majumder
  • Ambuj Mehrish
  • Soujanya Poria

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from a textual description. Prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as T5. Consequently, our latent diffusion model (LDM)-based approach (Tango) outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training-set augmentation, whereas prior methods use random mixing.
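
A minimal sketch of level-aware mixing for training-set augmentation, assuming two mono waveforms at the same sample rate; the RMS-based gain here is a simplified stand-in for the pressure-level computation described in the paper.

```python
# Illustrative level-aware mixing (RMS used as a proxy for audio pressure level).
import numpy as np

def level_aware_mix(x1, x2, target_rel_db=0.0):
    """Scale x2 so it sits target_rel_db dB relative to x1, then mix and normalize."""
    n = min(len(x1), len(x2))
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    gain = (rms(x1) / rms(x2)) * (10.0 ** (target_rel_db / 20.0))
    mix = x1[:n] + gain * x2[:n]
    return mix / max(1.0, np.max(np.abs(mix)))   # avoid clipping
```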

DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking

  • Shangyu Xing
  • Fei Zhao
  • Zhen Wu
  • Chunhui Li
  • Jianbing Zhang
  • Xinyu Dai

Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base. Recent methods for MEL adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and entity respectively, and then compute the similarity between them to predict the correct entity. However, these methods still suffer from two limitations: first, as they fuse the features of text and image before matching, they cannot fully exploit the fine-grained alignment relations between the mention and entity. Second, their alignment is static, leading to low performance when dealing with complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach. Our code and datasets are publicly available.

MVCIR-net: Multi-view Clustering Information Reinforcement Network

  • Shaokui Gu
  • Xu Yuan
  • Liang Zhao
  • Zhenjiao Liu
  • Yan Hu
  • Zhikui Chen

Multi-view clustering (MVC) integrates information from different views to improve clustering performance compared to single-view clustering. However, the raw multi-view data in the feature space often contain information irrelevant to the clustering task, which is difficult to separate using existing methods. This irrelevant information is processed equally with the clustering information, negatively impacting the final clustering performance. In this paper, we propose a new framework, the multi-view clustering information reinforcement network (MVCIR-net), to alleviate these problems. Our method gives practical clustering meaning to the clustering distribution layer through contrastive learning. Then, the trusted-neighbor instance distribution of the normalized graph is aggregated in a debiased manner to form the clustering-information propensity distribution, and the clustering information distribution is made to fit this distribution. In addition, the coupling degree of the clustering information distributions of the same sample across different views is enhanced. Through the aforementioned strategies, the raw data are fuzzily mapped into clustering information, and the network's ability to recognize clustering information is strengthened. Finally, the fuzzily mapped data are input into the network and reconstructed to evaluate the quality of the extracted clustering information. Extensive experiments on public multi-view datasets show that MVCIR-net achieves superior clustering effectiveness and the ability to identify clustering information.

Preserving Local and Global Information: An Effective Metric-based Subspace Clustering

  • Yixi Liu
  • Yuze Tan
  • Hongjie Wu
  • Shudong Huang
  • Yazhou Ren
  • Jiancheng Lv

Subspace clustering, which recovers the subspace representation in the form of an affinity graph, has drawn considerable attention due to its effectiveness in various clustering tasks. However, existing subspace clustering methods are usually fed with raw data, which may lead to suboptimal results since it is difficult to directly and accurately depict the inherent relations between data points. In this paper, we propose a novel subspace clustering method that holistically utilizes pairwise similarity and graph geometric structure. Our model first constructs an initial subspace representation by means of self-expression, which is able to depict the global structure of the data. Then, we use an effective metric to recover an intrinsic matrix with pairwise similarity based on the obtained representation, which further preserves the local structure. Besides, we propose to facilitate the downstream subspace learning task by searching for a smooth representation of the original data, obtained by applying a low-pass filter to retain the graph geometric features. By leveraging the subtasks of learning the smooth representation, performing the subspace learning, and recovering the intrinsic similarity matrix in a unified learning framework, each subtask can be alternately boosted. Experiments on several benchmark datasets have been conducted to verify the proposed method.
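
A minimal sketch of the low-pass filtering step mentioned above, assuming a symmetric affinity matrix A and raw features X; applying (I - 0.5 L_sym) k times is one common choice and may differ from the paper's exact filter.

```python
# Illustrative graph low-pass filter that produces a smooth representation of the data.
import numpy as np

def low_pass_smooth(A, X, k=2):
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L_sym = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt   # normalized Laplacian
    H = np.eye(A.shape[0]) - 0.5 * L_sym                       # low-pass filter
    for _ in range(k):                                         # repeated smoothing
        X = H @ X
    return X
```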

FeaCo: Reaching Robust Feature-Level Consensus in Noisy Pose Conditions

  • Jiaming Gu
  • Jingyu Zhang
  • Muyang Zhang
  • Weiliang Meng
  • Shibiao Xu
  • Jiguang Zhang
  • Xiaopeng Zhang

Collaborative perception offers a promising solution to overcome challenges such as occlusion and long-range data processing. However, limited sensor accuracy leads to noisy poses that misalign observations among vehicles. To address this problem, we propose the FeaCo, which achieves robust Feature-level Consensus among collaborating agents in noisy pose conditions without additional training. We design an efficient Pose-error Rectification Module (PRM) to align derived feature maps from different vehicles, reducing the adverse effect of noisy pose and bandwidth requirements. We also provide an effective multi-scale Cross-level Attention Module (CAM) to enhance information aggregation and interaction between various scales. Our FeaCo outperforms all other localization rectification methods, as validated on both the collaborative perception simulation dataset OPV2V and real-world dataset V2V4Real, reducing heading error and enhancing localization accuracy across various error levels. Our code is available at: https://github.com/jmgu0212/FeaCo.git.

Cross-Lingual Transfer of Large Language Model by Visually-Derived Supervision Toward Low-Resource Languages

  • Masayasu Muraoka
  • Bishwaranjan Bhattacharjee
  • Michele Merler
  • Graeme Blackwood
  • Yulong Li
  • Yang Zhao

Recent progress in vision and language research has shown that visual supervision improves the performance of large language models (LLMs) in various natural language processing (NLP) tasks. In particular, the Vokenization approach [65] initiated a new way of incorporating visual information into LLM training, demonstrating the potential of visual supervision for NLP tasks in a monolingual (i.e., English) setting. Given the effectiveness of visual information in human communication among people who speak different languages, we tackle an ambitious question in this paper: can we expect visual supervision to contribute to cross-lingual transfer learning from a high-resource language to low-resource languages in NLP tasks? To study this hypothesis, we build a cross-lingual Vokenization model and train a cross-lingual LLM on three languages, English, Urdu, and Swahili, of which the last two are considered low-resource languages. The experimental results demonstrate that our visually-supervised cross-lingual transfer learning method significantly improves LLM performance in multiple cross-lingual NLP tasks, such as XNLI, NER, and TyDiQA, for low-resource languages. We also qualitatively and quantitatively demonstrate that the benefit of our approach increases as the linguistic distance between low- and high-resource languages grows.

ALEX: Towards Effective Graph Transfer Learning with Noisy Labels

  • Jingyang Yuan
  • Xiao Luo
  • Yifang Qin
  • Zhengyang Mao
  • Wei Ju
  • Ming Zhang

Graph Neural Networks (GNNs) have garnered considerable interest due to their exceptional performance in a wide range of graph machine learning tasks. Nevertheless, the majority of GNN-based approaches have been examined using well-annotated benchmark datasets, leading to suboptimal performance in real-world graph learning scenarios. To bridge this gap, the present paper investigates the problem of graph transfer learning in the presence of label noise, which transfers knowledge from a noisy source graph to an unlabeled target graph. We introduce a novel technique termed Balance Alignment and Information-aware Examination (ALEX) to address this challenge. ALEX first employs singular value decomposition to generate different views with crucial structural semantics, which help provide robust node representations using graph contrastive learning. To mitigate both label shift and domain shift, we estimate a prior distribution to build subgraphs with balanced label distributions. Building on this foundation, an adversarial domain discriminator is incorporated for the implicit domain alignment of complex multi-modal distributions. Furthermore, we project node representations into a different space, optimizing the mutual information between the projected features and labels. Subsequently, the inconsistency of similarity structures is evaluated to identify noisy samples with potential overfitting. Comprehensive experiments on various benchmark datasets substantiate the outstanding superiority of the proposed ALEX in different settings.

Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition

  • Chenwei Zhang
  • Yuxuan Hu
  • Min Yang
  • Chengming Li
  • Xiping Hu

Action recognition research has gained significant attention with two dominant unimodal approaches: skeleton-based and RGB video-based. While the former is known for its robustness in complex backgrounds, the latter provides rich environmental information useful for context-based analysis. However, the fusion of these two modalities remains an open challenge. In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition by constructing two modules: Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and Selective Kernel Temporal Convolution (SK-TC). The RE-DMT captures global spatial features, while the dynamic mask strategy and reranking strategy reduce redundancy. The SK-TC captures both long-term and short-term temporal features and enables adaptive fusion. Furthermore, in two phases, we propose a Homogeneous-Heterogeneous Multimodal Network (HHMNet) for multi-modal action recognition. In the first phase, contrastive learning is employed to achieve implicit semantic fusion within the four homogeneous skeletal modalities (joint, bone, etc.). In the second phase, the fusion of heterogeneous modalities (skeleton & RGB video) is carried out at three levels: model, feature, and decision. At the model level, the powerful skeleton-based model from the previous phase provides explicit attention guidance to the RGB video-based model. At the feature level, multi-part contrastive learning enables semantic distillation between heterogeneous modalities. At the decision level, ensemble learning combines outputs for final action recognition. We evaluate our proposed ST&ST guided HHMNet on NTU RGB+D 60 & 120 and NW-UCLA datasets and demonstrate that it achieves state-of-the-art performance in both skeleton-based and multi-modal action recognition tasks.

Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

  • Zhong Chen
  • Zhizhong Zhang
  • Xin Tan
  • Yanyun Qu
  • Yuan Xie

Large-scale Vision-Language Pre-training (VLP) models, e.g., CLIP, have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions afford richer semantic supervision while not requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) that takes full advantage of the visual-text representation ability of CLIP. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that enable subsequent unsupervised training. These descriptions complement the rigid pseudo-labels and provide an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, where we first find the correlated cross-modal prototypes by the Hungarian matching method and then swap the prototype pairs in the memory. Thus, typical contrastive learning can readily associate the cross-modal information without any modification. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we reach 54.0% Rank-1 accuracy, an improvement of over 9% against state-of-the-art approaches. Code is available at https://github.com/CzAngus/CCLNet.
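
The memory-swapping step can be pictured with the following sketch, assuming K cluster prototypes per modality stored as rows of two memory banks; the cosine cost and the in-place swap are a straightforward reading of the abstract, not the released code.

```python
# Illustrative cross-modal prototype matching (Hungarian) followed by a memory swap.
import numpy as np
from scipy.optimize import linear_sum_assignment

def swap_cross_modal_prototypes(mem_vis, mem_ir):
    """mem_vis, mem_ir: (K, d) L2-normalized prototype memories of the two modalities."""
    cost = -mem_vis @ mem_ir.T                      # negate cosine similarity for minimization
    row, col = linear_sum_assignment(cost)          # one-to-one cross-modal matching
    mem_vis_new, mem_ir_new = mem_vis.copy(), mem_ir.copy()
    mem_vis_new[row], mem_ir_new[col] = mem_ir[col], mem_vis[row]   # swap matched pairs
    return mem_vis_new, mem_ir_new
```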

DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

  • Haowen Wang
  • Zhipeng Fan
  • Zhen Zhao
  • Zhengping Che
  • Zhiyuan Xu
  • Dong Liu
  • Feifei Feng
  • Yakun Huang
  • Xiuquan Qiao
  • Jian Tang

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but the static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose the DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the fields to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.

Text-Only Training for Visual Storytelling

  • Yuechen Wang
  • Wengang Zhou
  • Zhenbo Lu
  • Houqiang Li

Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.

Saliency Prototype for RGB-D and RGB-T Salient Object Detection

  • Zihao Zhang
  • Jie Wang
  • Yahong Han

Most of the existing bi-modal (RGB-D or RGB-T) salient object detection methods attempt to integrate multimodality information through various fusion strategies. However, existing methods lack a clear definition of salient regions before feature fusion, which results in poor model robustness. To tackle this problem, we propose a novel prototype, the saliency prototype, which captures common characteristic information among salient objects. A prototype contains inherent characteristics information of multiple salient objects, which can be used for feature enhancement of various salient objects. By utilizing the saliency prototype, we provide a clearer definition of salient regions and enable the model to focus on these regions before feature fusion, avoiding the influence of complex backgrounds during the feature fusion stage. Additionally, we utilize the saliency prototypes to address the quality issue of auxiliary modality. Firstly, we apply the saliency prototypes obtained by the primary modality to perform semantic enhancement of the auxiliary modality. Secondly, we dynamically allocate weights for the auxiliary modality during the feature fusion stage in proportion to its quality. Thus, we develop a new bi-modal salient detection architecture Saliency Prototype Network (SPNet), which can be used for both RGB-D and RGB-T SOD. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate the effectiveness of the proposed approach against the state-of-the-art. Our code is available at https://github.com/ZZ2490/SPNet.

PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation

  • Zhu Liu
  • Jinyuan Liu
  • Benzhuang Zhang
  • Long Ma
  • Xin Fan
  • Risheng Liu

Infrared and visible image fusion is a powerful technique that combines complementary information from different modalities for downstream semantic perception tasks. Existing learning-based methods show remarkable performance but suffer from an inherent vulnerability to adversarial attacks, causing a significant decrease in accuracy. In this work, a perception-aware fusion framework is proposed to promote segmentation robustness in adversarial scenes. We first conduct systematic analyses of the components of image fusion, investigating their correlation with segmentation robustness under adversarial perturbations. Based on these analyses, we propose a harmonized architecture search with a decomposition-based structure to balance standard accuracy and robustness. We also propose an adaptive learning strategy to improve the parameter robustness of image fusion, which can learn effective feature extraction under diverse adversarial perturbations. Thus, the goals of image fusion (i.e., extracting complementary features from source modalities and defending against attacks) can be realized from the perspectives of architecture and learning strategy. Extensive experimental results demonstrate that our scheme substantially enhances robustness, with a gain of 15.3% segmentation mIoU in adversarial scenes compared with advanced competitors. The source codes are available at https://github.com/LiuZhu-CV/PAIF.

Cross-Modal Graph Attention Network for Entity Alignment

  • Baogui Xu
  • Chengjin Xu
  • Bing Su

The increasing popularity of multi-modal knowledge graphs (MMKGs) has led to a need for efficient entity alignment techniques that can exploit multi-modal information to integrate knowledge from different sources. GNN-based multi-modal entity alignment (MMEA) methods have achieved significant progress in the entity alignment (EA) area. However, these methods rely only on Graph Neural Networks (GNNs) to encode structural information while ignoring the visual and semantic modalities, which may lead to incomplete representations; thus, how to integrate visual and semantic information into GNN-based EA methods remains unexplored. In light of our insight that incorporating the message-passing mechanism of Graph Neural Networks to integrate multi-modal information is essential for fully exploiting the graph representation capability of GNNs, we propose a novel Cross-modal Graph attention network for Entity Alignment (XGEA) that enables visual knowledge to interact with other views of the entity, including structural and literal information. We leverage the information from one modality as complementary relation information to compute the attention of another modality in the graph attention layers, enabling the learning of entity embeddings that integrate multiple modalities. Moreover, the quantity of labeled data plays a crucial role in model performance, yet obtaining sufficient training data is expensive. To mitigate this issue, we use visual and semantic information to generate pseudo-pairs and propose a soft pseudo-labeling method for entity alignment that assigns weights to the augmented training data to balance its quantity and quality. Extensive experiments show that our XGEA achieves consistently superior performance over state-of-the-art MMEA baselines.

Intra- and Inter-Modal Curriculum for Multimodal Learning

  • Yuwei Zhou
  • Xin Wang
  • Hong Chen
  • Xuguang Duan
  • Wenwu Zhu

Multimodal learning has been widely studied and applied due to its improvement over previous unimodal tasks and its effectiveness on emerging multimodal challenges. However, it has been reported that modal encoders are under-optimized in multimodal learning in contrast to unimodal learning, especially when some modalities are dominant over others. Existing solutions to this problem suffer from two limitations: i) they merely focus on inter-modal balance, failing to consider the influence of intra-modal data on each modality; ii) their implementations heavily rely on unimodal performances or losses, thus being suboptimal for the tasks requiring modal interactions (e.g., visual question answering). To tackle these limitations, we propose I2MCL, a generic Intra- and Inter-Modal Curriculum Learning framework which simultaneously considers both data difficulty and modality balance for multimodal learning. In the intra-modal curriculum, we adopt a pretrained teacher model to obtain knowledge distillation loss as the difficulty measurer, which determines the data weights within the corresponding modality. In the inter-modal curriculum, we utilize a Pareto optimization strategy to measure and compare the gradients from distillation loss and task loss across modalities, capable of determining whether a modality should learn from the task or its teacher. Empirical experiments on various tasks including multimodal classification, visual question answering and visual entailment demonstrate that our proposed I2MCL is able to tackle the under-optimized modality problem and bring consistent improvement to multimodal learning.
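
As a rough illustration of the intra-modal curriculum idea only (the inter-modal Pareto step is not shown), the sketch below assumes a frozen unimodal teacher and maps each sample's distillation loss to a data weight; the exponential mapping is an illustrative choice, not necessarily the paper's.

```python
# Illustrative per-sample curriculum weights derived from a knowledge-distillation loss.
import torch
import torch.nn.functional as F

def curriculum_weights(student_logits, teacher_logits, temperature=0.5):
    """Returns per-sample weights in (0, 1]; easy samples (low KD loss) get weight near 1."""
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction='none').sum(dim=1)       # per-sample distillation loss
    return torch.exp(-kd / temperature)              # monotone difficulty-to-weight mapping
```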

Graph based Spatial-temporal Fusion for Multi-modal Person Re-identification

  • Yaobin Zhang
  • Jianming Lv
  • Chen Liu
  • Hongmin Cai

As a challenging task, unsupervised person re-identification (Re-ID) aims to optimize the pedestrian matching model based on unlabeled image frames from surveillance videos. Recently, fusing the spatio-temporal clues of pedestrians has been proven effective in improving classification performance. However, most of these methods adopt hard combination approaches that multiply the visual scores by the spatio-temporal scores, which are sensitive to the noise caused by imprecise estimation of spatio-temporal patterns in unlabeled datasets and limit the advantage of the fusion model. In this paper, we propose a Graph based Spatio-Temporal Fusion model for high-performance multi-modal person Re-ID, namely G-Fusion, to mitigate the impact of noise. In particular, we construct a graph of pedestrian images by selecting neighboring nodes based on visual information and the transition time between cameras. Then we use a randomly initialized two-layer GraphSAGE model to obtain the multi-modal affinity matrix between images, and deploy distillation learning to optimize the visual model by learning the affinity between the nodes. Finally, a graph-based multi-modal re-ranking method is deployed to make the decision in the testing phase for precise person Re-ID. Comprehensive experiments are conducted on two large-scale Re-ID datasets, and the results show that our method achieves a significant improvement in performance when combined with SOTA unsupervised person Re-ID methods. Specifically, the mAP scores reach 92.2% and 80.4% on the Market-1501 and MSMT17 datasets, respectively.
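
A rough sketch of the graph construction, under the assumption that an edge requires both high visual similarity and a plausible camera transition time; the threshold `max_time` and the k-nearest-neighbor rule are illustrative.

```python
# Illustrative construction of a pedestrian graph from visual and spatio-temporal cues.
import numpy as np

def build_fusion_graph(vis_sim, transit_time, k=10, max_time=600.0):
    """vis_sim: (N, N) cosine similarities; transit_time: (N, N) seconds between captures."""
    n = vis_sim.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        order = np.argsort(-vis_sim[i])          # visually nearest neighbors first
        picked = [j for j in order if j != i and transit_time[i, j] <= max_time][:k]
        adj[i, picked] = True
    return np.logical_or(adj, adj.T)             # symmetrize before the GraphSAGE step
```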

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

  • Yuanbin Wang
  • Shaofei Huang
  • Yulu Gao
  • Zhen Wang
  • Rui Wang
  • Kehua Sheng
  • Bo Zhang
  • Si Liu

Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set, which limits their application in real-world scenarios due to the lack of generalization ability. Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in zero-shot 2D vision tasks, but they still cannot be applied to 3D semantic segmentation directly. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder at both the feature and output levels. Both feature-level and output-level alignments are conducted between the 2D and 3D encoders for effective knowledge transfer. Concretely, a Multi-granularity Cross-modal Feature Alignment (MCFA) module is proposed to align 2D and 3D features from global semantic and local position perspectives for feature-level alignment. For the output level, per-pixel pseudo labels of unseen classes are extracted using the pre-trained CLIP model as supervision for the 3D segmentation model to mimic the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular benchmarks of point cloud segmentation. Our method significantly outperforms previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on SemanticKITTI and 31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.

Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

  • Zhaojian Li
  • Bin Zhao
  • Yuan Yuan

Audiovisual self-supervised representation learning has made significant strides in various audiovisual tasks. Existing methods mostly focus on single-representation modeling between the audio and visual modalities, ignoring the complex correspondence between them, and are therefore unable to perform cross-modal understanding in more natural audiovisual scenes. Several biological studies have shown that human learning is influenced by multi-layered synchronization of perception. To this end, inspired by biology, we propose to exploit the naturally existing relationships between the audio and visual modalities to learn audiovisual representations under multi-layer perceptual integration. Firstly, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Secondly, we propose a self-supervised audiovisual multi-representation learning approach, which simultaneously learns the perceptual relationship between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, an audiovisual object detection module is proposed, which detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss to learn a subspace-orthogonal representation space that makes representation relations more discriminative. Finally, experimental results demonstrate that collectively understanding the semantic, temporal, and spatial correspondence between audiovisual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.

DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

  • Junyin Wang
  • Chenghu Du
  • Hui Li
  • Shengwu Xiong

Approaches that lift surround-view camera images into a 3D feature space via depth transformation and fuse them with point cloud features are highly regarded. However, transforming 2D features into 3D feature space by means of predefined sampling points and a depth distribution happens throughout the scene, and this process generates a large number of redundant features. In addition, multimodal feature fusion unified in 3D space often happens immediately before the downstream task, ignoring interactive fusion between different scales. To this end, we design a new framework that endows images with 3D geometric perception information and unifies them in voxel space to accomplish multi-scale interactive fusion, and we mitigate the misalignment between modal features by exploiting geometric relationships between voxel features. The method has two main designs. First, a Segmentation-guided Image View Transformation module is used to accurately transform the pixel region containing the object into a 3D pseudo-point voxel space with the help of a depth distribution. This allows subsequent feature fusion to be performed on unified voxel features. Second, a Voxel-centric Consistent Fusion module is used to alleviate the errors caused by depth estimation and to achieve better feature fusion between the unified modalities. Through extensive experiments on the KITTI and nuScenes datasets, we validate the effectiveness of our camera-LiDAR fusion method. Our proposed approach shows competitive performance on both datasets and outperforms state-of-the-art methods in certain classes of 3D object detection benchmarks. Our code is available at https://github.com/no-Name128/DLFusion.

Automatic Network Architecture Search for RGB-D Semantic Segmentation

  • Wenna Wang
  • Tao Zhuo
  • Xiuwei Zhang
  • Mingjun Sun
  • Hanlin Yin
  • Yinghui Xing
  • Yanning Zhang

Recent RGB-D semantic segmentation networks are usually manually designed. However, due to limited human efforts and time costs, their performance might be inferior for complex scenarios. To address this issue, we propose the first Neural Architecture Search (NAS) method that designs the network automatically. Specifically, the target network consists of an encoder and a decoder. The encoder is designed with two independent branches, where each branch specializes in extracting features from RGB and depth images, respectively. The decoder fuses the features and generates the final segmentation result. Besides, for automatic network design, we design a grid-like network-level search space combined with a hierarchical cell-level search space. By further developing an effective gradient-based search strategy, the network structure with hierarchical cell architectures is discovered. Extensive results on two datasets show that the proposed method outperforms the state-of-the-art approaches, which achieves a mIoU score of 55.1% on the NYU-Depth v2 dataset and 50.3% on the SUN-RGBD dataset.

Attentive Alignment Network for Multispectral Pedestrian Detection

  • Nuo Chen
  • Jin Xie
  • Jing Nie
  • Jiale Cao
  • Zhuang Shao
  • Yanwei Pang

Multispectral pedestrian detection is of great importance in various around-the-clock applications, e.g., self-driving and video surveillance. Fusing the features from RGB images and thermal infrared (TIR) images to exploit the complementary information between different modalities is one of the most effective ways to improve multispectral pedestrian detection performance. However, the spatial misalignment between different modalities and the varying modality reliability would introduce harmful information during feature fusion, limiting the performance of multispectral pedestrian detection. To address the above issues, we propose an attentive alignment network, consisting of an attentive position alignment (APA) module and an attentive modality alignment (AMA) module. Our APA module emphasizes pedestrian regions while aligning the pedestrian regions between different modalities. Our AMA module utilizes a channel-wise attention mechanism with illumination guidance to eliminate the imbalance between different modalities. The experiments are conducted on two widely used multispectral detection datasets, KAIST and CVC-14. Our approach surpasses the current state-of-the-art performance on both datasets.

FedAA: Using Non-sensitive Modalities to Improve Federated Learning while Preserving Image Privacy

  • Dong Chen
  • Siliang Tang
  • Zijin Shen
  • Guoming Wang
  • Jun Xiao
  • Yueting Zhuang
  • Carl Yang

Federated learning aims to train a better global model without sharing the sensitive training samples (usually images) of local clients. Since the sample distributions in local clients tend to differ from each other (i.e., non-IID), one of the major challenges for federated learning is to alleviate model degradation when aggregating local models. The degradation can be attributed to the weight divergence that quantifies the difference of local models arising from different training processes. Furthermore, non-IID data also result in feature space heterogeneity during local training, making neurons of local models in the same location have different functions and further exacerbating weight divergence. In this paper, we demonstrate that the problem can be solved by sharing information from the non-sensitive modality (e.g., metadata, non-sensitive descriptions, etc.) while keeping the sensitive information of images protected. In particular, we propose Federated Learning with Adversarial Example and Adversarial Identifier (FedAA), which trains adversarial examples based on the shared non-sensitive modality to fine-tune local models before global aggregation. The training of local models is enhanced by client identifiers that discriminate the source of inputs, forcing different local models to produce similar outputs and become more homogeneous during local training. Experiments show that FedAA significantly outperforms recent non-IID federated learning algorithms while preserving image privacy by sharing information from non-sensitive modalities.

Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning

  • Mengze Li
  • Haoyu Zhang
  • Juncheng Li
  • Zhou Zhao
  • Wenqiao Zhang
  • Shengyu Zhang
  • Shiliang Pu
  • Yueting Zhuang
  • Fei Wu

This paper addresses the Unsupervised Domain Adaptation (UDA) for the dense frame prediction task - Video Object Grounding (VOG). This investigation springs from the recognition of the limited generalization capabilities of data-driven approaches when confronted with unseen test scenarios. We set the goal of enhancing the adaptability of the source-dominated model from a labeled domain to the unlabeled target domain through re-training on pseudo-labels (i.e., predicted boxes of language-described objects). Given the potential for source-domain biases in the pseudo-label generation, we decompose the labeling refinement as two cascaded debiasing subroutines: (1) we develop a discarded training strategy to correct the Biased Proposal Selection by filtering out the examples with uncertain proposals selected from the proposal (candidate box) set. The identifier of these uncertain examples is the discordance between the predictions of the source-dominated model and those of a target-domain clustered classifier, which remains free from the source-domain bias. (2) With the refined proposals as a foundation, we measure Grounding Coordinate Offset based on the semantic distance of the model's prediction across domains, based on which we alleviate source-domain bias in the target model through adversarial learning. To verify the superiority of the proposed method, we collected two UDA-VOG datasets called I2O-VOG and R2M-VOG by manually dividing and combining the well-known VOG datasets. The extensive experiments on them show our model significantly outperforms SOTA methods by a large margin.

RAHNet: Retrieval Augmented Hybrid Network for Long-tailed Graph Classification

  • Zhengyang Mao
  • Wei Ju
  • Yifang Qin
  • Xiao Luo
  • Ming Zhang

Graph classification is a crucial task in many real-world multimedia applications, where graphs can represent various multimedia data types such as images, videos, and social networks. Previous efforts have applied graph neural networks (GNNs) in balanced situations where the class distribution is balanced. However, real-world data typically exhibit long-tailed class distributions, resulting in a bias towards the head classes when using GNNs and limited generalization ability over the tail classes. Recent approaches mainly focus on re-balancing different classes during model training, which fails to explicitly introduce new knowledge and sacrifices the performance of the head classes. To address these drawbacks, we propose a novel framework called Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature extractor and an unbiased classifier in a decoupled manner. In the feature extractor training stage, we develop a graph retrieval module to search for relevant graphs that directly enrich the intra-class diversity for the tail classes. Moreover, we innovatively optimize a category-centered supervised contrastive loss to obtain discriminative representations, which is more suitable for long-tailed scenarios. In the classifier fine-tuning stage, we balance the classifier weights with two weight regularization techniques, i.e., Max-norm and weight decay. Experiments on various popular benchmarks verify the superiority of the proposed method against state-of-the-art approaches.
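
The classifier fine-tuning stage mentions Max-norm and weight decay; a minimal sketch of the Max-norm constraint on a linear classifier (weight decay itself would normally be handled by the optimizer) is shown below.

```python
# Illustrative Max-norm regularization: clip each class weight vector to a fixed radius
# after an optimizer step, so head classes cannot dominate purely through weight norms.
import torch

@torch.no_grad()
def apply_max_norm(classifier: torch.nn.Linear, max_norm: float = 1.0):
    w = classifier.weight                                   # (num_classes, feat_dim)
    norms = w.norm(dim=1, keepdim=True).clamp(min=1e-12)
    classifier.weight.copy_(w * (norms.clamp(max=max_norm) / norms))
```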

That's What I Said: Fully-Controllable Talking Face Generation

  • Youngjoon Jang
  • Kyeongha Rho
  • Jongbin Woo
  • Hyeongkeun Lee
  • Jihwan Park
  • Youshin Lim
  • Byeong-Yeol Kim
  • Joon Son Chung

The goal of this paper is to synthesise talking faces with controllable facial motions. To achieve this goal, we propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities. The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information. To disentangle identity and motion, we introduce an orthogonality constraint between the two different latent spaces. From this, our method can generate natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in terms of both visual quality and lip-sync score. To the best of our knowledge, we are the first to develop a talking face generation framework that can accurately manifest full target facial motions including lip, head pose, and eye movements in the generated video without any additional supervision beyond RGB video with audio.

Event-Diffusion: Event-Based Image Reconstruction and Restoration with Diffusion Models

  • Quanmin Liang
  • Xiawu Zheng
  • Kai Huang
  • Yan Zhang
  • Jie Chen
  • Yonghong Tian

Event cameras offer the advantages of low latency, high temporal resolution, and HDR compared to conventional cameras. Due to the asynchronous and sparse nature of events, many existing algorithms cannot be directly applied, necessitating the reconstruction of intensity frames. However, existing reconstruction methods often result in artifacts and edge blurring due to noise and event accumulation. In this paper, we argue that the key to event-based image reconstruction is to enhance the edge information of objects and restore the artifacts in the reconstructed images. To explain, edge information is one of the most important features in the event stream, providing information on the shape and contour of objects. Considering the extraordinary capabilities of Denoising Diffusion Probabilistic Models (DDPMs) in image generation, reconstruction, and restoration, we propose a new framework that incorporates them into the reconstruction pipeline to obtain high-quality results, effectively removing artifacts and blur from reconstructed images. Specifically, we first extract edge information from the event stream using the proposed event-based denoising method, which employs the contrast maximization framework to remove noise from the event stream and extract clear object edge information. The edge information is then fed into our diffusion model to enhance the edges of objects in the reconstructed images, thus improving the restoration effect. Experimental results show that our method achieves significant improvements in the mean squared error (MSE), structural similarity (SSIM), and perceptual similarity (LPIPS) metrics, with average improvements of 40%, 15%, and 25%, respectively, compared to previous state-of-the-art models, and has good generalization performance.

Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

  • Han Fang
  • Zhifei Yang
  • Xianghao Zang
  • Chao Ban
  • Zhongjiang He
  • Hao Sun
  • Lanxiang Zhou

Recently, masked video modeling has been widely explored and has improved the model's understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), built on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover the masked semantic information. The recovery mechanism is achieved by aligning the masked content with the unmasked visual regions and the corresponding textual context, which makes the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts to ignore regions with low-informed masks. Furthermore, we design co-learning to incorporate video cues under different masks and learn more aligned representations. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks, including MSR-VTT, LSMDC, ActivityNet, and DiDeMo.

Self-Contrastive Graph Diffusion Network

  • Yixuan Ma
  • Kun Zhan

Augmentation techniques and sampling strategies are crucial in contrastive learning, but in most existing works, augmentation techniques require careful design, and the sampling strategies can capture only a small amount of intrinsic supervision information. Additionally, existing methods require complex designs to obtain two different representations of the data. To overcome these limitations, we propose a novel framework called the Self-Contrastive Graph Diffusion Network (SCGDN). Our framework consists of two main components: the Attentional Module (AttM) and the Diffusion Module (DiFM). AttM aggregates higher-order structure and feature information to obtain an excellent embedding, while DiFM balances the state of each node in the graph through Laplacian diffusion learning and allows the cooperative evolution of adjacency and feature information in the graph. Unlike existing methodologies, SCGDN is an augmentation-free approach that avoids "sampling bias" and semantic drift, without the need for pre-training. We conduct high-quality sampling based on structure and feature information. If two nodes are neighbors, they are considered positive samples of each other. If two disconnected nodes are also unrelated on the kNN graph, they are considered negative samples for each other. The contrastive objective makes reasonable use of our proposed sampling strategies, and the redundancy reduction term minimizes redundant information in the embedding while retaining more discriminative information. In this novel framework, the graph self-contrastive learning paradigm proves powerful. The results show that SCGDN consistently outperforms both contrastive methods and classical methods. The source code is available at https://github.com/kunzhan/SCGDN.
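
The stated sampling rule translates directly into boolean masks; a minimal sketch, assuming an adjacency matrix and a precomputed feature-space kNN graph:

```python
# Illustrative positive/negative pair sampling: neighbors are positives; pairs that are
# disconnected in the graph AND unrelated on the kNN graph are negatives.
import numpy as np

def sample_pairs(A, knn):
    """A, knn: (N, N) boolean matrices. Returns boolean positive / negative pair masks."""
    no_self = ~np.eye(A.shape[0], dtype=bool)
    positives = A & no_self                       # graph neighbors
    negatives = (~A) & (~knn) & no_self           # far in both structure and feature space
    return positives, negatives
```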

Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation

  • Yiyang Chen
  • Shanshan Zhao
  • Changxing Ding
  • Liyao Tang
  • Chaoyue Wang
  • Dacheng Tao

In recent years, cross-modal domain adaptation has been studied on paired 2D image and 3D LiDAR data to ease the labeling costs for 3D LiDAR semantic segmentation (3DLSS) in the target domain. However, in such a setting the paired 2D and 3D data in the source domain are still collected with additional effort. Since the 2D-3D projections can enable the 3D model to learn semantic information from the 2D counterpart, we ask whether we could further remove the need for source 3D data and rely only on the source 2D images. To answer this, this paper studies a new 3DLSS setting where a 2D dataset (source) with semantic annotations and paired but unannotated 2D image and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL). Specifically, our CoMoDaL aims at modeling 1) inter-modal cross-domain distillation between the unpaired source 2D image and target 3D LiDAR data, and 2) intra-domain cross-modal guidance between the target 2D image and 3D LiDAR data pair. In CoMoDaL, we propose to apply several constraints, such as point-to-pixel and prototype-to-pixel alignments, to associate the semantics in different modalities and domains by constructing mixed samples in the two modalities. The experimental results on several datasets show that in the proposed setting, the developed CoMoDaL can achieve segmentation without the supervision of labeled LiDAR data. Ablations are also conducted to provide more analysis. Code will be made publicly available.

Multi-View Representation Learning via View-Aware Modulation

  • Ren Wang
  • Haoliang Sun
  • Xiushan Nie
  • Yuxiu Lin
  • Xiaoming Xi
  • Yilong Yin

Multi-view (representation) learning derives an entity's representation from its multiple observable views to facilitate various downstream tasks. The most challenging topic is how to model unobserved entities and their relationships to specific views. To this end, this work proposes a novel multi-view learning method using a View-Aware parameter Modulation mechanism, termed VAM. The key idea is to use trainable parameters as proxies for unobserved entities and views, such that modeling entity-view relationships is converted into modeling the relationship between proxy parameters. Specifically, we first build a set of trainable parameters to learn a mapping from multi-view data to the unified representation as the entity proxy. Then we learn a prototype for each view and design a Modulation Parameter Generator (MPG) that learns a set of view-aware scale and shift parameters from prototypes to modulate the entity proxy and obtain view proxies. By constraining the representativeness, uniqueness, and simplicity of the proxies and proposing an entity-view contrastive loss, parameters are alternatively updated. We end up with a set of discriminative prototypes, view proxies, and an entity proxy that are flexible enough to yield robust representations for out-of-sample entities. Extensive experiments on five datasets show that the results of our VAM outperform existing methods in both classification and clustering tasks.
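
A minimal sketch of the view-aware modulation idea, assuming one learnable prototype per view and a single linear layer as the Modulation Parameter Generator (MPG); the (1 + scale) * x + shift form is a common modulation choice and an assumption here, not necessarily the paper's exact formulation.

```python
# Illustrative view-aware modulation: view prototypes generate scale/shift parameters
# that modulate a shared entity proxy into per-view proxies.
import torch
import torch.nn as nn

class ViewAwareModulation(nn.Module):
    def __init__(self, dim, num_views):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_views, dim) * 0.02)  # one per view
        self.mpg = nn.Linear(dim, 2 * dim)        # generates (scale, shift) for each view

    def forward(self, entity_proxy):              # entity_proxy: (B, dim)
        scale, shift = self.mpg(self.prototypes).chunk(2, dim=-1)   # (V, dim) each
        # broadcast (B, 1, dim) against (V, dim) -> (B, V, dim) view proxies
        return entity_proxy.unsqueeze(1) * (1 + scale) + shift
```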

Uni-Dual: A Generic Unified Dual-Task Medical Self-Supervised Learning Framework

  • Boxiang Yun
  • Xingran Xie
  • Qingli Li
  • Yan Wang

RGB images and medical hyperspectral images (MHSIs) are two widely-used modalities in computational pathology. The former is cheap, easy, and fast to obtain but lacks pathological information such as physiochemical state. The latter is an emerging modality that captures electromagnetic radiation-matter interactions but suffers from problems such as high time cost and low spatial resolution. In this paper, we bring forward a unified dual-task multi-modality self-supervised learning (SSL) framework, called Uni-Dual, which makes the most of both paired and unpaired RGB-MHSIs. Concretely, we design a unified SSL paradigm for RGB images and MHSIs. Two tasks are proposed: (1) a discrimination learning task that learns high-level semantics by mining the cross-correlation across unpaired RGB-MHSIs, and (2) a reconstruction learning task that models low-level stochastic variations by furthering the interaction across RGB-MHSI pairs. Our Uni-Dual enjoys the following benefits: (1) it is a unified model that can be easily transferred to different downstream tasks on various modality combinations; (2) it considers multi-constituent and structured information learning from MHSIs and RGB images for low-cost, high-precision clinical purposes. Experiments conducted on various downstream tasks with different modalities show that the proposed Uni-Dual substantially outperforms other competitive SSL methods.

Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

  • Yifan Dong
  • Suhang Wu
  • Fandong Meng
  • Jie Zhou
  • Xiaoli Wang
  • Jianxin Lin
  • Jinsong Su

Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair. In this regard, dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, there are still two main drawbacks: 1) only a limited number of sources, such as image captions, can be utilized to provide auxiliary information, which may not be sufficient for the subsequent keyphrase generation; 2) the input text and image are often not perfectly matched, and thus the image may introduce noise into the model. To address these limitations, in this paper, we propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge but also effectively filters image noise. First, we introduce external visual entities of the image as supplementary input to the model, which benefits the cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. In particular, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the previously mentioned correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves state-of-the-art performance. Our code is available at https://github.com/DeepLearnXMU/MM-MKP.

Multi-modal Social Bot Detection: Learning Homophilic and Heterophilic Connections Adaptively

  • Shilong Li
  • Boyu Qiao
  • Kun Li
  • Qianqian Lu
  • Meng Lin
  • Wei Zhou

The detection of social bots has become a critical task in maintaining the integrity of social media. As social bots evolve continually, they primarily evade detection by imitating human features and engaging in interactions with humans. To reduce the impact of social bots imitating human features, also known as feature camouflage, existing methods mainly utilize multi-modal user information for detection, especially GNN-based methods that utilize additional topological structure information. However, these methods ignore relation camouflage, which involves disguising through interactions with humans. We find that relation camouflage results in both homophilic connections formed by nodes of the same type and heterophilic connections formed by nodes of different types in social networks. Existing GNN-based detection methods assume all connections are homophilic while ignoring the differences among neighbors in heterophilic connections, which leads to poor detection performance for bots with relation camouflage. To address this, we propose a multi-modal social bot detection method that learns homophilic and heterophilic connections adaptively (BothH for short). Specifically, we first determine whether each connection is homophilic or heterophilic with a connection classifier, and then we design a novel message propagating strategy that can learn homophilic and heterophilic connections adaptively. We conduct experiments on mainstream datasets, and the results show that our model is superior to state-of-the-art methods.

CPU: Codebook Lookup Transformer with Knowledge Distillation for Point Cloud Upsampling

  • Weibing Zhao
  • Haiming Zhang
  • Chaoda Zheng
  • Xu Yan
  • Shuguang Cui
  • Zhen Li

Point clouds produced by 3D scanning are typically sparse, non-uniform, and noisy. Existing upsampling techniques directly learn the mapping from a sparse point set to a dense point set, which is often under-determined and ill-posed. To reduce the uncertainty and ambiguity of the upsampling mapping, this paper proposes a generic three-stage vector-quantization framework, which incorporates a Codebook lookup Transformer and knowledge distillation for Point Cloud Upsampling, named CPU. The proposed CPU reformulates the upsampling task into a relatively determinate code prediction task within a small, discrete proxy space. Since the traditional vector-quantization methods cannot be directly applied to point cloud upsampling scenarios, we introduce a knowledge distillation training scheme that facilitates efficient codebook learning and ensures full utilization of codebook entries. Specifically, we adopt a teacher-student training paradigm to avoid model collapse during codebook learning. In the first stage, we pre-train a vanilla auto-encoder of the dense point set as the teacher model, which provides rich guidance features to ensure sufficient codebook learning. In the second stage, we train a vector-quantized auto-encoder as a student model to capture high-fidelity geometric priors into a learned codebook with the aid of distillation. In the third stage, we propose a Codebook Lookup Transformer to model the global context of the sparse point set and predict the code indices. Then the coarse features of the sparse point set can be quantized and substituted by looking up the indices in the learned codebook. Benefiting from the expressive codebook priors and the distillation training scheme, the proposed CPU outperforms state-of-the-art methods quantitatively and qualitatively.
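
A minimal sketch of the codebook-lookup step described above (not the CPU implementation; the codebook and feature shapes are hypothetical): coarse per-point features are replaced by their nearest entries in a learned codebook.

    import torch

    def codebook_lookup(features: torch.Tensor, codebook: torch.Tensor):
        """features: (N, D) coarse point features; codebook: (K, D) learned entries."""
        dists = torch.cdist(features, codebook)        # (N, K) pairwise distances
        indices = dists.argmin(dim=1)                  # code index per point
        quantized = codebook[indices]                  # substitute features via lookup
        return quantized, indices

    features = torch.randn(1024, 64)                   # hypothetical coarse features
    codebook = torch.randn(512, 64)                    # hypothetical learned codebook
    quantized, idx = codebook_lookup(features, codebook)
    print(quantized.shape, idx.shape)                  # torch.Size([1024, 64]) torch.Size([1024])

Per the abstract, at inference the code indices are predicted by the Codebook Lookup Transformer rather than by a nearest-neighbour search; the snippet only illustrates the lookup-and-substitute step itself.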

Your tone speaks louder than your face! Modality Order Infused Multi-modal Sarcasm Detection

  • Mohit Tomar
  • Abhisek Tiwari
  • Tulika Saha
  • Sriparna Saha

Figurative language is an essential component of human communication, and detecting sarcasm in text has become a challenging yet highly popular task in natural language processing. As humans, we rely on a combination of visual and auditory cues, such as facial expressions and tone of voice, to comprehend a message. Our brains are implicitly trained to integrate information from multiple senses to form a complete understanding of the message being conveyed, a process known as multi-sensory integration. The combination of different modalities not only provides additional information but also amplifies the information conveyed by each modality in relation to the others. Thus, the infusion order of the different modalities also plays a significant role in multimodal processing. In this paper, we investigate the impact of different modality infusion orders for identifying sarcasm in dialogues. We propose a modality order-driven module integrated into a transformer network, MO-Sarcation, which fuses modalities in an ordered manner. Our model outperforms several state-of-the-art models by 1-3% across various metrics, demonstrating the crucial role of modality order in sarcasm detection. The obtained improvements and detailed analysis show that audio tone should be infused with textual content, followed by visual information, to identify sarcasm efficiently. The code and dataset are available at https://github.com/mohit2b/MO-Sarcation.

Fine-Grained Multimodal Named Entity Recognition and Grounding with a Generative Framework

  • Jieming Wang
  • Ziyan Li
  • Jianfei Yu
  • Li Yang
  • Rui Xia

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a pair of text and image. However, most previous MNER works focus on extracting entities in the form of text but fail to ground text symbols to their corresponding visual objects. Moreover, existing MNER studies primarily classify entities into four coarse-grained entity types, which are often insufficient to map them to their real-world referents. To address these limitations, we introduce a task named Fine-grained Multimodal Named Entity Recognition and Grounding (FMNERG) in this paper, which aims to simultaneously extract named entities in text, their fine-grained entity types, and their grounded visual objects in the image. Moreover, we construct a Twitter dataset for the FMNERG task and further propose a T5-based multImodal GEneration fRamework (TIGER), which formulates FMNERG as a generation problem by converting all the entity-type-object triples into a target sequence and adapts the pre-trained sequence-to-sequence model T5 to directly generate the target sequence from an image-text input pair. Experimental results demonstrate that TIGER performs significantly better than a number of baseline systems on the annotated Twitter dataset. Our dataset annotation and source code are publicly released at https://github.com/NUSTM/FMNERG.

SkipStreaming: Pinpointing User-Perceived Redundancy in Correlated Web Video Streaming through the Lens of Scenes

  • Wei Liu
  • Xinlei Yang
  • Zhenhua Li
  • Feng Qian

When streaming over the web, correlated videos (e.g., a series of TV episodes) tend to contain considerable redundant clips, mostly in the intros, outros, recaps, and commercial breaks, leading to a waste of network traffic and playback time. Mainstream video content providers have taken various measures to identify these clips, but these measures often result in unexpected and undesirable user experiences. In this paper, we conduct a large-scale, crowdsourced study to demystify the root causes of these poor experiences. Driven by the findings, we propose to reconsider the problem from the novel perspective of scenes, without going through excessive video frames, paying special attention to how the contents of correlated videos are organized during video production. To enable this idea, we design efficient approaches for separating video scenes and identifying visual redundancy. We build an open-source system to embody our design, which achieves fast (e.g., taking ~38 seconds to process a 45-minute video using a common commodity server) and accurate (incurring only 770-ms deviation on average) redundancy recognition on representative workloads.
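
A toy sketch of the scene-level redundancy idea (purely illustrative, not the paper's system; the fingerprints and episode ids are made up): each scene is reduced to a compact fingerprint, and fingerprints that recur across episodes of the same series are flagged as candidate redundant clips such as intros and outros.

    from collections import defaultdict

    def find_redundant_scenes(episodes: dict[str, list[str]]) -> dict[str, set[str]]:
        """episodes maps an episode id to its ordered list of per-scene fingerprints."""
        seen = defaultdict(set)                        # fingerprint -> episodes containing it
        for ep_id, fingerprints in episodes.items():
            for fp in fingerprints:
                seen[fp].add(ep_id)
        # a scene is a redundancy candidate if it recurs in more than one episode
        return {fp: eps for fp, eps in seen.items() if len(eps) > 1}

    episodes = {                                       # hypothetical per-scene fingerprints
        "ep1": ["intro_fp", "scene_a_fp", "outro_fp"],
        "ep2": ["intro_fp", "scene_b_fp", "outro_fp"],
    }
    print(find_redundant_scenes(episodes))             # intro_fp and outro_fp are flagged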

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

  • Zhao Yang
  • Bing Su
  • Ji-Rong Wen

Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. First, they cannot directly generate coherent motions and require additional operations, such as interpolation, to process the generated motions. Second, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at https://github.com/yangzhao1230/PCMDM.
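
The sketch below illustrates the general past-inpainting style of sampling (a RePaint-like scheme) under stated assumptions; it is not the authors' sampler, and denoise_step is a hypothetical stand-in for a trained reverse-diffusion step. At every step, the frames belonging to the already-known past motion are overwritten with a noised copy of that motion so the newly sampled frames remain coherent with it.

    import torch

    def past_inpainting_sample(denoise_step, past, total_len, dim, steps=50):
        """past: (P, dim) known motion frames; returns (total_len, dim) completed motion."""
        p = past.shape[0]
        x = torch.randn(total_len, dim)                          # start from pure noise
        for t in reversed(range(steps)):
            noise_level = t / steps
            x[:p] = past + noise_level * torch.randn_like(past)  # re-impose the (noised) past
            x = denoise_step(x, t)                               # one reverse-diffusion step
        x[:p] = past                                             # paste the clean past back
        return x

    dummy_denoise = lambda x, t: 0.98 * x                        # toy stand-in for a real model
    out = past_inpainting_sample(dummy_denoise, torch.randn(20, 66), total_len=60, dim=66)
    print(out.shape)                                             # torch.Size([60, 66])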

Layout Sequence Prediction From Noisy Mobile Modality

  • Haichao Zhang
  • Yi Xu
  • Hongsheng Lu
  • Takayuki Shimizu
  • Yun Fu

Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics. Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities. Nevertheless, real-world situations often involve obstructed cameras, missed objects, or objects out of sight due to environmental factors, leading to incomplete or noisy trajectories. To overcome these limitations, we propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories. LTrajDiff utilizes sensor data from mobile phones to surmount out-of-sight constraints, albeit introducing new challenges such as modality fusion, noisy data, and the absence of spatial layout and object size information. We employ a denoising diffusion model to predict precise layout sequences from noisy mobile data using a coarse-to-fine diffusion strategy, incorporating a Random Mask Strategy, a Siamese Masked Encoding Module, and a Modality Fusion Module. Our model predicts layout sequences by implicitly inferring object size and projection status from a single reference timestamp or significantly obstructed sequences. It achieves state-of-the-art results in randomly obstructed experiments and outperforms other baselines in extremely short input experiments, illustrating the effectiveness of leveraging noisy mobile data for layout sequence prediction. In summary, our approach offers a promising solution to the challenges faced by layout sequence and trajectory prediction models in real-world settings, paving the way for utilizing sensor data from mobile phones to accurately predict pedestrian bounding box trajectories. To the best of our knowledge, this is the first work to address severely obstructed and extremely short layout sequences by combining vision with a noisy mobile modality.

Graph-Based Video-Language Learning with Multi-Grained Audio-Visual Alignment

  • Chenyang Lyu
  • Wenxi Li
  • Tianbo Ji
  • Longyue Wang
  • Liting Zhou
  • Cathal Gurrin
  • Linyi Yang
  • Yi Yu
  • Yvette Graham
  • Jennifer Foster

Video-language learning has attracted significant attention in the fields of multimedia, computer vision, and natural language processing in recent years. One of the key challenges in this area is how to effectively integrate visual and linguistic information so that machines can understand video content and query information. In this work, we leverage graph-based representations and multi-grained audio-visual alignment to address this challenge. First, our approach transforms the video and query inputs into visual-scene graphs and semantic role graphs using a visual-scene parser and a semantic role labeler, respectively. These graphs are then encoded using graph neural networks to obtain enriched representations and combined into a video-query joint representation that enhances the semantic expressivity of the inputs. Second, to achieve accurate matching of relevant parts of audio and visual features, we propose a multi-grained alignment module that aligns the audio and visual features at multiple scales. This enables us to effectively fuse the audio and visual information in a way that is consistent with the semantic-level information captured by the graph-based representations. Experiments on five representative datasets collected for Video Retrieval and Video Question Answering tasks show that our approach outperforms existing methods on several metrics. Our extensive ablation studies demonstrate the effectiveness of graph-based representation and multi-grained audio-visual alignment.

Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

  • Meng Liu
  • Fenglei Zhang
  • Xin Luo
  • Fan Liu
  • Yinwei Wei
  • Liqiang Nie

Video question answering is an increasingly vital research field, spurred by the rapid proliferation of video content online and the urgent need for intelligent systems that can comprehend and interact with this content. Existing methodologies often lean towards video understanding and cross-modal information interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a groundbreaking framework emphasizing nuanced question understanding. Our approach begins by extracting object, appearance, and motion features from videos. Subsequently, we harness multi-layer outputs from a pre-trained language model, ensuring a thorough grasp of the question. The integration of object information into appearance features is guided by the global question and frame representations, facilitating the adaptive acquisition of appearance- and motion-enhanced question representations. By amalgamating multi-modal question insights, our methodology adeptly determines the answers to questions. Experimental results on three benchmarks demonstrate the superiority of our tailored approach, underscoring the importance of advanced question comprehension in VideoQA.

Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

  • Wenrui Li
  • Xi-Le Zhao
  • Zhengyu Ma
  • Xingtao Wang
  • Xiaopeng Fan
  • Yonghong Tian

Audio-visual zero-shot learning (ZSL) has attracted broad attention, as it can classify video data from classes that are not observed during training. However, most existing methods are restricted by background scene bias and capture fewer motion details, since they employ a single-stream network to process scene and motion information as a unified entity. In this paper, we address this challenge by proposing a novel dual-stream architecture, the Motion-Decoupled Spiking Transformer (MDFT), to explicitly decouple contextual semantic information from highly sparse dynamic motion information. Specifically, the Recurrent Joint Learning Unit (RJLU) extracts contextual semantic information effectively and understands the environment in which actions occur by capturing joint knowledge between different modalities. By converting RGB images to events, our approach effectively captures motion information while mitigating the influence of background scene biases, leading to more accurate classification results. We utilize the inherent strengths of Spiking Neural Networks (SNNs) to process highly sparse event data efficiently. Additionally, we introduce a Discrepancy Analysis Block (DAB) to model the audio motion features. To enhance the efficiency of SNNs in extracting dynamic temporal and motion information, we dynamically adjust the threshold of Leaky Integrate-and-Fire (LIF) neurons based on the statistical cues of global motion and contextual semantic information. Our experiments demonstrate the effectiveness of MDFT, which consistently outperforms state-of-the-art methods across mainstream benchmarks. Moreover, we find that motion information serves as a powerful regularizer for video networks: using it improves HM and ZSL accuracy by 19.1% and 38.4%, respectively.

Multimodal Color Recommendation in Vector Graphic Documents

  • Qianru Qiu
  • Xueting Wang
  • Mayu Otani

Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors that harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable to another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods in accuracy, color distribution, and user experience, as well as full palette generation methods in color diversity and similarity to the ground-truth palettes.
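
The following is a rough sketch of the masked-color-completion idea under stated assumptions (the real model uses CLIP text features and a more elaborate attention design; every name here is hypothetical): palette colors become tokens, one token is masked, and a decoder cross-attends to a text embedding to predict the missing color.

    import torch
    import torch.nn as nn

    d_model, n_colors = 64, 5
    color_embed = nn.Linear(3, d_model)                 # RGB -> token
    mask_token = nn.Parameter(torch.randn(1, 1, d_model))
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
    )
    to_rgb = nn.Linear(d_model, 3)

    palette = torch.rand(1, n_colors, 3)                # hypothetical input palette
    text_feat = torch.randn(1, 1, d_model)              # stand-in for a CLIP text embedding
    tokens = color_embed(palette)
    tokens[:, 2:3] = mask_token                         # mask the third color slot
    out = decoder(tgt=tokens, memory=text_feat)         # color tokens cross-attend to text
    pred_color = torch.sigmoid(to_rgb(out[:, 2]))       # predicted RGB for the masked slot
    print(pred_color)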

Open-Vocabulary Object Detection via Scene Graph Discovery

  • Hengcan Shi
  • Munawar Hayat
  • Jianfei Cai

In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, while these data usually contain much more information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. Firstly, a scene-graph-based decoder (SGDecoder) including sparse scene-graph-guided attention (SSGA) is presented. It captures scene graphs and leverages them to discover OV objects. Secondly, we propose scene-graph-based prediction (SGPred), where we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Thirdly, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show the ability of our model for OV scene graph detection, while previous OV scene graph generation methods cannot tackle this task.

Universal Domain Adaptive Network Embedding for Node Classification

  • Jushuo Chen
  • Feifei Dai
  • Xiaoyan Gu
  • Jiang Zhou
  • Bo Li
  • Weiping Wang

Cross-network node classification aims to leverage the abundant knowledge from a labeled source network to help classify the nodes in an unlabeled target network. However, existing methods assume that label sets are identical across domains, which is easily violated in practice. Hence, we attempt to integrate network embedding with universal domain adaptation, which transfers valuable knowledge across domains without assumptions on the label sets, to assist in node classification. Nonetheless, the complex network relationships between nodes increase the difficulty of this universal domain adaptive node classification task. In this work, we propose a novel Universal Domain Adaptive Network Embedding (UDANE) framework, which learns transferable node representations across networks to succeed in such a task. Technically, we first adopt the cross-network node embedding component to model comprehensive node information of both networks. Then we employ the inter-domain adaptive alignment component to exploit and relate knowledge across domains, learning domain-invariant representations for knowledge transfer. In addition, the intra-domain contrastive alignment component is proposed to learn discriminative representations beneficial for classification by sufficiently utilizing unlabeled data in the target domain. Extensive experiments have been conducted on real-world datasets, demonstrating that the proposed UDANE model outperforms the state-of-the-art baselines by a large margin.

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

  • Chenyu Yang
  • Mengxi Chen
  • Yanfeng Wang
  • Yu Wang

Audio-visual speaker diarization refers to the task of identifying "who spoke when" by using both audio and video data. Although previous fusion-based approaches have shown exceptional performance over audio-only methods, they have mainly focused on high-quality data and have not accounted for the impacts of acoustic noise or missing faces. To address these limitations, we propose a novel uncertainty-aware end-to-end audio-visual speaker diarization (UAV-SD) approach in this paper. Our approach leverages both framewise inter- and intra-modal confidence to achieve more effective and robust speaker diarization. By taking into account the uncertainty of the data, UAV-SD can achieve better diarization performance even in noisy or low-quality recordings. Additionally, our approach is compatible with multi-channel audio signals without the need to retrain the model, making it a more versatile solution. To evaluate the effectiveness of our approach, we conduct extensive experiments on the Multi-modal Information Based Speech Processing (MISP) 2022 Challenge datasets which consist of far-field audio and video data. The results show that UAV-SD is able to yield significant performance gains compared to baseline methods for both single and multi-channel data, demonstrating its effectiveness in real-world scenarios.

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

  • Tianyu Liu
  • Peng Zhang
  • Wei Huang
  • Yufei Zha
  • Tao You
  • Yanning Zhang

Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, insufficient attention to the influence of heterogeneity in the different modality features still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, the discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Substantial experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets have demonstrated superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.

HELIOS: Hyper-Relational Schema Modeling from Knowledge Graphs

  • Yuhuan Lu
  • Bangchao Deng
  • Weijian Yu
  • Dingqi Yang

Knowledge graph (KG) schema, which prescribes a high-level structure and semantics of a KG, is significantly helpful for KG completion and reasoning problems. Despite its usefulness, open-domain KGs do not practically have a unified and fixed schema. Existing approaches usually extract schema information using entity types from a KG, where each entity e can be associated with a set of types {T_e}, by either heuristically taking one type for each entity or exhaustively combining the types of all entities in a fact (to get entity-typed tuples such as (h_type, r, t_type)). However, these two approaches either overlook the role of the multiple types of a single entity across different facts or introduce non-negligible noise, as not all type combinations actually support the fact, thus failing to capture the sophisticated schema information. Against this background, we study the problem of modeling hyper-relational schema, which is formulated as mixed hyper-relational tuples ({T_h}, r, {T_t}, k, {T_v1}, ...) with two-fold hyper-relations: each type set T may contain multiple types, and each schema tuple may contain multiple key-type-set pairs (k, T_v). To address this problem, we propose HELIOS, a hyper-relational schema model designed to subtly learn from such hyper-relational schema tuples by capturing not only the correlation between the multiple types of a single entity, but also the correlation between the types of different entities and relations in a schema tuple. We evaluate HELIOS on three real-world KG datasets in different schema prediction tasks. Results show that HELIOS consistently outperforms state-of-the-art hyper-relational link prediction techniques by 20.0-29.7%, and is also much more robust than baselines in predicting types and relations across different positions in a hyper-relational schema tuple.
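
To make the tuple format concrete, here is a small illustrative data structure (not from the paper's code; field names are invented) for one hyper-relational schema tuple: type sets for the head and tail, a relation, and any number of key / value-type-set qualifier pairs.

    from dataclasses import dataclass, field

    @dataclass
    class SchemaTuple:
        head_types: frozenset[str]                     # {T_h}: possibly multiple types
        relation: str                                  # r
        tail_types: frozenset[str]                     # {T_t}
        qualifiers: list[tuple[str, frozenset[str]]] = field(default_factory=list)  # (k, {T_v}) pairs

    t = SchemaTuple(
        head_types=frozenset({"Person", "Politician"}),
        relation="educated_at",
        tail_types=frozenset({"University"}),
        qualifiers=[("academic_degree", frozenset({"Degree"}))],
    )
    print(t)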

Breaking the Barrier Between Pre-training and Fine-tuning: A Hybrid Prompting Model for Knowledge-Based VQA

  • Zhongfan Sun
  • Yongli Hu
  • Qingqing Gao
  • Huajie Jiang
  • Junbin Gao
  • Yanfeng Sun
  • Baocai Yin

Considerable performance gains have been achieved for knowledge-based visual question answering due to visual-language pre-training models with the pre-training-then-fine-tuning paradigm. However, because the targets of the pre-training and fine-tuning stages are different, there is an evident barrier that prevents the cross-modal comprehension ability developed in the pre-training stage from fully endowing the fine-tuning task. To break this barrier, in this paper, we propose a novel hybrid prompting model for knowledge-based VQA, which inherits and incorporates the pre-training and fine-tuning tasks with a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense retrieval manner. Additionally, a dynamic knowledge prompt is learned from the retrieved knowledge, which not only alleviates the length constraint on inputs for visual-language pre-trained models but also assists in providing answer features via fine-tuning. Combining and unifying the aims of the two stages can fully exploit the abilities of pre-training and fine-tuning to predict the answer. We evaluate the proposed model on the OKVQA dataset, and the results show that our model outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, which proves the benefits of the hybrid prompts and the advantages of unifying pre-training and fine-tuning.

OccluBEV: Occlusion Aware Spatiotemporal Modeling for Multi-view 3D Object Detection

  • Ziteng Wen
  • Hai Xu
  • Chenyu Liu
  • Tao Guo
  • Jinshui Hu
  • Xuming He
  • Fengren Wang
  • Shun Lou
  • Haibo Fan

Bird's-Eye-View (BEV) based 3D visual perception, which formulates a unified space for multi-view representation, has received wide attention in autonomous driving due to its scalability for downstream tasks. However, the view transform in transformer-based BEV methods is agnostic to 3D occlusion relationships, resulting in model degradation. To construct a higher-quality BEV space, this paper analyzes the mutual occlusion problems in the view transform process and proposes a new transformer-based method named OccluBEV. OccluBEV alleviates the occlusion issue via point cloud information distillation in both the image and BEV spaces. Specifically, in the image space, we perform depth estimation for each pixel and utilize it to guide image feature mapping. Further, since predicting depth directly from a monocular image is ill-posed and ignores stereo information such as multi-view and temporal cues, this paper introduces a voxel visibility segmentation task in the 3D BEV space. The task explicitly predicts whether each voxel in the 3D BEV grid is occupied or not. In addition, to alleviate the overfitting problem in BEV feature learning under a single task, we design a multi-head learning framework which jointly models multiple strongly correlated tasks in a unified BEV space. The effectiveness of the proposed method is fully validated on the nuScenes dataset, achieving a competitive NDS/mAP score of 57.5/47.9 on the nuScenes test leaderboard using a ResNet101 backbone, which is superior to state-of-the-art camera-based solutions.

SESSION: Poster Session III: Understanding Multimedia Content -- Vision and Language

Semantics-Enriched Cross-Modal Alignment for Complex-Query Video Moment Retrieval

  • Xingyu Shen
  • Xiang Zhang
  • Xun Yang
  • Yibing Zhan
  • Long Lan
  • Jianfeng Dong
  • Hongzhou Wu

Video moment retrieval (VMR) aims to search for a video segment that matches the search intent in a query sentence, and it has received increasing attention in recent years due to its practical value in various fields. Existing efforts devoted to this interesting yet challenging task typically encode the query sentence and video segments into unstructured global representations for cross-modal interaction and fusion, which may fail to accurately capture the search intent in complex queries with multi-granularity semantics.

To fill the research gap, this paper presents a novel solution termed semantics-enriched video moment retrieval method (SVMR), which can effectively and explicitly model the hierarchical multi-granularity semantics of complex textual query. Specifically, we first explore cross-token relations to offer multiple granularity query representations with hierarchical semantic contexts of semantically associated tokens for fine-grained cross-modal interaction and fusion, which contributes to mining rich visual motion cues semantically related to different activities and entities in complex queries. Furthermore, to fully leverage fine-grained cross-modal cues for moment retrieval, we design a specific temporal boundary reasoning module by explicitly generating start and end time-aware filter kernels with visual cues to perceive the moment boundaries. Extensive experiments and analyses on three public benchmarks clearly demonstrate the advantage of our proposed SVMR over existing state-of-the-art approaches, especially in retrieving complex query-based video moments.

NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer

  • Yun Liu
  • Zhongsheng Yan
  • Sixiang Chen
  • Tian Ye
  • Wenqi Ren
  • Erkang Chen

Nighttime image dehazing is a challenging task due to the presence of multiple types of adverse degrading effects, including glow, haze, blur, noise, color distortion, and so on. However, most previous studies mainly focus on daytime image dehazing or partial degradations presented in nighttime hazy scenes, which may lead to unsatisfactory restoration results. In this paper, we propose an end-to-end transformer-based framework for nighttime haze removal, called NightHazeFormer. Our proposed approach consists of two stages: supervised pre-training and semi-supervised fine-tuning. During the pre-training stage, we introduce two powerful priors into the transformer decoder to generate non-learnable prior queries, which guide the model to extract specific degradations. For fine-tuning, we combine the generated pseudo ground truths with the input real-world nighttime hazy images as paired images and feed them into the synthetic domain to fine-tune the pre-trained model. This semi-supervised fine-tuning paradigm helps improve the generalization to the real domain. In addition, we also propose a large-scale synthetic dataset called UNREAL-NH to simulate real-world nighttime haze scenarios comprehensively. Extensive experiments on several synthetic and real-world datasets demonstrate the superiority of our NightHazeFormer over state-of-the-art nighttime haze removal methods, both visually and quantitatively.

FSNet: Frequency Domain Guided Superpixel Segmentation Network for Complex Scenes

  • Hua Li
  • Junyan Liang
  • Wenjie Li
  • Wenhui Wu

Existing superpixel segmentation algorithms mainly focus on high-quality natural images, while neglecting the inevitable environmental constraints of complex scenes. In this paper, we propose an end-to-end frequency domain guided superpixel segmentation network (FSNet) that generates superpixels with sharp boundary adherence for complex scenes by fusing deep features in the spatial and frequency domains. To utilize the frequency domain information of the image, an improved frequency information extractor (IFIE) is proposed to extract frequency domain information with sharp boundary features. Moreover, considering that over-sharp features may damage the semantic information of superpixels, we further design a dense hybrid atrous convolution (DHAC) block to preserve semantic information by capturing wider and deeper semantic information in the spatial domain. Finally, the extracted deep features in the spatial and frequency domains are fused to generate semantically perceptual superpixels with sharp boundary adherence. Extensive experiments on multiple challenging datasets with complex boundaries demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively, and we further verify the superiority of the proposed method when applied to salient object detection.

Zero-Shot Learning by Harnessing Adversarial Samples

  • Zhi Chen
  • Pengfei Zhang
  • Jingjing Li
  • Sen Wang
  • Zi Huang

Zero-Shot Learning (ZSL) aims to recognize unseen classes by generalizing the knowledge, i.e., visual and semantic relationships, obtained from seen classes, where image augmentation techniques are commonly applied to improve the generalization ability of a model. However, this approach can also have adverse effects on ZSL, since conventional augmentation techniques that solely depend on single-label supervision are not able to maintain semantic information and consequently cause the semantic distortion issue. In other words, image augmentation may falsify the semantic (e.g., attribute) information of an image. To take advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). HAS advances ZSL through adversarial training which takes into account three crucial aspects: (1) robust generation, by enforcing augmentations to be similar to negative classes while maintaining correct labels; (2) reliable generation, by introducing a latent space constraint to avert significant deviations from the original data manifold; and (3) diverse generation, by incorporating attribute-based perturbation that adjusts images according to each semantic attribute's localization. Through comprehensive experiments on three prominent zero-shot benchmark datasets, we demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios. Our source code is available at https://github.com/uqzhichen/HASZSL.

Sequential Affinity Learning for Video Restoration

  • Tian Ye
  • Sixiang Chen
  • Yun Liu
  • Wenhao Chai
  • Jinbin Bai
  • Wenbin Zou
  • Yunchen Zhang
  • Mingchao Jiang
  • Erkang Chen
  • Chenghao Xue

Video restoration networks aim to restore high-quality frame sequences from degraded ones. However, traditional video restoration methods rely heavily on temporal modeling operators or optical flow estimation, which limits their versatility. The aim of this work is to present a novel approach for video restoration that eliminates inefficient temporal modeling operators and pixel-level feature alignment from the network architecture. The proposed method, the Sequential Affinity Learning Network (SALN), is designed based on an affinity mechanism that establishes direct correspondences between the query frame, the degraded sequence, and the restored frames in latent space. This unique perspective allows for more accurate and effective restoration of video content without relying on temporal modeling operators or optical flow estimation techniques. Moreover, we enhance the design of the channel-wise self-attention block to improve the decoder's performance for video restoration. Our method outperforms previous state-of-the-art methods by a significant margin in several classic video tasks, including video deraining, video dehazing, and video waterdrop removal, while demonstrating excellent efficiency. As a novel network that differs significantly from previous video restoration methods, SALN aims to provide innovative ideas and directions for video restoration. Our contributions include proposing a novel affinity-based approach for video restoration, enhancing the design of the channel-wise self-attention block, and achieving state-of-the-art performance on several classic video tasks.

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

  • Yiwei Ma
  • Xiaoshuai Sun
  • Jiayi Ji
  • Guannan Jiang
  • Weilin Zhuang
  • Rongrong Ji

Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these modalities continue to pose a challenge. Previous methods have attempted to align text and image samples in a modal-shared space, but they face uncertainties in optimization directions due to the movable features of both modalities and the failure to account for one-to-many relationships of image-text pairs in TPR datasets. To address this issue, we propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample, thus mitigating the optimization problem. Additionally, this embedding scheme generates multiple features for each sample without introducing trainable parameters, making it easier to align with several positive samples. Based on this paradigm, we propose a novel Bi-directional one-to-many Embedding Alignment (Beat) model to address the TPR task. Our experimental results demonstrate that the proposed Beat model achieves state-of-the-art performance on three popular TPR datasets, including CUHK-PEDES (65.61 R@1), ICFG-PEDES (58.25 R@1), and RSTPReID (48.10 R@1). Furthermore, additional experiments on MS-COCO, CUB, and Flowers datasets further demonstrate the potential of Beat to be applied to other image-text retrieval tasks.

Transformer-based Point Cloud Generation Network

  • Rui Xu
  • Le Hui
  • Yuehui Han
  • Jianjun Qian
  • Jin Xie

Point cloud generation is an important research topic in 3D computer vision, which can provide high-quality datasets for various downstream tasks. However, efficiently capturing the geometry of point clouds remains a challenging problem due to their irregularities. In this paper, we propose a novel transformer-based 3D point cloud generation network to generate realistic point clouds. Specifically, we first develop a transformer-based interpolation module that utilizes k-nearest neighbors at different scales to learn global and local information about point clouds in the feature space. Based on geometric information, we interpolate new point features to upsample the point cloud features. Then, the upsampled features are used to generate a coarse point cloud with spatial coordinate information. We construct a transformer-based refinement module to enhance the upsampled features in feature space with geometric information in coordinate space. Finally, we use a multi-layer perceptron on the upsampled features to generate the final point cloud. Extensive experiments on ShapeNet and ModelNet demonstrate the effectiveness of our proposed method.
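
As a small illustration of the interpolation idea (an assumption-laden sketch, not the paper's module), the snippet below upsamples point features by interpolating each new point's feature from its k nearest neighbours in the existing set, weighted by inverse distance.

    import torch

    def knn_interpolate(new_xyz, xyz, feats, k=3, eps=1e-8):
        """new_xyz: (M, 3) query points; xyz: (N, 3) known points; feats: (N, D) features."""
        dists = torch.cdist(new_xyz, xyz)                  # (M, N) pairwise distances
        knn_d, knn_idx = dists.topk(k, largest=False)      # k nearest neighbours per query
        weights = 1.0 / (knn_d + eps)                      # inverse-distance weights
        weights = weights / weights.sum(dim=1, keepdim=True)
        return (weights.unsqueeze(-1) * feats[knn_idx]).sum(dim=1)   # (M, D)

    xyz = torch.randn(256, 3)                              # sparse input points
    feats = torch.randn(256, 64)                           # their features
    new_xyz = torch.randn(1024, 3)                         # denser sampling locations
    print(knn_interpolate(new_xyz, xyz, feats).shape)      # torch.Size([1024, 64])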

Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks

  • Jun Guo
  • Xingyu Zheng
  • Aishan Liu
  • Siyuan Liang
  • Yisong Xiao
  • Yichao Wu
  • Xianglong Liu

Despite the broad application of Machine Learning as a Service (MLaaS), the deployed models are vulnerable to model stealing attacks. These attacks can replicate the model functionality through a black-box query process without any prior knowledge of the target victim model. Existing stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers. However, these defenses suffer from high inference computational overhead and unfavorable trade-offs between benign accuracy and stealing robustness, which challenges the feasibility of deployed models in practice. To address these problems, this paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses. Instead of deploying auxiliary defense modules that introduce redundant inference time, InI directly trains a defensive model by isolating the adversary's training gradient from the expected gradient, which can effectively reduce the inference computational cost. In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries, which can induce the adversary to extract little useful knowledge from victim models with minimal impact on the benign performance. Extensive experiments on several visual classification datasets (e.g., MNIST and CIFAR10) demonstrate the superior robustness (up to 48% reduction in stealing accuracy) and speed (up to 25.4× faster) of our InI over other state-of-the-art methods. Our code can be found at https://github.com/DIG-Beihang/InI-Model-Stealing-Defense.
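
As one plausible reading of the "induction" objective (uninformative outputs on stealing-style queries), here is a toy training loss; it is not the released InI code, and the proxy query data and uniform-distribution target are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def induction_loss(model, benign_x, benign_y, query_x, lam=1.0):
        """Benign cross-entropy plus a term pushing outputs on query-like inputs toward uniform."""
        benign_loss = F.cross_entropy(model(benign_x), benign_y)
        q_logp = F.log_softmax(model(query_x), dim=-1)
        uniform = torch.full_like(q_logp, 1.0 / q_logp.shape[-1])
        uninformative = F.kl_div(q_logp, uniform, reduction="batchmean")
        return benign_loss + lam * uninformative

    model = torch.nn.Linear(32, 10)                        # stand-in classifier
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    q = torch.randn(8, 32)                                 # proxy for attacker-style queries
    print(induction_loss(model, x, y, q).item())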

Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval

  • Daizong Liu
  • Xiaoye Qu
  • Jianfeng Dong
  • Guoshun Nan
  • Pan Zhou
  • Zichuan Xu
  • Lixing Chen
  • He Yan
  • Yu Cheng

This paper addresses the challenging task of language-driven moment retrieval. Previous methods are typically trained to localize the target moment corresponding to a single sentence query in a complicated video. However, this specific moment generally delivers richer content than the query, i.e., the semantics of one query may miss certain object details or actions in the complex foreground-background visual contents. Such information imbalance between the two modalities makes it difficult to finely align their representations. To this end, instead of training with a single query, we propose to utilize the diversity and complementarity among different queries corresponding to the same video moment to enrich the textual semantics. Specifically, we develop a Teacher-Student Moment Retrieval (TSMR) framework to fill this cross-modal information gap. A teacher model is trained to not only encode a certain query but also capture extra complementary queries to aggregate contextual semantics and obtain more comprehensive moment-related query representations. Since the additional queries are inaccessible during inference, we further introduce an adaptive knowledge distillation mechanism to train a student model with a single query input by selectively absorbing the knowledge from the teacher model. In this manner, the student model is more robust to the cross-modal information gap during moment retrieval guided by a single query. Experimental results on two benchmarks demonstrate the effectiveness of our proposed method.
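
A minimal sketch of the distillation component (hypothetical names; the paper's mechanism is adaptive and selective, which is not reproduced here): the single-query student is trained to match the multi-query teacher's soft moment scores in addition to the ground-truth supervision.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
        """student/teacher logits: (B, num_candidates); target: (B,) moment indices."""
        hard = F.cross_entropy(student_logits, target)
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        return alpha * soft + (1 - alpha) * hard

    s = torch.randn(4, 100, requires_grad=True)        # student moment scores
    t = torch.randn(4, 100)                            # teacher moment scores (frozen)
    y = torch.randint(0, 100, (4,))
    print(distillation_loss(s, t, y).item())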

Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

  • Zhibo Tian
  • Xiaolin Zhang
  • Peng Zhang
  • Kun Zhan

Semi-supervised semantic segmentation (SSS) is an important task that utilizes both labeled and unlabeled data to reduce expenses on labeling training examples. However, the effectiveness of SSS algorithms is limited by the difficulty of fully exploiting the potential of unlabeled data. To address this, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. By aligning positive pairs with a pixel-wise contrastive loss using strong augmented views in both low-level image space and high-level feature space, the proposed DSSN is designed to maximize the utilization of available unlabeled data. Additionally, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that do not perform selection or apply a predefined threshold for all classes. Specifically, our strategy selects the top high-confidence prediction of the weak view for each class to generate pseudo labels that supervise the strong augmented views. This strategy is capable of taking into account the class imbalance and improving the performance of long-tailed classes. Our proposed method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, outperforming other SSS algorithms by a significant margin. The source code is available at https://github.com/kunzhan/DSSN.
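
A hedged sketch of a class-aware pseudo-label selection step consistent with the description above (not the released code; the keep ratio and ignore index are assumptions): for each class, only the most confident pixels of the weak view's prediction become pseudo labels, and the rest are ignored.

    import torch

    def class_aware_pseudo_labels(weak_probs: torch.Tensor, keep_ratio: float = 0.5):
        """weak_probs: (C, H, W) softmax output of the weakly augmented view."""
        conf, labels = weak_probs.max(dim=0)           # per-pixel confidence and predicted class
        pseudo = torch.full_like(labels, 255)          # 255 = ignore index
        for c in labels.unique():
            mask = labels == c
            k = max(1, int(keep_ratio * mask.sum().item()))
            thresh = conf[mask].topk(k).values.min()   # class-specific confidence threshold
            pseudo[mask & (conf >= thresh)] = c
        return pseudo

    probs = torch.softmax(torch.randn(21, 64, 64), dim=0)    # hypothetical 21-class output
    print(class_aware_pseudo_labels(probs).shape)            # torch.Size([64, 64])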

Focusing on Flexible Masks: A Novel Framework for Panoptic Scene Graph Generation with Relation Constraints

  • Jiarui Yang
  • Chuan Wang
  • Zeming Liu
  • Jiahong Wu
  • Dongsheng Wang
  • Liang Yang
  • Xiaochun Cao

Panoptic Scene Graph Generation (PSG) presents pixel-wise instance detection and localization, leading to comprehensive and precise scene graphs. Current methods employ conventional Scene Graph Generation (SGG) frameworks to solve the PSG problem, neglecting the fundamental difference between bounding boxes and masks, i.e., bounding boxes are allowed to overlap but masks are not. Since segmentation from the panoptic head has deviations, non-overlapping masks may not provide complete instance information. Consequently, in the training phase, incompletely segmented instances may not be well aligned to annotated ones, causing mismatched relations and insufficient training. During the inference phase, incomplete segmentation leads to incomplete scene graph prediction. To alleviate these problems, we construct a novel two-stage framework for the PSG problem. In the training phase, we design a proposal matching strategy, which replaces deterministic segmentation results with proposals extracted from the off-the-shelf panoptic head for label alignment, thereby ensuring that all training samples are matched. In the inference phase, we present an innovative concept of employing relation predictions to constrain segmentation and design a relation-constrained segmentation algorithm. By reconstructing the process of generating segmentation results from proposals using predicted relation results, the algorithm recovers more valid instances and predicts more complete scene graphs. The experimental results show overall superiority, effectiveness, and robustness against adversarial attacks.

CCMB: A Large-scale Chinese Cross-modal Benchmark

  • Chunyu Xie
  • Heng Cai
  • Jincheng Li
  • Fanjing Kong
  • Xiaoyu Wu
  • Jianfei Song
  • Henrique Morimitsu
  • Lin Yao
  • Dexin Wang
  • Xiangzheng Zhang
  • Dawei Leng
  • Baochang Zhang
  • Xiangyang Ji
  • Yafeng Deng

Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. In contrast to the plentiful available benchmarks with English corpora, large-scale pre-training datasets and downstream datasets with Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains Zero, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with the CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks, including image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2

CPLFormer: Cross-scale Prototype Learning Transformer for Image Snow Removal

  • Sixiang Chen
  • Tian Ye
  • Yun Liu
  • Jinbin Bai
  • Haoyu Chen
  • Yunlong Lin
  • Jun Shi
  • Erkang Chen

Removing snow from a single image poses a significant challenge in the image restoration domain, as snowfall's effects appear at various scales and in various forms. Existing methods have tried to tackle this issue with multi-scale approaches, but their reliance on targeted designs for handling each single-scale feature has resulted in unsatisfactory performance. This is primarily due to a lack of cross-scale knowledge, making it difficult to handle degradations effectively. To this end, we propose a novel approach, CPLFormer, which uses snow prototypes to acquire a comprehensive understanding of the clean scene by learning from cross-scale features, outperforming convolutional networks and vanilla transformer-based solutions. CPLFormer has several advantages: firstly, learnable snow prototypes learn global context information from multiple scales to uncover hidden clean cues; secondly, the prototypes can propagate cross-scale information to each patch through cross-attention to assist clean patch reconstruction; thirdly, CPLFormer surpasses advanced state-of-the-art desnowing networks and prevalent universal image restoration transformers on six synthetic and real-world benchmarks.

Video Entailment via Reaching a Structure-Aware Cross-modal Consensus

  • Xuan Yao
  • Junyu Gao
  • Mengyuan Chen
  • Changsheng Xu

This paper targets the task of video entailment, which aims to achieve thorough comprehension and infer whether a natural language statement entails or contradicts a given multi-modal video. Despite recent progress, most existing methods focus on designing a vision-language encoder for multi-modal feature extraction in video entailment, ignoring the underlying consensus knowledge between the two modalities and thus hindering reasoning performance. As human beings, we make sense of the world by synthesizing information from different sense perceptions, which allows us to acquire consensus among multiple modalities to form a more thorough and coherent representation of our surroundings, as well as to perform complicated understanding tasks. In this paper, we attempt to recreate this ability to infer the truthfulness of a given statement in the context of video entailment. To this end, we propose a unified structure-aware cross-modal consensus method to excavate the consensus semantics shared between the video and language modalities, which are then incorporated into video entailment as statement-related clues. Specifically, the consensus information is obtained by filtering away redundant information, utilizing the global information from one modality and the local complementary information from the other. Moreover, a consensus-guided graph reasoning method is designed to explore inter-modality consistency and emphasize the significant features related to the judged statement, generating the inference results. Extensive experiments on two benchmarks demonstrate the accurate and robust performance of our approach compared to state-of-the-art methods. Code is available at https://github.com/Feliciaxyao/MM2023-SACCN.

Cerebrovascular Segmentation in TOF-MRA with Topology Regularization Adversarial Model

  • Cheng Chen
  • Yunqing Chen
  • Shuang Song
  • Jianan Wang
  • Huansheng Ning
  • Ruoxiu Xiao

Time-of-flight magnetic resonance angiography (TOF-MRA) is a common cerebrovascular imaging modality. Accurate and automatic cerebrovascular segmentation in TOF-MRA images is an important auxiliary method in clinical practice. Due to complex semantics and noise interference, existing segmentation methods often fail to pay attention to topological correlation, resulting in the neglect of branch vessels and the destruction of vascular topology. In this paper, we propose a topology regularization adversarial model for cerebrovascular segmentation in TOF-MRA images. First, we train a self-supervised model to learn the spatial semantic layout of TOF-MRA images through image context restoration. Subsequently, we use this self-supervised model for initialization and construct an adversarial model to accomplish parameter optimization. Considering the limitations of the uneven distribution of cerebrovascular classes, we introduce skeleton structures as discriminative features to enhance vessel topological strength. We implemented several recent models as baselines and evaluated our method on two datasets. Results show that the proposed model attains the highest score. Therefore, our method can obtain accurate connectivity information and higher graph similarity, leading to more meaningful clinical utility.

Hierarchical Reasoning Network with Contrastive Learning for Few-Shot Human-Object Interaction Recognition

  • Jiale Yu
  • Baopeng Zhang
  • Qirui Li
  • Haoyang Chen
  • Zhu Teng

Few-shot learning (FSL) for human-object interaction aims to classify samples of new, unseen HOI classes with only a few labeled samples available. Although progress has been made in few-shot human-object interaction, most existing methods encounter two issues in handling fine-grained interactions: the inability to capture subtle interactive clues and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, we propose a hierarchical reasoning network to integrate multi-level interactive clues (from coarse to fine-grained) for strengthening HOI representations. The hierarchical relation module captures and aggregates discriminative relation information among human parts at multiple levels (including the human instance, action region, and body part levels) and objects via a unified graph, and exploits a language-guided attentive fusion scheme to highlight informative features of each interaction level. To address the second issue, we introduce a contrastive learning mechanism to mitigate the low inter-class variance. Compared with previous ProtoNet-based methods, our model generates more discriminative representations for low inter-class variance data, since it makes full use of potential contrastive pairs in each training episode. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art FS-HOI methods.

Uncertainty-Driven Dynamic Degradation Perceiving and Background Modeling for Efficient Single Image Desnowing

  • Sixiang Chen
  • Tian Ye
  • Chenghao Xue
  • Haoyu Chen
  • Yun Liu
  • Erkang Chen
  • Lei Zhu

Single-image snow removal aims to restore clean images from heterogeneous and irregular snow degradations. Recent methods utilize neural networks to remove various degradations directly. However, these approaches suffer from the limited ability to flexibly perceive complicated snow degradation patterns and insufficient representation of background structure information. To further improve the performance and generalization ability of snow removal, this paper aims to develop a novel and efficient paradigm from the perspective of degradation perceiving and background modeling.

For this purpose, we first analyze two critical properties in real snow images, namely local-region heterogeneity and axial anisotropy. Inspired by them, we propose Dynamic Perceiving for Degraded Regions and Axial-Pooling Attention for Background Structure Modeling, which together constitute a new network architecture, dubbed D2P-BMNet. Our proposed D2P-BMNet offers several key advantages: (i) It can effectively segment regions under the guidance of an uncertainty map and dynamically perceive heterogeneous degradations within various regions. (ii) By utilizing linear attention solely along the horizontal axis, it can effectively model the clean scene information buried beneath the snow. (iii) D2P-BMNet significantly improves over prior methods across all benchmarks while maintaining excellent inference speed.
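
To make the axial idea in (ii) concrete, the sketch below restricts attention to the horizontal (width) axis of a feature map so each row attends only within itself; for brevity it uses standard softmax attention per row rather than the linear attention described above, and all names and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class HorizontalAxialAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). Every row attends only over W positions, so the
        # attention cost per row scales with W instead of H*W for the full map.
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)    # (B*H, W, C)
        out, _ = self.attn(rows, rows, rows)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)   # back to (B, C, H, W)

if __name__ == "__main__":
    feat = torch.randn(1, 32, 16, 24)
    print(HorizontalAxialAttention(32)(feat).shape)  # torch.Size([1, 32, 16, 24])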

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

  • Chenpeng Du
  • Qi Chen
  • Tianyu He
  • Xu Tan
  • Xie Chen
  • Kai Yu
  • Sheng Zhao
  • Jiang Bian

While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM-based image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. During inference, DAE-Talker first predicts the latents from speech and then generates the video frames with the image decoder in DAE from the predicted latents. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM-based image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.

Spatio-Temporal Branching for Motion Prediction using Motion Increments

  • Jiexin Wang
  • Yujie Zhou
  • Wenwen Qiang
  • Ying Ba
  • Bing Su
  • Ji-Rong Wen

Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved success by learning spatio-temporal representations of motion, but these models often overlook the reliability of motion data. Additionally, the temporal and spatial dependencies of skeleton nodes are distinct. The temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between different nodes. In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our approach effectively reduces noise interference and provides more expressive information for characterizing motion by separately extracting temporal and spatial features. We evaluate our approach on standard HMP benchmarks and outperform state-of-the-art methods in terms of prediction accuracy. Code is available at https://github.com/JasonWang959/STPMP.

Generative Neutral Features-Disentangled Learning for Facial Expression Recognition

  • Zhenqian Wu
  • Yazhou Ren
  • Xiaorong Pu
  • Zhifeng Hao
  • Lifang He

Facial expression recognition (FER) plays a critical role in human-computer interaction and affective computing. Traditional FER methods typically rely on comparing the difference between an examined facial expression and a neutral face of the same person to extract the motion of facial features and filter out expression-irrelevant information. With the extensive use of deep learning, the performance of FER has been further improved. However, existing deep learning-based methods rarely utilize neutral faces. To address this gap, we propose a novel deep learning-based FER method called Generative Neutral Features-Disentangled Learning (GNDL), which draws inspiration from the facial feature manifold. Our approach integrates a neutral feature generator (NFG) that generates neutral features in scenarios where the neutral face of the same subject is not available. The NFG uses fine-grained features from examined images as input and produces corresponding neutral features with the same identity. We train the NFG using a neutral feature reconstruction loss to ensure that the generative neutral features are consistent with the actual neutral features. We then disentangle the generative neutral features from the examined features to remove disturbance features and generate an expression deviation embedding for classification. Extensive experimental results on three popular databases (CK+, Oulu-CASIA, and MMI) demonstrate that our proposed GNDL method outperforms state-of-the-art FER methods.

Deep Algorithm Unrolling with Registration Embedding for Pansharpening

  • Tingting Wang
  • Yongxu Ye
  • Faming Fang
  • Guixu Zhang
  • Ming Xu

Pansharpening aims to sharpen low-resolution (LR) multispectral (MS) images with the help of corresponding high-resolution (HR) panchromatic (PAN) images to obtain HRMS images. Model-based pansharpening methods manually design objective functions via an observation model and hand-crafted priors. However, inevitable performance degradation may occur when the prior is invalid. Although many deep learning-based end-to-end pansharpening methods have been proposed recently, they still need to be improved due to insufficient study of HRMS-related domain knowledge. Besides, existing pansharpening methods rarely consider the misalignments between MS and PAN images, leading to poor performance. To tackle these issues, this paper proposes to unroll the observation model with registration embedding for pansharpening. Inspired by optical flow estimation, we embed the registration operation into the observation model to reconstruct the pansharpening function with the help of a deep prior of HRMS images, and then unroll the iterative solution into a novel deep convolutional network. Apart from the single HRMS supervision, we also introduce a consistency loss to supervise the two degradation processes. The use of the consistency loss enables the degradation sub-networks to learn more realistic degradations. Experimental results at reduced resolution and full resolution are reported to demonstrate the superiority of the proposed method over other state-of-the-art pansharpening methods. On the GaoFen-2 dataset, our method achieves 1.2 dB higher PSNR than state-of-the-art techniques.

DAOT: Domain-Agnostically Aligned Optimal Transport for Domain-Adaptive Crowd Counting

  • Huilin Zhu
  • Jingling Yuan
  • Xian Zhong
  • Zhengwei Yang
  • Zheng Wang
  • Shengfeng He

Domain adaptation is commonly employed in crowd counting to bridge the domain gaps between different datasets. However, existing domain adaptation methods tend to focus on inter-dataset differences while overlooking the intra-differences within the same dataset, leading to additional learning ambiguities. These domain-agnostic factors, e.g., density, surveillance perspective, and scale, can cause significant in-domain variations, and the misalignment of these factors across domains can lead to a drop in performance in cross-domain crowd counting. To address this issue, we propose a Domain-agnostically Aligned Optimal Transport (DAOT) strategy that aligns domain-agnostic factors between domains. DAOT consists of three steps. First, individual-level differences in domain-agnostic factors are measured using structural similarity (SSIM). Second, the optimal transport (OT) strategy is employed to smooth out these differences and find the optimal domain-to-domain alignment, with outlier individuals removed via a virtual "dustbin" column. Third, knowledge is transferred based on the aligned domain-agnostic factors, and the model is retrained for domain adaptation to bridge the gap across domains. We conduct extensive experiments on five standard crowd-counting benchmarks and demonstrate that the proposed method has strong generalizability across diverse datasets. Our code will be available at: https://github.com/HopooLinZ/DAOT/.
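
A minimal sketch of the second step, assuming entropic-regularized Sinkhorn iterations over a dissimilarity matrix with an appended dustbin column for outliers; the cost construction, marginals, and hyperparameters are placeholders rather than the released DAOT code.

import numpy as np

def sinkhorn_with_dustbin(cost, reg=0.1, dustbin_cost=1.0, n_iters=200):
    """cost: (n, m) dissimilarity between source and target individuals."""
    n, m = cost.shape
    # Extra column absorbs source individuals that have no good match.
    full_cost = np.concatenate([cost, np.full((n, 1), dustbin_cost)], axis=1)
    K = np.exp(-full_cost / reg)                  # Gibbs kernel
    a = np.full(n, 1.0 / n)                       # uniform source marginal
    b = np.full(m + 1, 1.0 / (m + 1))             # target marginal incl. dustbin
    u, v = np.ones(n), np.ones(m + 1)
    for _ in range(n_iters):                      # Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = np.diag(u) @ K @ np.diag(v)            # (n, m+1) transport plan
    return plan[:, :m], plan[:, m]                # matches, dustbin mass

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    similarity = rng.uniform(size=(5, 4))         # stand-in for SSIM scores
    plan, outliers = sinkhorn_with_dustbin(1.0 - similarity)
    print(plan.shape, outliers.shape)             # (5, 4) (5,)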

Partial Annotation-based Video Moment Retrieval via Iterative Learning

  • Wei Ji
  • Renjie Liang
  • Lizi Liao
  • Hao Fei
  • Fuli Feng

Given a descriptive language query, Video Moment Retrieval (VMR) aims to find the corresponding semantically consistent moment clip in the video, represented as a pair of start and end timestamps. Although current methods have achieved satisfactory performance, training these models heavily relies on fully annotated VMR datasets. Nonetheless, precise video temporal annotations are extremely labor-intensive and ambiguous due to the diverse preferences of different annotators.

Although several works try to explore weakly supervised VMR with scattered annotated frames as labels, there is still much room for improvement in terms of accuracy. Therefore, we design a new VMR setting where users can easily point to small segments of non-controversial video moments, and our proposed method automatically fills in the remaining parts based on the video and query semantics. To support this, we propose a new framework named Video Moment Retrieval via Iterative Learning (VMRIL). It treats the partial temporal region as a seed and then expands the pseudo label through iterative training. In order to restrict the expansion to reasonable boundaries, we utilize a pretrained video action localization model to provide coarse guidance on potential video segments. Compared with other VMR methods, our VMRIL achieves a good trade-off between performance and annotation efficiency. Experimental results show that our proposed method achieves state-of-the-art performance in the weakly supervised VMR setting and is even comparable with some fully supervised VMR methods at a much lower annotation cost.

Style Transfer Meets Super-Resolution: Advancing Unpaired Infrared-to-Visible Image Translation with Detail Enhancement

  • Yirui Shen
  • Jingxuan Kang
  • Shuang Li
  • Zhenjie Yu
  • Shuigen Wang

The problem of unpaired infrared-to-visible image translation has gained significant attention due to its ability to generate visible images with color information from low-detail grayscale infrared inputs. However, current methodologies often depend on conventional style transfer techniques, which constrain the spatial resolution of the visible output to be equal to that of the input infrared image. This fixed generation pattern yields blurry results when translating low-resolution infrared inputs, and using high-resolution infrared inputs as a workaround requires greater computational resources. This spurs us to investigate the challenging unpaired image translation from low-resolution infrared inputs to high-resolution visible outputs, with the ultimate goal of enhancing image details while reducing computational costs. Therefore, we propose a unified framework that integrates the super-resolution process into our unpaired infrared-to-visible image transfer, yielding realistic and high-resolution results. Specifically, we propose the Detail Consistency Loss to establish a connection between the two aforementioned modules, thereby enhancing the quality of visual detail in style transfer results through the super-resolution module. Furthermore, our Texture Perceptual Loss is designed to ensure that the generator produces high-quality visual details accurately and reliably. Experimental results indicate that our method outperforms other comparative approaches when utilizing low-resolution infrared inputs. Remarkably, our approach even surpasses techniques that use high-resolution infrared inputs to generate visible images. Last but not least, we propose a new and challenging dataset, dubbed InfraredCity-HD, which comprises 512×512-resolution images, to advance research on high-resolution infrared-related fields.

Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes

  • Chongyang Zhao
  • Yuankai Qi
  • Qi Wu

Vision-and-Language Navigation (VLN) aims to navigate to a target location by following a given instruction. Unlike existing methods that focus on predicting a more accurate action at each navigation step, in this paper we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. A high OSR indicates that the robot agent passes the target location, while a lower SR suggests that the agent ultimately fails to stop at the target location. Instead of predicting actions directly, we propose to mine the target location from a trajectory given by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model for learning compact, discriminative trajectory viewpoint representations, which are used to predict the confidence of a viewpoint being the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets: R2R, REVERIE and NDH, and shows promising results, demonstrating the potential for future research.

Feature-Suppressed Contrast for Self-Supervised Food Pre-training

  • Xinda Liu
  • Yaohui Zhu
  • Linhu Liu
  • Jiang Tian
  • Lili Wang

Most previous approaches for analyzing food images have relied on extensively annotated datasets, resulting in significant human labeling expenses due to the varied and intricate nature of such images. Inspired by the effectiveness of contrastive self-supervised methods in utilizing unlabelled data, we explore leveraging these techniques on unlabelled food images. In contrastive self-supervised methods, two views are randomly generated from an image by data augmentations. However, for food images, the two views tend to contain similar informative contents, causing large mutual information, which impedes the efficacy of contrastive self-supervised learning. To address this problem, we propose Feature Suppressed Contrast (FeaSC) to reduce the mutual information between views. As the similar contents of the two views are salient or highly responsive in the feature map, the proposed FeaSC uses a response-aware scheme to localize salient features in an unsupervised manner. By suppressing some salient features in one view while leaving the other contrast view unchanged, the mutual information between the two views is reduced, thereby enhancing the effectiveness of contrastive learning for self-supervised food pre-training. As a plug-and-play module, the proposed method consistently improves BYOL and SimSiam by 1.70% ~ 6.69% classification accuracy on four publicly available food recognition datasets. Superior results have also been achieved on downstream segmentation tasks, demonstrating the effectiveness of the proposed method.
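
The snippet below is one way to read the suppression step: locate the most responsive spatial positions of one view's feature map and zero them out before the projection head, leaving the other view untouched. The saliency measure, ratio, and names are assumptions for illustration only.

import torch

def suppress_salient(feat: torch.Tensor, ratio: float = 0.2) -> torch.Tensor:
    """feat: (B, C, H, W) backbone features of one augmented view."""
    b, c, h, w = feat.shape
    response = feat.abs().mean(dim=1).flatten(1)        # (B, H*W) saliency proxy
    k = max(1, int(ratio * h * w))
    idx = response.topk(k, dim=1).indices               # most responsive positions
    mask = torch.ones(b, h * w, device=feat.device)
    mask.scatter_(1, idx, 0.0)                          # zero-out salient positions
    return feat * mask.view(b, 1, h, w)

if __name__ == "__main__":
    view1 = torch.randn(4, 64, 7, 7)
    print(suppress_salient(view1).shape)                # torch.Size([4, 64, 7, 7])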

Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection

  • Yuchen Zhou
  • Guang Tan
  • Mengtang Li
  • Chao Gou

Human-object interaction (HOI) detection aims to interpret the interactions of human-object pairs. Existing methods adopt a one-step reasoning paradigm that simultaneously outputs multi-label results for all HOI pairs without distinguishing their difficulty. However, there are significant variations among HOI pairs in the same image, which degrades their performance in challenging situations. In this paper, we argue that the model should prioritize hard samples after inferring easy ones, and that hard samples can benefit from easy ones. To this end, we propose a novel Multi-step Reasoning Network that progressively learns from easy to hard samples. In particular, an Easy-to-Hard Learning Block is introduced to enhance the representation of hard HOI pairs via prior associations. Additionally, we propose a Multi-step Reasoning Probability Transfer mechanism to enhance multi-label interaction classification, which leverages cognitive associations and semantic dependencies. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods on two challenging benchmark datasets.

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

  • Chengyang Fang
  • Jiangnan Li
  • Liang Li
  • Can Ma
  • Dayong Hu

Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embeddings to construct the spatial relations between OCR tokens sequentially, which is not reasonable: a 1-D position embedding can only represent the left-right sequential relationship between words in a sentence, not their complex spatial position relationships. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs a spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets. Compared with the state-of-the-art method pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvements on TextVQA and ST-VQA. Our code and models will be released at https://github.com/fangbufang/SaL.

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

  • Yunshi Lan
  • Xiang Li
  • Xin Liu
  • Yang Li
  • Wei Qin
  • Weining Qian

Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capability of systems in the absence of training data. Recently, by converting images into captions, information across modalities is bridged and Large Language Models (LLMs) can apply their strong zero-shot generalization capability to unseen questions. To design ideal prompts for solving VQA via LLMs, several studies have explored different strategies to select or generate question-answer pairs as exemplar prompts, which guide LLMs to answer the current questions effectively. However, they totally ignore the role of question prompts. The original questions in VQA tasks often contain ellipses and ambiguity that require intermediate reasoning. To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. Specifically, for each question, we first generate self-contained questions as reasoning question prompts via an unsupervised question edition module considering sentence fluency, semantic integrity and syntactic invariance. Each reasoning question prompt clearly indicates the intent of the original question. This results in a set of candidate answers. Then, the candidate answers, associated with their confidence scores acting as answer heuristics, are fed into LLMs to produce the final answer. We evaluate reasoning question prompts on three VQA challenges; experimental results demonstrate that they can significantly improve the results of LLMs in the zero-shot setting and outperform existing state-of-the-art zero-shot methods on three out of four datasets. Our source code is publicly released at https://github.com/ECNU-DASE-NLP/RQP.

Adaptive Decoupled Pose Knowledge Distillation

  • Jie Xu
  • Shanshan Zhang
  • Jian Yang

Existing state-of-the-art human pose estimation approaches require heavy computational resources for accurate prediction. One promising technique for obtaining an accurate yet lightweight pose estimator is Knowledge Distillation (KD), which distills the pose knowledge from a powerful teacher model to a lightweight student model. However, existing human pose KD methods focus more on designing paired student and teacher network architectures and ignore the mechanism of pose knowledge distillation. In this work, we reformulate human pose KD as a coarse-to-fine process and decouple the classical KD loss into three terms: Binary Keypoint vs. Non-Keypoint Distillation (BiKD), Keypoint Area Distillation (KAD) and Non-keypoint Area Distillation (NAD). Based on the decoupled formulation, we point out an important limitation of classical pose KD, i.e., the bias between different loss terms limits the performance gain of the student network. To address this biased knowledge distillation problem, we present a novel KD method named Adaptive Decoupled Pose knowledge Distillation (ADPD), enabling BiKD, KAD and NAD to play their roles more effectively and flexibly. Extensive experiments on two standard human pose datasets, MPII and MS COCO, demonstrate that our proposed method outperforms previous KD methods and generalizes to different teacher-student pairs. The code will be available at https://github.com/SuperJay1996/ADPD.
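
As an illustration of the decomposition, the sketch below splits a heatmap distillation loss into a binary keypoint/non-keypoint term and two within-area terms; the keypoint-area rule, temperature, and weights are placeholder assumptions, not the paper's adaptive formulation.

import torch

def _kl(p, q):
    # KL(p || q), summed over the last dimension; exact zeros in p contribute 0.
    return (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1).mean()

def decoupled_pose_kd(student, teacher, tau=4.0, peak_ratio=0.5,
                      w_bi=1.0, w_kad=1.0, w_nad=1.0):
    """student, teacher: (B, K, H, W) heatmap logits from the two networks."""
    s = (student / tau).flatten(2).softmax(-1)           # (B, K, H*W)
    t = (teacher / tau).flatten(2).softmax(-1)
    raw_t = teacher.flatten(2)
    # Keypoint area: pixels whose teacher response exceeds a fraction of its peak.
    area = (raw_t > peak_ratio * raw_t.max(-1, keepdim=True).values).float()

    # Binary term: match the total probability mass of keypoint vs. non-keypoint area.
    bt = torch.stack([(t * area).sum(-1), (t * (1 - area)).sum(-1)], -1)
    bs = torch.stack([(s * area).sum(-1), (s * (1 - area)).sum(-1)], -1)

    def masked(p, m):
        # Renormalized distribution restricted to one region.
        return (p * m) / (p * m).sum(-1, keepdim=True).clamp_min(1e-8)

    kad = _kl(masked(t, area), masked(s, area))          # keypoint-area term
    nad = _kl(masked(t, 1 - area), masked(s, 1 - area))  # non-keypoint-area term
    return w_bi * _kl(bt, bs) + w_kad * kad + w_nad * nad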

Biased-Predicate Annotation Identification via Unbiased Visual Predicate Representation

  • Li Li
  • Chenwei Wang
  • You Qin
  • Wei Ji
  • Renjie Liang

Panoptic Scene Graph Generation (PSG) translates visual scenes into structured linguistic descriptions, i.e., mapping visual instances to subjects/objects and their relationships to predicates. However, annotators' preferences and semantic overlaps between predicates inevitably lead to the semantic mapping of multiple predicates to one relationship, i.e., biased-predicate annotations. As a result, with this contradictory mapping between vision and language, PSG models struggle to construct clear decision planes among predicates, which causes the existing poor performance. It is therefore essential for the PSG task to tackle this multi-modal contradiction. To this end, we propose a novel method that utilizes unbiased visual predicate representations for Biased-Annotation Identification (BAI) as a fundamental step for PSG/SGG tasks. Our BAI includes three main steps: predicate representation extraction, predicate representation debiasing, and biased-annotation identification. With flexible biased-annotation processing methods, our BAI can act as a fundamental step of dataset debiasing. Experimental results demonstrate that our proposed BAI achieves state-of-the-art performance, promoting the performance of benchmark models to various degrees with ingenious biased-annotation processing methods. Furthermore, our BAI shows great generalization and effectiveness on multiple datasets. Our code is released at https://github.com/lili0415/BAI.

Zero-Shot Object Detection by Semantics-Aware DETR with Adaptive Contrastive Loss

  • Huan Liu
  • Lu Zhang
  • Jihong Guan
  • Shuigeng Zhou

Zero-shot object detection (ZSD) aims to localize and recognize unseen objects in unconstrained images by leveraging semantic descriptions. Existing ZSD methods typically suffer from two drawbacks: 1) Due to the lack of data on unseen categories during the training phase, the model inevitably has a bias towards the seen categories, i.e., it prefers to subsume objects of unseen categories into seen categories; 2) It is usually very tricky for the feature extractor trained on data of seen categories to learn discriminative features that are good enough to help the model transfer the knowledge learned from seen categories to unseen categories. To tackle these problems, this paper proposes a novel zero-shot detection method based on a semantics-aware DETR and a class-wise adaptive contrastive loss. Concretely, to address the first problem, we develop a novel semantics-aware attention mechanism to mitigate the bias towards seen categories and integrate it into DETR, which results in a new end-to-end zero-shot object detection approach. Furthermore, to handle the second problem, a novel class-wise adaptive contrastive loss is proposed, which considers the relevance between each pair of categories according to their semantic descriptions in order to learn separable features for better visual-semantic alignment. Extensive experiments and ablation studies on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.

Rethinking Missing Modality Learning from a Decoding Perspective

  • Tao Jin
  • Xize Cheng
  • Linjun Li
  • Wang Lin
  • Ye Wang
  • Zhou Zhao

The conventional pipeline of multimodal learning consists of three stages: encoding, fusion, and decoding. Most existing methods for the missing modality condition focus on the first stage and aim to learn modality-invariant representations or reconstruct missing features. However, these methods rely on strong assumptions (i.e., all the pre-defined modalities are available for each input sample during training and the number of modalities is fixed). To solve this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions occur in both training and inference phases, even including unseen testing conditions. Different from previous methods, we improve the decoding stage. Concretely, IPD jointly learns the common and modality-specific task prototypes. Considering that the number of missing modality conditions scales exponentially with the number of modalities, i.e., O(2^n), and that different conditions may have implicit interactions, a low-rank partial prototype decomposition with thorough theoretical analysis is employed for the modality-specific components to reduce the complexity. The decomposition can also promote generalization to unseen conditions with the modality factors of existing conditions. To simulate the low-rank setup, we further constrain the explicit interaction of specific modality conditions by employing disentangled contrastive constraints. Extensive results on the newly created benchmarks of multiple tasks illustrate the effectiveness of our proposed model.

Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

  • Zhijin Ge
  • Fanhua Shang
  • Hongying Liu
  • Yuanyuan Liu
  • Liang Wan
  • Wei Feng
  • Xiaosen Wang

Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations to clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, among which input transformation is one of the most effective. In this work, we notice that existing input transformation-based works mainly adopt transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve transferability using data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix the generated images, with random noise added, with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve adversarial transferability on both normally trained and adversarially trained models compared with state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhijin-Ge/STM.
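
A hedged sketch of the input transformation described above: mix a stylized copy of the image with the original plus random noise, then average the input gradient over several such mixtures. Here style_net stands in for the arbitrary style transfer network, and the mixing weight, noise scale, and copy count are assumptions.

import torch

def stylize_and_mix(x, style_net, mix=0.5, noise_std=8 / 255):
    """x: (B, 3, H, W) images in [0, 1]; returns the augmented attack input."""
    with torch.no_grad():
        x_styled = style_net(x)                       # shift low-level statistics
    noise = torch.randn_like(x) * noise_std
    return (mix * x + (1 - mix) * x_styled + noise).clamp(0, 1)

def stm_gradient(x, y, model, style_net, loss_fn, m=5):
    # Average the input gradient over m stylized mixtures of the same image.
    x = x.clone().detach().requires_grad_(True)
    grad = torch.zeros_like(x)
    for _ in range(m):
        loss = loss_fn(model(stylize_and_mix(x, style_net)), y)
        grad += torch.autograd.grad(loss, x)[0]
    return grad / m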

Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement

  • Xin Wang
  • Zihao Wu
  • Hong Chen
  • Xiaohan Lan
  • Wenwu Zhu

Video Grounding (VG) has drawn widespread attention over the past few years, and numerous studies have been devoted to improving performance on various VG benchmarks. Nevertheless, the label annotation procedures in VG produce imbalanced query-moment-label distributions in the datasets, which severely deteriorate the learning model's capability of truly understanding the video content. Existing works on debiased VG either focus on adjusting the learning model or conduct video-level augmentation, failing to handle the temporal bias issue caused by imbalanced query-moment-label distributions. In this paper, we propose a Disentangled Feature Mixup (DFM) framework for debiased VG, which is capable of performing unbiased grounding to tackle the temporal bias issue. Specifically, a feature-mixup augmentation strategy is designed to generate new (text, location) pairs with diverse temporal distributions by jointly augmenting the representations of text queries and the location labels. This strategy encourages making predictions based on more diverse data samples with balanced query-moment-label distributions. Furthermore, we also design a content-location disentanglement module to disentangle the representations of temporal information and content information in videos, which removes the spurious effect of temporal biases on video representations. Given that our proposed DFM framework conducts feature-level augmentation and disentanglement, it is model-agnostic and can be applied to most baselines simply yet effectively. Extensive experiments show that our proposed DFM framework significantly outperforms baseline models on various metrics under both independent and identically distributed (i.i.d.) and out-of-distribution (o.o.d.) settings, especially in scenarios with annotation distribution changes.
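
A minimal, assumed illustration of the feature-mixup step: interpolate two text-query features and their normalized (start, end) location labels with the same Beta-sampled coefficient; all names and the linear interpolation of labels are illustrative choices, not the released DFM code.

import torch

def mixup_query_location(q1, q2, loc1, loc2, alpha=1.0):
    """q*: (B, D) query features; loc*: (B, 2) normalized (start, end) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample((q1.size(0), 1)).to(q1.device)
    q_mix = lam * q1 + (1 - lam) * q2          # mixed text-query representation
    loc_mix = lam * loc1 + (1 - lam) * loc2    # mixed temporal location label
    return q_mix, loc_mix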

Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval

  • Yaya Shi
  • Haowei Liu
  • Haiyang Xu
  • Zongyang Ma
  • Qinghao Ye
  • Anwen Hu
  • Ming Yan
  • Ji Zhang
  • Fei Huang
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu
  • Zheng-Jun Zha

Previous dual-encoder pre-training methods for video-text retrieval employ contrastive learning for cross-modal alignment in a latent space. However, such learned latent spaces often suffer from the modality gap problem [26]. In this paper, we introduce a novel SemVTR framework designed to learn semantics-grounded video-text representations in a vocabulary space, in which each dimension corresponds to a semantic concept represented by a word. The representation is obtained by grounding video and text into semantically related dimensions with high activation values. As video-text pairs share grounded dimensions, their vocabulary representations are expected to cluster together, thus alleviating the modality gap problem. The crux of our method therefore lies in grounding video and text into the vocabulary space. Specifically, we propose a Multi-Granularity Video Semantics Grounding approach and a Textual Semantics Preserving training strategy. Visualizations illustrate that SemVTR obtains semantics-grounded vocabulary representations and alleviates the modality gap problem. SemVTR significantly outperforms existing methods on four video-text retrieval benchmarks.

Learning a Graph Neural Network with Cross Modality Interaction for Image Fusion

  • Jiawei Li
  • Jiansheng Chen
  • Jinyuan Liu
  • Huimin Ma

Infrared and visible image fusion has gradually proved to be a vital branch of multi-modality imaging technologies. In recent developments, researchers not only focus on the quality of fused images but also evaluate their performance in downstream tasks. Nevertheless, most methods seldom focus on mutual learning between different modalities, resulting in fused images lacking significant details and textures. To overcome this issue, we propose an interactive cross-modality graph neural network (GNN)-based architecture for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to obtain shallow features, which are employed as the necessary input to build graph structures. Then, the graph interaction module constructs the extracted intermediate features of the infrared/visible branch into graph structures. Meanwhile, the graph structures of the two branches interact for cross-modality and semantic learning, so that fused images can maintain important feature expressions and enhance the performance of downstream tasks. Besides, the proposed leader nodes can improve information propagation within the same modality. Finally, we merge all graph features to obtain the fusion result. Extensive experiments on different datasets (i.e., TNO, MFNet, and M3FD) demonstrate that our IGNet can generate visually appealing fused images while scoring on average 2.59% higher mAP@.5 in detection and 7.77% higher mIoU in segmentation than the compared state-of-the-art methods. The source code of the proposed IGNet is available at https://github.com/lok-18/IGNet.

COPA : Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment

  • Chaoya Jiang
  • Haiyang Xu
  • Wei Ye
  • Qinghao Ye
  • Chenliang Li
  • Ming Yan
  • Bin Bi
  • Shikun Zhang
  • Fei Huang
  • Ji Zhang

Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches circumvent this issue but struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf fine-grained object annotations for only 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an effective patch detector that accurately detects text-relevant patches, thus considerably shortening patch sequences and accelerating computation within the ViT backbone. Our experiments on a variety of widely used benchmarks reveal that our method achieves a speedup of nearly 88% compared to prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.

Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

  • Shuyu Yang
  • Yinan Zhou
  • Zhedong Zheng
  • Yaxiong Wang
  • Li Zhu
  • Yujiao Wu

In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of performing pre-training on both attribute recognition and image-text matching tasks jointly. In particular, MALS contains 1,510,330 image-text pairs, which is about 37.5× larger than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework, considering the shared knowledge between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) The attribute prompt learning leverages attribute prompts for image-attribute alignment, which enhances the text matching learning. (2) The text matching learning facilitates representation learning of fine-grained details and, in turn, boosts the attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS, achieving state-of-the-art retrieval performance via APTM on three challenging real-world benchmarks. In particular, APTM achieves consistent Recall@1 improvements of +6.96%, +7.68%, and +16.95% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively, by a clear margin. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/APTM.

Towards Real-Time Sign Language Recognition and Translation on Edge Devices

  • Shiwei Gan
  • Yafeng Yin
  • Zhiwei Jiang
  • Lei Xie
  • Sanglu Lu

To provide instant communication for hearing-impaired people, it is essential to achieve real-time sign language processing anytime and anywhere. Therefore, in this paper, we propose a Region-aware Temporal Graph based neural Network (RTG-Net), aiming to achieve real-time Sign Language Recognition (SLR) and Translation (SLT) on edge devices. To reduce the computation overhead, we first construct a shallow graph convolution network, reducing model size by decreasing model depth. Besides, we apply structural re-parameterization to fuse the convolutional layer, batch normalization layer and all branches, simplifying model complexity by reducing model width. To maintain high performance in sign language processing as well, we extract key regions from each frame based on skeleton keypoints, and design a region-aware temporal graph to combine key regions and the full frame for feature representation. In RTG-Net, we design a multi-stage training strategy to optimize keypoint selection, SLR and SLT step by step. Experimental results demonstrate that RTG-Net achieves comparable performance with existing methods in SLR or SLT, while greatly reducing the computation overhead and achieving real-time sign language processing on edge devices. Our code is available at https://github.com/SignLanguageCode/realtimeSLRT.

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

  • Qiwei Li
  • Zuchao Li
  • Xiantao Cai
  • Bo Du
  • Hai Zhao

In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that leverages the modeling of a layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and it achieves state-of-the-art results on these datasets. Our experimental results demonstrate that our proposed method provides a significant improvement over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.

Non-Exemplar Class-Incremental Learning via Adaptive Old Class Reconstruction

  • Shaokun Wang
  • Weiwei Shi
  • Yuhang He
  • Yifan Yu
  • Yihong Gong

In the Class-Incremental Learning (CIL) task, rehearsal-based approaches have received a lot of attention recently. However, storing old class samples is often infeasible in application scenarios where device memory is insufficient or data privacy is important. Therefore, it is necessary to rethink Non-Exemplar Class-Incremental Learning (NECIL). In this paper, we propose a novel NECIL method named POLO with an adaPtive Old cLass recOnstruction mechanism, in which a density-based prototype reinforcement method (DBR), a topology-correction prototype adaptation method (TPA), and an adaptive prototype augmentation method (APA) are designed to reconstruct pseudo features of old classes in new incremental sessions. Specifically, the DBR focuses on the low-density features to maintain the model's discriminative ability for old classes. Afterward, the TPA is designed to adapt old class prototypes to new feature spaces in the incremental learning process. Finally, the APA is developed to further adapt pseudo feature spaces of old classes to new feature spaces. Experimental evaluations on four benchmark datasets demonstrate the effectiveness of our proposed method over the state-of-the-art NECIL methods.
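
As a rough sketch of reconstructing pseudo old-class features from stored prototypes, the snippet below perturbs each class prototype with Gaussian noise, a common prototype-augmentation recipe; the sampling rule and noise scale are placeholders, not the paper's DBR/TPA/APA procedures.

import torch

def sample_pseudo_features(prototypes, labels, n_per_class=16, sigma=0.1):
    """prototypes: (C_old, D) stored class means; labels: (C_old,) class ids."""
    feats = prototypes.repeat_interleave(n_per_class, dim=0)
    feats = feats + sigma * torch.randn_like(feats)      # perturb around prototypes
    ys = labels.repeat_interleave(n_per_class)
    return feats, ys                                     # pseudo features and labels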

CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

  • Ruixiang Jiang
  • Lingbo Liu
  • Changwen Chen

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
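
A hedged sketch of a patch-text contrastive loss in this spirit: pull patches that fall on the counted object toward the text embedding and push background patches away. The positive/negative split from the density map, the soft target, and the temperature are assumptions for illustration, not the released CLIP-Count loss.

import torch
import torch.nn.functional as F

def patch_text_contrastive(patches, text, density, tau=0.07):
    """patches: (B, N, D); text: (B, D); density: (B, N) ground truth per patch."""
    patches = F.normalize(patches, dim=-1)
    text = F.normalize(text, dim=-1)
    sim = torch.einsum("bnd,bd->bn", patches, text) / tau   # patch-text similarity
    pos = (density > 0).float()                              # patches on the object
    # Soft cross-entropy: object patches should dominate the similarity softmax.
    log_prob = sim.log_softmax(dim=-1)
    target = pos / pos.sum(dim=-1, keepdim=True).clamp_min(1.0)
    return -(target * log_prob).sum(dim=-1).mean()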

Self-Supervised Cross-Language Scene Text Editing

  • Fuxiang Yang
  • Tonghua Su
  • Xiang Zhou
  • Donglin Di
  • Zhongjie Wang
  • Songze Li

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task lie in the difficulty of distinguishing text from background, the great distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which is capable of separating the foreground text and background, as well as further decomposing the content and style of the foreground text. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on the Chinese-English cross-language dataset show that our proposed model can generate realistic text images, specifically, modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, Bengali, and so on.

Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER

  • Feng Chen
  • Jiajia Liu
  • Kaixiang Ji
  • Wang Ren
  • Jian Wang
  • Jingdong Chen

The challenge posed by multimodal named entity recognition (MNER) is mainly two-fold: (1) bridging the semantic gap between text and image and (2) matching each entity with its associated object in the image. Existing methods fail to capture the implicit entity-object relations due to the lack of corresponding annotations. In this paper, we propose a bidirectional generative alignment method named BGA-MNER to tackle these issues. Our BGA-MNER consists of image2text and text2image generation with respect to entity-salient content in the two modalities. It jointly optimizes the bidirectional reconstruction objectives, aligning the implicit entity-object relations under these direct and powerful constraints. Furthermore, image-text pairs usually contain unmatched components that are noisy for generation. A stage-refined context sampler is proposed to extract the matched cross-modal content for generation. Extensive experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance without image input during inference.

MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation

  • Liang He
  • Hongke Wang
  • Yongchang Cao
  • Zhen Wu
  • Jianbing Zhang
  • Xinyu Dai

Extracting relational facts from multimodal data is a crucial task in the field of multimedia and knowledge graphs that feeds into widespread real-world applications. The emphasis of recent studies centers on recognizing relational facts in which both entities are present in one modality, with supplementary information used from other modalities. However, such works disregard a substantial amount of multimodal relational facts that arise across different modalities, such as one entity seen in text and another in an image. In this paper, we propose a new task, namely Multimodal Object-Entity Relation Extraction, which aims to extract "object-entity" relational facts from image and text data. To facilitate research on this task, we introduce MORE, a new dataset comprising 21 relation types and 20,136 multimodal relational facts annotated on 3,522 pairs of textual news titles and corresponding images. To show the challenges of Multimodal Object-Entity Relation Extraction, we evaluated recent state-of-the-art methods for multimodal relation extraction and conducted a comprehensive experimental analysis on MORE. Our results demonstrate significant challenges for existing methods, underlining the need for further research on this task. Based on our experiments, we identify several promising directions for future research. The MORE dataset and code are available at https://github.com/NJUNLP/MORE.

Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning

  • Ziyue Wu
  • Junyu Gao
  • Changsheng Xu

Video Scene Graph Generation (VidSGG), which aims to detect the relations between objects in a continuous spatio-temporal environment, has shown great potential in video understanding. Almost all prevailing VidSGG approaches are fully supervised and require expensive manual annotations. Therefore, we introduce a novel and challenging task named Weakly-supervised Video Scene Graph Generation (WS-VidSGG), in which a model is trained with only unlocalized scene graphs as supervisory information. Due to the imbalanced data distribution and the lack of fine-grained annotations, models learned in this setting are prone to bias. Therefore, we propose an Unbiased Cross-Modal Learning (UCML) framework to address the WS-VidSGG task. Specifically, a cross-modal alignment module is first designed to allocate pseudo labels to unlabeled visual objects. We then extract unbiased knowledge from dataset statistics and utilize prompts to help our model comprehend semantic concepts. The features learned from the prompts and the unbiased knowledge reinforce each other, resulting in discriminative textual representations. In order to better explore the relations between visual entities, we design a knowledge-guided attention graph to capture cross-modal relations. Finally, the learned textual and visual features are integrated into a unified framework for relation prediction. Extensive ablation studies verify the effectiveness of our framework. Moreover, the comparison with state-of-the-art fully-supervised methods shows that our proposed framework achieves comparable performance. Code is available at https://github.com/ZiyueWu59/UCML.
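
A minimal sketch of the pseudo-label allocation idea: match each detected visual object to its most similar textual entity from the unlocalized scene graph in a shared embedding space, leaving low-confidence objects unlabeled. The threshold and names are assumptions, not the released UCML code.

import torch
import torch.nn.functional as F

def assign_pseudo_labels(obj_feats, entity_feats, thresh=0.3):
    """obj_feats: (N, D); entity_feats: (M, D) embeddings in a shared space."""
    sim = F.normalize(obj_feats, dim=-1) @ F.normalize(entity_feats, dim=-1).t()
    scores, idx = sim.max(dim=1)          # best-matching entity per object
    idx[scores < thresh] = -1             # leave low-confidence objects unlabeled
    return idx                            # (N,) entity index or -1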

Reducing Intrinsic and Extrinsic Data Biases for Moment Localization with Natural Language

  • Jiong Yin
  • Liang Li
  • Jiehua Zhang
  • Chenggang Yan
  • Lei Zhang
  • Zunjie Zhu

Moment Localization with Natural Language (MLNL) aims to locate the target moment in an untrimmed video given a linguistic query. Recent works reveal the severe data bias problem in MLNL and point out that the multi-modal content may not be truly understood by merely fitting the timestamp distribution. In this paper, we study the data biases from intrinsic and extrinsic aspects: the former is mainly caused by the ambiguity of the moment boundary and the information imbalance between input and output; the latter results from the long-tail distribution of moments in MLNL datasets. To alleviate this, we propose a hybrid multi-modal debiasing network with a temporal consistency constraint for MLNL. Specifically, we first design the multi-temporal Transformer to mitigate the ambiguity of the boundary by integrating frame-wise features into segment-wise ones and dynamically matching them with moment boundaries. Then, we introduce the temporal consistency constraint that highlights the action information in complex moment content to overcome the intrinsic bias from information imbalance. Furthermore, we design the hybrid linguistic activating module with external knowledge to relieve the extrinsic bias, which introduces prior guidance to focus on the discriminative information of tail samples. Extensive experiments on three public datasets demonstrate that our model outperforms existing methods.

VioLET: Vision-Language Efficient Tuning with Collaborative Multi-modal Gradients

  • Yaoming Wang
  • Yuchen Liu
  • Xiaopeng Zhang
  • Jin Li
  • Bowen Shi
  • Chenglin Li
  • Wenrui Dai
  • Hongkai Xiong
  • Qi Tian

Parameter-Efficient Tuning (PET) has emerged as a leading advancement in both Natural Language Processing and Computer Vision, enabling efficient adaptation to downstream tasks without costly fine-tuning. However, most existing PET approaches are limited to uni-modal tuning, even for vision-language models like CLIP. We investigate this limitation and demonstrate that simultaneously tuning the two modalities in such models leads to multi-modal forgetting and catastrophic performance degradation, particularly when generalizing to new classes. To address this issue, we propose a novel PET approach called VioLET (Vision Language Efficient Tuning) that utilizes collaborative multi-modal gradients to unlock the full potential of both modalities. Specifically, we incorporate an additional visual encoder without learnable parameters and use these two visual encoders to compute the gradients of the context parameters separately. When conflicts arise, we replace the original gradient with an orthogonal gradient. Extensive experiments are conducted on few-shot recognition and unseen-class generalization tasks using ResNet-50 or ViT-B/16 as the backbone. VioLET consistently outperforms several state-of-the-art methods on 11 datasets, showcasing its superiority over existing PET approaches. The code is available at https://github.com/Wang-Yaoming/VioLET.
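
One way to read the conflict rule above is as a gradient projection: if the two gradients have a negative dot product, project one onto the plane orthogonal to the other. The sketch below is an interpretation under that assumption, not the released implementation.

import torch

def resolve_gradient_conflict(g_text: torch.Tensor, g_visual: torch.Tensor):
    g_t, g_v = g_text.flatten(), g_visual.flatten()
    if torch.dot(g_t, g_v) < 0:   # conflicting directions
        # Remove the component of g_t that opposes g_v.
        g_t = g_t - torch.dot(g_t, g_v) / g_v.norm().pow(2).clamp_min(1e-12) * g_v
    return g_t.view_as(g_text)

if __name__ == "__main__":
    a, b = torch.randn(10), torch.randn(10)
    g = resolve_gradient_conflict(a, b)
    print(torch.dot(g, b).item())  # ~0 or positive after resolution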

Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing

  • Junyi Zeng
  • Chong Bao
  • Rui Chen
  • Zilong Dong
  • Guofeng Zhang
  • Hujun Bao
  • Zhaopeng Cui

Recently, Neural Radiance Fields (NeRF) has exhibited significant success in novel view synthesis, surface reconstruction, etc. However, since no physical reflection is considered in its rendering pipeline, NeRF mistakes the reflection in the mirror as a separate virtual scene, leading to the inaccurate reconstruction of the mirror and multi-view inconsistent reflections in the mirror. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, which is able to learn accurate geometry and reflection of the mirror and support various scene manipulation applications with mirrors, such as adding new objects or mirrors into the scene and synthesizing the reflections of these new objects in mirrors, controlling mirror roughness, etc. To achieve this goal, we propose a unified radiance field by introducing the reflection probability and tracing rays following the light transport model of Whitted Ray Tracing, and also develop several techniques to facilitate the learning process. Experiments and comparisons on both synthetic and real datasets demonstrate the superiority of our method. The code and supplementary material are available on the project webpage: https://zju3dv.github.io/Mirror-NeRF/.

Semi-supervised Deep Multi-view Stereo

  • Hongbin Xu
  • Weitao Chen
  • Yang Liu
  • Zhipeng Zhou
  • Haihong Xiao
  • Baigui Sun
  • Xuansong Xie
  • Wenxiong Kang

Significant progress has been witnessed in learning-based Multi-view Stereo (MVS) under both supervised and unsupervised settings. To combine their respective merits in accuracy and completeness while reducing the demand for expensive labeled data, this paper explores learning-based MVS in a semi-supervised setting, in which only a tiny portion of the MVS data is annotated with dense depth ground truth. However, the huge variation of scenarios and flexible view settings may break the basic assumption of classic semi-supervised learning, namely that unlabeled and labeled data share the same label space and data distribution; we refer to this as the semi-supervised distribution-gap ambiguity in the MVS problem. To handle this issue, we propose a novel semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the simple case in which the basic assumption holds for the MVS data, consistency regularization encourages the model predictions to be consistent between the original sample and a randomly augmented sample. For the more troublesome case in which the basic assumption is violated, we propose a novel style consistency loss to alleviate the negative effect caused by the distribution gap: the visual style of an unlabeled sample is transferred to a labeled sample to shrink the gap, and the model prediction on the generated sample is further supervised with the label of the original labeled sample. Experimental results in semi-supervised settings on multiple MVS datasets show the superior performance of the proposed method. With the same backbone network settings, the proposed SDA-MVS outperforms its fully-supervised and unsupervised baselines.
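For the simple case above, the consistency-regularization term can be sketched as follows; the sketch assumes a generic depth-predicting `model` and a photometric `augment` function, and is not the SDA-MVS implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, sample, augment):
    """Consistency regularization on unlabeled MVS samples (illustrative sketch).

    `model` maps a multi-view sample to a depth map; `augment` applies a random
    photometric augmentation. Predictions on the original and augmented sample
    are encouraged to agree. All names here are assumptions, not the SDA-MVS API.
    """
    with torch.no_grad():
        depth_ref = model(sample)          # prediction on the original sample
    depth_aug = model(augment(sample))     # prediction on the augmented sample
    return F.smooth_l1_loss(depth_aug, depth_ref)
```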

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

  • Chen Jiang
  • Hong Liu
  • Xuzheng Yu
  • Qing Wang
  • Yuan Cheng
  • Jia Xu
  • Zhongyi Liu
  • Qingpei Guo
  • Wei Chu
  • Ming Yang
  • Yuan Qi

In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
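As an illustration of how hard negatives can be emphasized in a contrastive objective of this kind, the sketch below adds a margin-based penalty for negatives that score close to the positive on top of a standard InfoNCE term; the exact weighting in NegNCE may differ, and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def neg_aware_infonce(sim: torch.Tensor, tau: float = 0.05, margin: float = 0.1,
                      alpha: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of a negative-aware contrastive loss.

    `sim` is a (B, B) text-video similarity matrix with positives on the
    diagonal. On top of the standard InfoNCE term, negatives whose similarity
    comes within `margin` of the positive are treated as hard negatives and
    receive an extra penalty.
    """
    b = sim.size(0)
    labels = torch.arange(b, device=sim.device)
    infonce = F.cross_entropy(sim / tau, labels)

    pos = sim.diag().unsqueeze(1)                       # (B, 1) positive scores
    neg_mask = ~torch.eye(b, dtype=torch.bool, device=sim.device)
    hard = ((sim > pos - margin) & neg_mask).float()    # hard-negative indicator
    hard_penalty = (hard * F.relu(sim - pos + margin)).sum() / b
    return infonce + alpha * hard_penalty
```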

Temporal Sentence Grounding in Streaming Videos

  • Tian Gan
  • Xiao Wang
  • Yan Sun
  • Jianlong Wu
  • Qingpei Guo
  • Liqiang Nie

This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source and are expected to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and to process long historical frames effectively, which early methods do not address. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using the ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.
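One simple way to realize a language-guided compressor of the kind described above is to score historical frames against the query embedding and keep only the most relevant ones, as in the hedged sketch below; this is a simplified stand-in, not the authors' implementation.

```python
import torch

def compress_history(frame_feats: torch.Tensor, query_feat: torch.Tensor,
                     keep: int = 64) -> torch.Tensor:
    """Illustrative sketch of a language-guided feature compressor.

    frame_feats: (T, D) features of the long streaming history.
    query_feat:  (D,)   sentence-query embedding.
    Frames are scored by cosine similarity to the query and only the
    `keep` most relevant ones are retained, in temporal order.
    """
    scores = torch.nn.functional.cosine_similarity(
        frame_feats, query_feat.unsqueeze(0), dim=-1)      # (T,)
    k = min(keep, frame_feats.size(0))
    idx = scores.topk(k).indices.sort().values              # keep temporal order
    return frame_feats[idx]
```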

Modality-agnostic Augmented Multi-Collaboration Representation for Semi-supervised Heterogenous Face Recognition

  • Decheng Liu
  • Weizhao Yang
  • Chunlei Peng
  • Nannan Wang
  • Ruimin Hu
  • Xinbo Gao

Heterogeneous face recognition (HFR) aims to match input face identities across different image modalities. Due to the large modality gap and the limited amount of training data, HFR remains a challenging problem in biometrics and draws increasing attention. Existing methods typically extract modality-invariant features or generate homogeneous images to reduce the modality gap, yet they lack abundant labeled data and are prone to overfitting. In this paper, we propose a novel Modality-Agnostic Augmented Multi-Collaboration representation for Heterogeneous Face Recognition (MAMCO-HFR) in a semi-supervised manner. The modality-agnostic augmentation strategy is proposed to generate adversarial perturbations that map unlabeled faces into the modality-agnostic domain. The multi-collaboration feature constraint is designed to mine the inherent relationships between diverse layers for discriminative representation. Experiments on several large-scale heterogeneous face datasets (CASIA NIR-VIS 2.0, LAMP-HQ and the Tufts Face dataset) show that the proposed algorithm achieves superior performance compared with state-of-the-art methods. The source code is available at https://github.com/xiyin11/Semi-HFR.

Swin-UNIT: Transformer-based GAN for High-resolution Unpaired Image Translation

  • Yifan Li
  • Yaochen Li
  • Wenneng Tang
  • Zhifeng Zhu
  • Jinhuo Yang
  • Yuehu Liu

The transformer model has achieved great success in various computer vision tasks owing to its capacity for modeling long-range dependencies. However, its application has been limited in the area of high-resolution unpaired image translation with GANs due to its quadratic complexity with respect to the spatial resolution of input features. In this paper, we propose a novel transformer-based GAN for high-resolution unpaired image translation named Swin-UNIT. A two-stage generator is designed which consists of a global style translation (GST) module and a recurrent detail supplement (RDS) module. The GST module focuses on translating low-resolution global features using the ability of self-attention. The RDS module offers quick information propagation from the global features to the detail features at high resolution using cross-attention. Moreover, we customize a dual-branch discriminator to guide the generator. Extensive experiments demonstrate that our model achieves state-of-the-art results on unpaired image translation tasks.

PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks

  • Xiaoxiong Du
  • Jun Peng
  • Yiyi Zhou
  • Jinlu Zhang
  • Siting Chen
  • Guannan Jiang
  • Xiaoshuai Sun
  • Rongrong Ji

Synthesizing vivid human portraits is a research hotspot in image generation with a wide scope of applications. In addition to fidelity, generation controllability is another key factor that has long plagued its development. To address this issue, existing solutions usually adopt either textual or visual conditions for target face synthesis, e.g., descriptions or segmentation masks, which still cannot fully control the generation due to the intrinsic shortcomings of each condition. In this paper, we propose to make use of both types of prior information to facilitate controllable face generation. In particular, we produce coarse-grained information about faces, such as face shapes and poses, from the segmentation masks, while the text description is used to render detailed face attributes, e.g., face color, makeup and gender. More importantly, we hope that the generation can be easily controlled via interactively editing both types of information, making face generation more applicable to real-world applications. To accomplish this target, we propose a novel face generation model termed PixelFace+. In PixelFace+, both the text and mask are encoded as pixel-wise priors, based on which the pixel synthesis process is conducted to produce the expected portraits. Meanwhile, the loss objectives are carefully designed to make sure that the generated faces are semantically aligned with both text and mask inputs. To validate the proposed PixelFace+, we conducted a comprehensive set of experiments on the widely recognized MMCelebA benchmark. We not only quantitatively compare PixelFace+ with a number of recently proposed Text-to-Face (T2F) generation methods, but also provide extensive qualitative analyses. The experimental results demonstrate that PixelFace+ not only outperforms existing generation methods in both image quality and conditional matching but also shows much superior controllability of face generation. More importantly, PixelFace+ presents a convenient and interactive way of face generation and manipulation via editing the text and mask inputs. Our source code and demo are given in the supplementary materials.

LiFT: Transfer Learning in Vision-Language Models for Downstream Adaptation and Generalization

  • Jingzheng Li
  • Hailong Sun

Vision-Language Models (VLMs) pre-trained on large-scale image-text pairs, e.g., CLIP, have shown promising performance on zero-shot knowledge transfer. Recently, fine-tuning pre-trained VLMs for downstream few-shot classification with limited image annotation data has yielded significant gains. However, there are two limitations. First, most methods for fine-tuning VLMs only update newly added parameters while keeping the whole VLM frozen, so it remains unclear how to directly update the VLM itself. Second, fine-tuning VLMs to a specific set of base classes would deteriorate the well-learned representation space, so the VLMs generalize poorly to novel classes. To address these issues, we first propose Layer-wise Fine-Tuning (LiFT), which achieves average gains of 3.9%, 4.3%, 4.2% and 4.5% on base classes under 2-, 4-, 8- and 16-shot settings respectively compared to the baseline CoOp over 11 datasets. Alternatively, we provide a parameter-efficient LiFT-Adapter exhibiting favorable performance while updating only 1.66% of total parameters. Further, we design a scalable LiFT-NCD to identify both base classes and novel classes, which boosts accuracy by an average of 5.01% over the zero-shot generalization of CLIP, exploring the potential of VLMs in discovering novel classes.

VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts

  • Manman Zhang
  • Ge Luo
  • Yuchen Ma
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang

Live video commenting, or "bullet screen," is a popular social feature on video platforms. Automatic live commenting has been explored as a promising approach to enhance the appeal of videos. However, existing methods neglect the diversity of generated sentences, limiting the potential to obtain human-like comments. In this paper, we introduce a novel framework called "VCMaster" for multimodal live video comment generation, which balances the diversity and quality of generated comments to create human-like sentences. We use images, subtitles, and contextual comments as inputs to better understand complex video contexts. Then, we propose an effective Hierarchical Cross-Fusion Decoder to integrate high-quality trimodal feature representations by cross-fusing critical information from previous layers. Additionally, we develop a Sentence-Level Contrastive Loss that enlarges the distance between generated and contextual comments via contrastive learning. It helps the model avoid the pitfall of simply imitating provided contextual comments and losing creativity, encouraging more diverse comments while maintaining high quality. We also construct a large-scale multimodal live video comments dataset with 292,507 comments and three sub-datasets that cover nine general categories. Extensive experiments demonstrate that our model achieves human-like language expression and generates remarkably more fluent, diverse, and engaging comments than baselines.
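A sentence-level contrastive term that pushes generated comments away from the contextual ones could look like the following sketch, assuming precomputed sentence embeddings; the actual VCMaster loss may be formulated differently, and the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def sentence_level_contrast(gen_emb: torch.Tensor, ctx_embs: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Illustrative sketch of a sentence-level contrastive term.

    gen_emb:  (D,)   embedding of the generated comment.
    ctx_embs: (N, D) embeddings of the contextual comments.
    Penalize the generated comment when it is too similar to any contextual
    comment, discouraging plain imitation.
    """
    sims = F.cosine_similarity(ctx_embs, gen_emb.unsqueeze(0), dim=-1)  # (N,)
    return F.relu(sims - (1.0 - margin)).mean()
```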

Whether you can locate or not? Interactive Referring Expression Generation

  • Fulong Ye
  • Yuxing Long
  • Fangxiang Feng
  • Xiaojie Wang

Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object. Existing methods construct REG models independently by using only the REs as ground truth for model training, without considering the potential interaction between REG and REC models. In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. Our experimental results on three RE benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics. Furthermore, a human evaluation shows that IREG generates better REs with the capability of interaction.

Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation

  • Yiming Li
  • Xiaoshan Yang
  • Changsheng Xu

Dynamic scene graphs have become a powerful tool for higher-level visual understanding tasks, and interest in dynamic scene graph generation (dynamic SGG) has grown over time. Recently, a number of methods have achieved significant progress in dynamic SGG by capturing temporal information with transformer or recurrent network structures. However, most existing methods focus only on predicting the head predicates and ignore the long-tail phenomenon, so the tail predicates are hard to recognize. In this paper, we propose a novel method named Iterative Learning with Extra and Inner Knowledge (I2LEK) to address the long-tail problem in dynamic SGG. The extra knowledge is obtained from commonsense, while inner knowledge is defined as the temporal evolution patterns of visual relationships. Specifically, we introduce extra knowledge to enrich the representations of predicates in the spatial dimension and adopt inner knowledge to implement knowledge sharing in the temporal dimension. With enriched representations and shared knowledge, I2LEK can accurately predict both the tail and head predicates. Moreover, an iterative learning strategy is proposed to fuse the extra knowledge, inner knowledge, and spatial-temporal context contained in videos, which further enhances the model's understanding of visual relationships. Our experimental results on the public Action Genome dataset demonstrate that our model achieves state-of-the-art performance.

Improving Image Captioning through Visual and Semantic Mutual Promotion

  • Jing Zhang
  • Yingshuai Xie
  • Xiaoqiang Liu

Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, which explores the mechanism of deep fusion and mutual promotion of multimodal information, realizing more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder to realize globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion for enhancing the decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset and reaches an excellent CIDEr score of 142% on the Karpathy test split.

Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

  • Minghao Zhu
  • Xiao Lin
  • Ronghao Dang
  • Chengju Liu
  • Qijun Chen

As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.
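The frame-difference motion source mentioned above is straightforward to compute; the snippet below shows the basic operation (the pixel-level alignment and contrastive machinery of FIMA are not reproduced here).

```python
import torch

def frame_difference(clip: torch.Tensor) -> torch.Tensor:
    """Compute frame differences as a cheap motion signal (illustrative sketch).

    clip: (T, C, H, W) RGB frames. Returns T-1 absolute difference maps that
    can serve as a motion source for contrastive video representation learning.
    """
    return (clip[1:] - clip[:-1]).abs()

# toy usage
clip = torch.rand(8, 3, 112, 112)
motion = frame_difference(clip)   # shape (7, 3, 112, 112)
```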

Better Integrating Vision and Semantics for Improving Few-shot Classification

  • Zhuoling Li
  • Yong Wang

Some recent methods address few-shot classification by integrating visual and semantic prototypes. However, they usually ignore the difference in feature structure between the visual and semantic modalities, which leads to limited performance improvements. In this paper, we propose a novel method, called bimodal integrator (BMI), to better integrate visual and semantic prototypes. In BMI, we first construct a latent space for each modality via a variational autoencoder, and then align the semantic latent space to the visual latent space. Through this semantics-to-vision alignment, the semantic modality is mapped to the visual latent space and has the same feature structure as the visual modality. As a result, the visual and semantic prototypes can be better integrated. In addition, based on the multivariate Gaussian distribution and the prompt engineering, a data augmentation scheme is designed to ensure the accuracy of modality alignment during the training process. Experimental results demonstrate that BMI significantly improves few-shot classification, making simple baselines outperform the most advanced methods on miniImageNet and tieredImageNet datasets.

Multi-Domain Lifelong Visual Question Answering via Self-Critical Distillation

  • Mingrui Lao
  • Nan Pu
  • Yu Liu
  • Zhun Zhong
  • Erwin M. Bakker
  • Nicu Sebe
  • Michael S. Lew

Visual Question Answering (VQA) has achieved significant success over the last few years, while most studies focus on training a VQA model on a stationary domain (e.g., a given dataset). In real-world application scenarios, however, these methods are often inefficient because VQA systems are always supposed to extend their knowledge and meet the ever-changing demands of users. In this paper, we introduce a new and challenging multi-domain lifelong VQA task, dubbed MDL-VQA, which encourages the VQA model to continuously learn across multiple domains while mitigating forgetting on previously-learned domains. Furthermore, we propose a novel replay-free Self-Critical Distillation (SCD) framework tailor-made for MDL-VQA, which alleviates the forgetting issue by transferring previous-domain knowledge from teacher to student models. First, we propose to introspect the teacher's understanding of original and counterfactual samples, thereby creating informative instance-relevant and domain-relevant knowledge for logits-based distillation. Second, for feature-based distillation, we propose to introspect the reasoning behavior of the student model to identify the harmful domain-specific knowledge acquired in the current domain, and further leverage a metric learning strategy to encourage the student to learn useful knowledge in the new domain. Extensive experiments demonstrate that the SCD framework outperforms state-of-the-art competitors under different training orders.

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

  • Xue Song
  • Jingjing Chen
  • Yu-Gang Jiang

Cross-modal text-to-video retrieval aims to find semantically related videos for a text query. Since video and text are distinct modalities, the major challenge comes from building the correspondence between two modalities, thus relevant samples could be matched. Inherently, the text contains multiple relatively complete semantic units and each one is composed of three primary components, i.e., subject, predicate and object (SVO triplet). Therefore, it requires similar modeling of video content -- objects and their relations, to correctly retrieve videos for texts. To model fine-grained visual relations, this paper proposes a Multi-Granularity Matching (MGM) framework that considers both fine-grained relation triplet matching and coarse-grained global semantic matching for text-to-video retrieval. Specifically, in the proposed framework, we represent videos as SVO triplet tracklets by extracting frame-level relation triplets followed by temporal relation association across frames. Moreover, we design a transformer-based Bi-directional Fusion Block (BFB) to express each SVO triplet with a highly unified representation. The constructed SVO triplet tracklets provide a reasonable way to model fine-grained video contents, fulfilling a better alignment between videos and texts. Extensive experiments conducted on three benchmark datasets, i.e., MSR-VTT, LSMDC and MSVD, demonstrate the effectiveness of our proposed method.

HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification

  • Shuyi Ouyang
  • Hongyi Wang
  • Ziwei Niu
  • Zhenjia Bai
  • Shiao Xie
  • Yingying Xu
  • Ruofeng Tong
  • Yen-Wei Chen
  • Lanfen Lin

The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Additionally, they only extract visual features and perform cross-modal fusion at a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) A hierarchical multi-scale architecture that involves a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances in images. (2) Interactive Visual-Linguistic Attention, a novel attention mechanism module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic and multi-modal features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.

Depth-Aware Sparse Transformer for Video-Language Learning

  • Haonan Zhang
  • Lianli Gao
  • Pengpeng Zeng
  • Alan Hanjalic
  • Heng Tao Shen

In Video-Language (VL) learning tasks, a massive amount of text annotations describe geometrical relationships of instances (e.g., 19.6% to 45.0% in MSVD, MSR-VTT, MSVD-QA and MSRVTT-QA), which often become the bottleneck of current VL tasks (e.g., 60.8% vs. 98.2% CIDEr in MSVD for geometrical and non-geometrical annotations). Considering the rich spatial information of depth maps, an intuitive way is to enrich the conventional 2D visual representations with depth information through current SOTA models, e.g., the transformer. However, it is cumbersome to compute self-attention over long-range sequences and heterogeneous video-level representations with regard to computation cost and flexibility across frame scales. To tackle this, we propose a hierarchical transformer, termed Depth-Aware Sparse Transformer (DAST). Specifically, to guarantee computational efficiency, a depth-aware sparse attention module with linear computational complexity is designed for each transformer layer to learn depth-aware 2D representations. Furthermore, we design a hierarchical structure to maintain multi-scale temporal coherence across long-range dependencies. These qualities of DAST make it compatible with a broad range of video-language tasks, including video captioning (achieving 107.8% CIDEr on MSVD and 52.5% on MSR-VTT), video question answering (44.1% on MSVD-QA, 39.4% on MSRVTT-QA), and video-text matching (215.7 SumR on MSR-VTT). Our code is available at https://github.com/zchoi/DAST.

Invariant Meets Specific: A Scalable Harmful Memes Detection Framework

  • Chuanpeng Yang
  • Fuqing Zhu
  • Jizhong Han
  • Songlin Hu

Harmful memes detection is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. Current research on this task mainly focuses on multimodal dual-stream models. However, the existing works ignore the misalignment of the memes caused by the modality gap. Moreover, the cross-modal interaction in the dual-stream models is insufficient to identify harmful memes. To this end, this paper proposes a scalable invariant and specific modality (ISM) representations framework via graph neural networks. The proposed ISM framework provides a comprehensive and disentangled view for memes and promotes inter-modal interaction. Specifically, ISM projects each modality to two distinct spaces. The first space is modality-invariant, learning the corresponding commonalities and reducing the modality gap. The second space is modality-specific, holding the distinctive characteristics of each modality and complementing the common latent features captured in invariant spaces. Then, we construct fully connected visual and textual graphs for each space. The unimodal graphs are fused to dynamically balance inter-modal and intra-modal relationships, which are complementary to the dual-stream models. Finally, an adaptive module is designed to weigh the proportion of each fusion graph for memes. Moreover, the mainstream multimodal dual-stream models could be employed as the backbone flexibly. Extensive experiments on five publicly available datasets show that the proposed ISM provides a stable improvement over baselines and produces a competitive performance compared with the existing harmful memes detection methods.

A Method of Micro-Geometric Details Preserving in Surface Reconstruction from Gradient

  • Wuyuan Xie
  • Miaohui Wang

Surface from gradient (SfG) is one of the fundamental methods for densely reconstructing 3D object surfaces in computer vision. However, the reconstruction of micro-geometric details has not been satisfactorily solved in existing SfG methods due to their non-integrability. In this paper, we present an effective discrete geometric approach to reconstruct fine-grained sharp surface features under non-integrability. Specifically, we investigate the fine-grained structure of surfaces in the micro-geometry domain based on an adaptive projection on vertices constrained by neighboring gradient vectors, and develop a gradient angle-guided energy optimization to generate a fine-grained surface. Experimental results on various challenging synthetic and real-world data show that the proposed method is able to effectively reconstruct challenging micro-geometric details for general SfG methods.

Progressive Positive Association Framework for Image and Text Retrieval

  • Wenhui Li
  • Yan Wang
  • Yuting Su
  • Lanjun Wang
  • Weizhi Nie
  • An-An Liu

With the increasing amount of multimedia data, the demand for fast and accurate access to information is growing. Image and text retrieval learns visual and textual semantic relationships for multimedia data management and content recognition. The main challenge of this task is how to derive image and text similarity from local associations under a huge modality gap. However, existing methods compute semantic relevance using the associations of all fragments (visual regions and textual words), which underestimates the uncertainty of associations and the discriminative positive associations, leading to cross-modal correspondence ambiguity. To address these issues, we propose a novel Progressive Positive Association Framework (PPAF), which models association uncertainty as a normal distribution and progressively mines direct and potential positive associations according to the characteristics of the association distribution. We design positive association matching, which adaptively fuses multi-step associations for local matching depending on the relevance difference. In addition, we apply a KL loss constraint on the cross-modal association distribution to enhance local semantic alignment. Extensive experiments demonstrate the leading performance of PPAF.
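Modeling an association as a normal distribution naturally pairs with a KL constraint between the visual-side and textual-side distributions; the sketch below gives the closed-form KL divergence between diagonal Gaussians that such a constraint could use, as an illustration rather than the exact PPAF objective.

```python
import torch

def gaussian_kl(mu_p: torch.Tensor, logvar_p: torch.Tensor,
                mu_q: torch.Tensor, logvar_q: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between two diagonal Gaussians parameterized by mean and
    log-variance, averaged over the batch. Used here as an illustrative
    stand-in for a KL constraint on cross-modal association distributions.
    """
    return 0.5 * (
        (logvar_q - logvar_p)
        + (logvar_p.exp() + (mu_p - mu_q) ** 2) / logvar_q.exp()
        - 1.0
    ).sum(dim=-1).mean()
```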

Globally-Robust Instance Identification and Locally-Accurate Keypoint Alignment for Multi-Person Pose Estimation

  • Fangzheng Tian
  • Sungchan Kim

Scenes with a large number of human instances are characterized by significant overlap of the instances with similar appearance, occlusion, and scale variation. We propose GRAPE, a novel method that leverages both Globally Robust human instance identification and locally Accurate keypoint alignment for 2D Pose Estimation. GRAPE predicts instance center and keypoint heatmaps, as global identifications of instance location and scale, and keypoint offset vectors from instance centers, as representations of accurate local keypoint positions. We use Transformer to jointly learn the global and local contexts, which allows us to robustly detect instance centers even in difficult cases such as crowded scenes, and align instance offset vectors with relevant keypoint heatmaps, resulting in refined final poses. GRAPE also predicts keypoint visibility, which is crucial for estimating centers of partially visible instances in crowded scenes. We demonstrate that GRAPE achieves state-of-the-art performance on the CrowdPose, OCHuman, and COCO datasets. The benefit of GRAPE is more apparent on crowded scenes (CrowdPose and OCHuman), where our model significantly outperforms previous methods, especially on hard examples.

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

  • Kun Zhang
  • Lei Zhang
  • Bo Hu
  • Mengxiao Zhu
  • Zhendong Mao

Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and textual features, the existing paradigm typically first maps them into a d-dimensional shared representation space, and then independently aggregates all dimensional correspondences of cross-modal features to reflect it, e.g., via the inner product. However, in this paper, we are motivated by an insightful finding that dimensions are not mutually independent; rather, there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information probably leads to suboptimal aggregation for semantic similarity, impairing cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn dimensions' semantic dependency degrees, and (2) devises adaptive sparse probabilistic learning to autonomously make the model capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
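The contrast between independent dimension-wise aggregation and dependency-aware aggregation can be illustrated with a small bilinear scorer, where a learnable matrix mixes dimensions before aggregation; this is a simplified stand-in for X-Dim's sparse probabilistic dependency learning, not its actual formulation.

```python
import torch
import torch.nn as nn

class DependencyAwareSimilarity(nn.Module):
    """Illustrative sketch of cross-dimensional dependency-aware matching.

    Instead of the plain inner product sum_i v_i * t_i, a learnable matrix M
    mixes dimensions so that s(v, t) = v^T M t, letting dimensions with joint
    dependencies reinforce each other.
    """
    def __init__(self, dim: int):
        super().__init__()
        # initialize near the identity so training starts from the inner product
        self.M = nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: (B, D) image embeddings, t: (B, D) text embeddings -> (B, B) scores
        return v @ self.M @ t.t()
```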

Dark Knowledge Balance Learning for Unbiased Scene Graph Generation

  • Zhiqing Chen
  • Yawei Luo
  • Jian Shao
  • Yi Yang
  • Chunping Wang
  • Lei Chen
  • Jun Xiao

One of the major obstacles that hinders current scene graph generation (SGG) performance lies in the severe predicate annotation bias. Conventional solutions to this problem are mainly based on reweighting/resampling heuristics. Despite achieving some improvements on tail classes, these methods are prone to cause serious performance degradation on head predicates. In this paper, we propose to tackle this problem from a brand-new perspective of dark knowledge. In consideration of the unique nature of SGG, which requires a large number of negative samples for predicate learning, we capitalize on the dark knowledge contained in negative samples to debias the predicate distribution. Along this vein, we propose a novel SGG method dubbed Dark Knowledge Balance Learning (DKBL). In DKBL, we first design a dark knowledge balancing loss, which helps the model learn to balance head and tail predicates while maintaining overall performance. We further introduce a dark knowledge semantic enhancement module to better encode the semantics of predicates. DKBL is orthogonal to existing SGG methods and can be easily plugged into their training process for further improvement. Extensive experiments on the VG dataset show that the proposed DKBL consistently achieves a good trade-off between head and tail predicates, significantly better than previous state-of-the-art methods. The code is available at https://github.com/chenzqing/DKBL.

Orthogonal Uncertainty Representation of Data Manifold for Robust Long-Tailed Learning

  • Yanbiao Ma
  • Licheng Jiao
  • Fang Liu
  • Shuyuan Yang
  • Xu Liu
  • Lingling Li

In scenarios with long-tailed distributions, the model's ability to identify tail classes is limited due to the under-representation of tail samples. Class rebalancing, information augmentation, and other techniques have been proposed to help models learn the potential distribution of tail classes. The disadvantage is that these methods generally pursue balanced class accuracy on the data manifold while ignoring the model's ability to resist interference. By constructing a noisy data manifold, we find that the robustness of models trained on unbalanced data exhibits a long-tail phenomenon: even if class accuracy is balanced on the data domain, it is still biased on the noisy data manifold. However, existing methods cannot effectively mitigate this phenomenon, which makes the model vulnerable in long-tailed scenarios. In this work, we propose an Orthogonal Uncertainty Representation (OUR) of feature embedding and an end-to-end training strategy to improve the long-tail phenomenon of model robustness. As a general enhancement tool, OUR has excellent compatibility with other methods and does not require additional data generation, ensuring fast and efficient training. Comprehensive evaluations on long-tailed datasets show that our method significantly improves the long-tail phenomenon of robustness, bringing consistent performance gains to other long-tailed learning methods.

Topological Structure Learning for Weakly-Supervised Out-of-Distribution Detection

  • Rundong He
  • Rongxue Li
  • Zhongyi Han
  • Xihong Yang
  • Yilong Yin

Out-of-distribution (OOD) detection is the key to deploying models safely in the open world. For OOD detection, collecting sufficient in-distribution (ID) labeled data is usually more time-consuming and costly than collecting unlabeled data. When ID labeled data is limited, previous OOD detection methods are no longer superior due to their high dependence on the amount of ID labeled data. Based on limited ID labeled data and sufficient unlabeled data, we define a new setting called Weakly-Supervised Out-of-Distribution Detection (WSOOD). To solve the new problem, we propose an effective method called Topological Structure Learning (TSL). Firstly, TSL uses a contrastive learning method to build the initial topological structure space for ID and OOD data. Secondly, TSL mines effective topological connections in the initial topological space. Finally, based on limited ID labeled data and mined topological connections, TSL reconstructs the topological structure in a new topological space to increase the separability of ID and OOD instances. Extensive studies on several representative datasets show that TSL remarkably outperforms the state-of-the-art, verifying the validity and robustness of our method in the new setting of WSOOD.

Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition

  • Weikang Wang
  • Jing Liu
  • Yuting Su
  • Weizhi Nie

Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal object tube in a video according to a given text query. Current approaches address the STVG task with end-to-end frameworks while suffering from heavy computational complexity and insufficient spatio-temporal interactions. To overcome these limitations, we propose a novel Semantic-Guided Feature Decomposition based Network (SGFDN). A semantic-guided mapping operation is proposed to decompose the 3D spatio-temporal feature into 2D motions and 1D object embedding without losing much object-related semantic information. Thus, the computational complexity in computationally expensive operations such as attention mechanisms can be effectively reduced by replacing the input spatio-temporal feature with the decomposed features. Furthermore, based on this decomposition strategy, a pyramid relevance filtering based attention is proposed to capture the cross-modal interactions at multiple spatio-temporal scales. In addition, a decomposition-based grounding head is proposed to locate the queried objects with less computational complexity. Extensive experiments on two widely-used STVG datasets (VidSTG and HC-STVG) demonstrate that our method enjoys state-of-the-art performance as well as less computational complexity. The code is available at https://github.com/TJUMMG/SGFDN.

Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference

  • Jiale Lu
  • Lianggangxu Chen
  • Youqi Song
  • Shaohui Lin
  • Changbo Wang
  • Gaoqi He

The task of dynamic scene graph generation (DSGG) aims at constructing a set of frame-level scene graphs for a given video. It suffers from two kinds of spurious correlation problems. First, the spurious correlation between the input object pair and the predicate label is caused by the biased predicate sample distribution in the dataset. Second, the spurious correlation between contextual information and the predicate label arises from interference caused by background content in both the current frame and adjacent frames of the video sequence. To alleviate these spurious correlations, our work is formulated into two sub-tasks: video-specific commonsense graph generation (VsCG) and causal inference (CI). The VsCG module aims to alleviate the first correlation by integrating prior knowledge into prediction. Information from all frames in the current video is used to enhance the commonsense graph constructed from co-occurrence patterns of all training samples. Thus, the commonsense graph is augmented with video-specific temporal dependencies. Then, a CI strategy with both intervention and counterfactual components is used. The intervention component further eliminates the first correlation by forcing the model to consider all possible predicate categories fairly, while the counterfactual component resolves the second correlation by removing the bad effect of context. Comprehensive experiments on the Action Genome dataset show that the proposed method achieves state-of-the-art performance.

ATM: Action Temporality Modeling for Video Question Answering

  • Junwen Chen
  • Jie Zhu
  • Yu Kong

Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking optical flow and realizing that it is effective in capturing long-horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering questions given shuffled videos in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms existing approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
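Point (3) above can be approximated by penalizing confident answers on temporally shuffled clips, as in the sketch below; the model interface and the penalty form are assumptions for illustration, not the ATM training objective.

```python
import torch
import torch.nn.functional as F

def shuffled_video_penalty(model, video: torch.Tensor, question, answer_idx: int):
    """Discourage answering from a temporally shuffled clip (illustrative sketch).

    The frames of `video` (T, C, H, W) are randomly permuted; if the model can
    still answer confidently from the shuffled clip, its prediction relies on
    appearance rather than temporality, so that confidence is penalized.
    `model(video, question)` returning answer logits is an assumed interface.
    """
    perm = torch.randperm(video.size(0))
    logits_shuffled = model(video[perm], question)
    probs = F.softmax(logits_shuffled, dim=-1)
    return probs[..., answer_idx].mean()   # add this term to the training loss
```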

CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting

  • Shaoxiang Guo
  • Qing Cai
  • Lin Qi
  • Junyu Dong

Contrastive Language-Image Pre-training (CLIP) has started to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed CLIP-Hand3D, which successfully bridges the gap between text prompts and the irregular, detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial distribution (in the x, y, and z axes) is encoded to form pose-aware features. Subsequently, we maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods utilizing a similar-scale backbone. Code is available at: https://github.com/ShaoXiang23/CLIP_Hand_Demo.

A Multitask Framework for Graffiti-to-Image Translation

  • Ying Yang
  • Mulin Chen
  • Xuelong Li

Recently, image-to-image translation models have achieved great success in terms of content consistency and visual fidelity. However, in most of these tasks, the inaccuracy of sketches and the high cost of acquiring fine semantic masks limit the large-scale use of image translation models. Therefore, we propose to use graffiti, which combines the advantages of sketches and semantic masks, as model input. Graffiti reflects the general content of an image using lines and color distinctions, with some unlabeled regions. However, due to the large number of unknown areas in the graffiti, the generated results may be blurred, resulting in poor visual quality. To address these challenges, this paper proposes a multi-task framework that can predict unknown regions by learning semantic masks from graffiti, thereby improving the quality of generated real-scene images. Furthermore, by introducing an edge activation module, which utilizes semantic and edge information to optimize the object boundaries of the generated images, the details of the generated images can be improved. Experiments on the Cityscapes dataset demonstrate that our multi-task framework achieves competitive performance on the graffiti-based image generation task.

Adaptive Contrastive Learning for Learning Robust Representations under Label Noise

  • Zihao Wang
  • Weichen Zhang
  • Weihong Bao
  • Fei Long
  • Chun Yuan

Deep Neural Networks suffer significant performance degeneration when noisy labels corrupt latent data representations. Previous work has attempted to alleviate this problem by exploiting contrastive learning, in which pair building is critical. However, existing methods either conduct sample-level selection and then use the resultant subset to construct pairs, or directly perform pair-level selection with a fixed threshold, both leading to sub-optimal pairing and subsequent representation learning. To address this issue, we propose a novel adaptive contrastive learning method (ACL) that works at the pair level to select contrastive pairs adaptively. Specifically, we consider the model's learning status to adjust the confidence threshold in a self-adaptive manner instead of fixing it. Then, to address the ineffectiveness of thresholding on unconfident pairs, we automatically apply an instance-specific temperature to boost the confidence of accurately-predicted samples and their pairs. We further introduce temporal cross-ensembling to handle the impact of noisy labels on model predictions. As a result, diverse pairs are correctly selected for contrastive learning to induce discriminative representations robust to various types of label noise. Extensive experimental results on several standard benchmarks and real-world datasets indicate the superiority of ACL, especially in extremely noisy scenarios.
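A self-adaptive confidence threshold driven by the model's learning status might be maintained as an exponential moving average of batch confidence, as sketched below; the instance-specific temperature and temporal cross-ensembling of ACL are omitted, and all names are illustrative assumptions.

```python
import torch

class AdaptiveThreshold:
    """Illustrative sketch of a self-adaptive confidence threshold.

    The threshold tracks an exponential moving average of the model's maximum
    prediction confidence, so pair selection tightens as training matures.
    """
    def __init__(self, init: float = 0.5, momentum: float = 0.99):
        self.value = init
        self.momentum = momentum

    def update(self, probs: torch.Tensor) -> float:
        # probs: (B, C) softmax outputs of the current batch
        batch_conf = probs.max(dim=1).values.mean().item()
        self.value = self.momentum * self.value + (1 - self.momentum) * batch_conf
        return self.value

    def select_pairs(self, pair_conf: torch.Tensor) -> torch.Tensor:
        # keep only pairs whose confidence exceeds the current threshold
        return pair_conf > self.value
```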

Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

  • Yunyi Xuan
  • Weijie Chen
  • Shicai Yang
  • Di Xie
  • Luojun Lin
  • Yueting Zhuang

Data-Free Knowledge Distillation (DFKD) has shown great potential in creating a compact student model while alleviating the dependency on real training data by synthesizing surrogate data. However, prior arts are seldom discussed under distribution shifts, which may be vulnerable in real-world applications. Recent Vision-Language Foundation Models, e.g., CLIP, have demonstrated remarkable performance in zero-shot out-of-distribution generalization, yet consuming heavy computation resources. In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts, inheriting the out-of-distribution generalization capability from the pre-trained foundation models. In order to avoid generalization degradation, the primary challenge of this task lies in synthesizing diverse surrogate images driven by text prompts. Since not only category concepts but also style information are encoded in text prompts, we propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles, namely Mix-Prompt, Random-Prompt, and Contrastive-Prompt. Experiments on out-of-distribution generalization datasets demonstrate the effectiveness of the proposed methods, with Contrastive-Prompt performing the best.

Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

  • Yanzhe Chen
  • Huasong Zhong
  • Xiangteng He
  • Yuxin Peng
  • Lele Cheng

In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multi-modal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

  • Dongsheng Xu
  • Wenye Zhao
  • Yi Cai
  • Qingbao Huang

Text-based image captioning is a vital but under-explored task, which aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. Firstly, current models cannot capture and generate scene text in non-Latin script languages, which severely limits the objectivity and information completeness of the generated captions. Secondly, current models tend to describe images in a monotonous and templated style, which greatly limits the diversity of the generated captions. Although the above-mentioned issues can be alleviated through carefully designed annotations, this process is undoubtedly laborious and time-consuming. To address these issues, we propose a Zero-shot Framework for Text-based Image Captioning (Zero-TextCap). Concretely, to generate candidate sentences starting from the prompt 'Image of' and iteratively refine them to improve the quality and diversity of captions, we introduce a Hybrid-sampling masked language model (H-MLM). To read multi-lingual scene text and model the relationships between them, we introduce a robust OCR system. To ensure that the captions generated by H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module to insert OCR tokens and filter candidate sentences. Our Zero-TextCap is capable of generating captions containing multi-lingual scene text and boosting the diversity of captions. Sufficient experiments demonstrate the effectiveness of our proposed Zero-TextCap. Our codes are available at https://github.com/Gemhuang79/Zero_TextCap.
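The CLIP-based filtering step can be illustrated by ranking candidate captions with cosine similarity to the image embedding, assuming embeddings from a CLIP-like encoder are computed elsewhere; this is a sketch, not the released Zero-TextCap module.

```python
import torch
import torch.nn.functional as F

def filter_candidates(image_emb: torch.Tensor, caption_embs: torch.Tensor,
                      captions: list[str], keep: int = 5) -> list[str]:
    """Rank candidate captions by similarity to the image (illustrative sketch).

    image_emb:    (D,)   image embedding from a CLIP-like model.
    caption_embs: (N, D) embeddings of the N candidate captions.
    Returns the `keep` captions most similar to the image.
    """
    sims = F.cosine_similarity(caption_embs, image_emb.unsqueeze(0), dim=-1)
    idx = sims.topk(min(keep, len(captions))).indices.tolist()
    return [captions[i] for i in idx]
```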

Adversarial Training of Deep Neural Networks Guided by Texture and Structural Information

  • Zhaoxin Wang
  • Handing Wang
  • Cong Tian
  • Yaochu Jin

Adversarial training (AT) is one of the most effective ways for deep neural network models to resist adversarial examples. However, there is still a significant gap between robust training accuracy and testing accuracy. Although recent studies have shown that data augmentation can effectively reduce this gap, most methods heavily rely on generating large amounts of training data without considering which features are beneficial for model robustness, making them inefficient. To address this issue, we propose a two-stage AT algorithm for image data that adopts different data augmentation strategies during the training process to improve model robustness. In the first stage, we focus on the convergence of the algorithm, using structure and texture information to guide AT. In the second stage, we introduce a strategy that randomly fuses data features to generate diverse adversarial examples for AT. We compare our proposed algorithm with five state-of-the-art algorithms on three models; our method achieves the best robust accuracy under all evaluation metrics on the CIFAR10 dataset, demonstrating its superiority.
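For reference, adversarial examples for AT are commonly crafted with projected gradient descent (PGD); the generic sketch below shows such an attack, while the texture/structure-guided augmentation that distinguishes the proposed two-stage algorithm is not reproduced here.

```python
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Generic PGD attack used to craft training examples for adversarial
    training (a standard sketch, not the paper's augmentation strategy).

    x: (B, C, H, W) images in [0, 1]; y: (B,) labels.
    """
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # ascend the loss, then project back into the eps-ball and valid range
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```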

TeViS: Translating Text Synopses to Video Storyboards

  • Xu Gu
  • Yuchong Sun
  • Feiyue Ni
  • Shizhe Chen
  • Xihua Wang
  • Ruihua Song
  • Boyuan Li
  • Xiang Cao

A video storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards, however, remains challenging, as it not only requires cross-modal association between high-level texts and images but also demands long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis. We construct a MovieNet-TeViS dataset based on the public MovieNet dataset [17]. It contains 10K text synopses, each paired with keyframes manually selected from corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel VQ-Trans model. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. Then, it auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work. The code and data are available at: https://ruc-aimind.github.io/projects/TeViS/

Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

  • Nan Xi
  • Jingjing Meng
  • Junsong Yuan