MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

Alexa, let's work together! How Alexa Helps Customers Complete Tasks with Verbal and Visual Guidance in the Alexa Prize TaskBot Challenge

  • Yoelle Maarek

In this talk, I will present the Alexa Prize TaskBot Challenge, which allows selected academic teams to develop TaskBots. TaskBots are agents that interact with Alexa users who require assistance (via "Alexa, let's work together") to complete everyday tasks requiring multiple steps and decisions, such as cooking and home improvement. One of the unique elements of this challenge is its multi-modal nature, where users receive both verbal guidance and visual instructions when a screen is available (e.g., on Echo Show devices). Some of the hard AI challenges the teams addressed included leveraging domain knowledge, tracking dialogue state, supporting adaptive and robust conversations, and, probably most relevant to this conference, handling multi-modal interactions.

Data Science against COVID-19: The Valencian Experience

  • Nuria Oliver

This invited talk describes the work that a multi-disciplinary team of 20+ volunteer scientists did between March of 2020 and April of 2022, working very closely with the Presidency of the Valencian Government to support its decision-making during the COVID-19 pandemic in Spain. This team was known as the Data Science against COVID-19 taskforce. The team's work was structured in four areas: (1) large-scale human mobility modeling; (2) development of computational epidemiological models (metapopulation, individual and LSTM-based models); (3) development of predictive models of hospital and intensive care units' occupancy; and (4) a large-scale online citizen survey called the COVID19impactsurvey, with over 720,000 answers worldwide. This survey enabled us to shed light on the impact that the pandemic had on people's lives during the period of study [3,4,5]. In the talk, I will present the results obtained in each of these four areas, including winning the 500K XPRIZE Pandemic Response Challenge [1] and obtaining a best paper award at ECML-PKDD 2021 [2]. I will share the lessons learned in this very special initiative of collaboration between the civil society at large (through the citizen survey), the scientific community (through the Data Science against COVID-19 taskforce) and a public administration (through our collaboration with the Presidency of the Valencian Government). For those interested in knowing more about this initiative, WIRED magazine published an extensive article describing the story of this effort:

Grounding, Meaning and Foundation Models: Adventures in Multimodal Machine Learning

  • Douwe Kiela

In this talk I will present a vision for acquiring perceptually grounded meaning in machines, as a key next challenge for natural language processing. I will cover some recent work that tries to improve how we do model evaluation in multimodal settings, focusing on the new Adversarial VQA and Winoground evaluation datasets. After that, I will talk about our latest large-scale vision and language "foundation model", called FLAVA: a single holistic universal transformer that targets all modalities at once and that shows impressive performance on a wide range of tasks.

SESSION: Oral Session I: Engaging Users with Multimedia -- Emotional and Social Signals

A Multi-view Spectral-Spatial-Temporal Masked Autoencoder for Decoding Emotions with Self-supervised Learning

  • Rui Li
  • Yiting Wang
  • Wei-Long Zheng
  • Bao-Liang Lu

Affective brain-computer interfaces have achieved considerable advances: researchers can successfully interpret labeled and flawless EEG data collected in laboratory settings. However, the annotation of EEG data is time-consuming and requires a vast workforce, which limits applications in practical scenarios. Furthermore, daily collected EEG data may be partially damaged since EEG signals are sensitive to noise. In this paper, we propose a Multi-view Spectral-Spatial-Temporal Masked Autoencoder (MV-SSTMA) with self-supervised learning to tackle these challenges towards daily applications. The MV-SSTMA is based on a multi-view CNN-Transformer hybrid structure, interpreting the emotion-related knowledge of EEG signals from spectral, spatial, and temporal perspectives. Our model consists of three stages: 1) in the generalized pre-training stage, channels of unlabeled EEG data from all subjects are randomly masked and later reconstructed to learn generic representations of EEG data; 2) in the personalized calibration stage, only a few labeled samples from a specific subject are used to calibrate the model; 3) in the personal test stage, our model can decode personal emotions from intact EEG data as well as damaged data with missing channels. Extensive experiments on two open emotional EEG datasets demonstrate that our proposed model achieves state-of-the-art performance on emotion recognition. In addition, under the abnormal circumstance of missing channels, the proposed model can still effectively recognize emotions.
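The channel-masking pre-training of stage 1 can be sketched as follows. This is a toy illustration, not the authors' code: an EEG segment is represented as a list of channels (each a list of samples), whole channels are randomly zeroed out, and the reconstruction loss is computed only over the masked channels, as in masked autoencoding.

```python
import random

def mask_channels(eeg, mask_ratio=0.5, seed=0):
    """Randomly zero out whole channels; return the masked copy and mask indices."""
    rng = random.Random(seed)
    n_ch = len(eeg)
    n_mask = max(1, int(n_ch * mask_ratio))
    masked_idx = set(rng.sample(range(n_ch), n_mask))
    masked = [[0.0] * len(ch) if i in masked_idx else list(ch)
              for i, ch in enumerate(eeg)]
    return masked, masked_idx

def reconstruction_loss(pred, target, masked_idx):
    """MSE computed only over the masked channels."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In the actual model a CNN-Transformer decoder produces the reconstruction; here a perfect reconstruction simply yields zero loss.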

Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

  • Teng Sun
  • Wenjie Wang
  • Liqiang Jing
  • Yiran Cui
  • Xuemeng Song
  • Liqiang Nie

Existing studies on multimodal sentiment analysis heavily rely on the textual modality and unavoidably induce spurious correlations between textual words and sentiment labels, which greatly hinders model generalization. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis, which aims to estimate and mitigate the adverse effect of the textual modality for strong OOD generalization. To this end, we embrace causal inference, which inspects the causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributed to the direct effect of the textual modality on the model prediction, while the indirect effect is more reliable since it considers multimodal semantics. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of the textual modality via an extra text model and estimates the indirect effect with a multimodal model. During inference, we first estimate the direct effect by counterfactual inference, and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments show the superior effectiveness and generalization ability of our proposed framework.
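The inference step described above (subtracting the text-only direct effect from the total effect of all modalities) can be sketched as below. This is a minimal illustration; `alpha`, a scaling factor on the subtracted effect, is a hypothetical parameter not named in the abstract.

```python
def debiased_prediction(multimodal_logits, text_only_logits, alpha=1.0):
    """Subtract the text-only (direct) effect from the total effect,
    keeping the indirect, multimodal-grounded effect for prediction."""
    return [m - alpha * t for m, t in zip(multimodal_logits, text_only_logits)]
```

A toy run shows how debiasing can flip a prediction dominated by text bias: with total-effect logits `[2.0, 1.0]` and text-only logits `[1.5, 0.0]`, the debiased logits are `[0.5, 1.0]`, so the predicted class changes from 0 to 1.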

MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild

  • Yuanyuan Liu
  • Wei Dai
  • Chuanxu Feng
  • Wenbin Wang
  • Guanghao Yin
  • Jiabei Zeng
  • Shiguang Shan

Dynamic facial expression recognition (FER) databases provide important data support for affective computing and applications. However, most FER databases are annotated with several basic, mutually exclusive emotional categories and contain only one modality, e.g., videos. The monotonous labels and modality cannot accurately imitate human emotions or fulfill real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure the high quality of the labels, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm, obtaining 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotion annotations and emotion-related captions. We also propose a novel Transformer-based expression snippet feature learning method that recognizes compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available from

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition

  • Shengzhe Liu
  • Xin Zhang
  • Jufeng Yang

With the popularity of instant messaging applications, online chatting plays an essential role in our daily life. The prevailing use of stickers to express emotions in online chatting leads to the necessity of multimodal sticker emotion recognition. Considering the lack of sticker emotion data, we collect a large-scale sticker emotion recognition dataset named SER30K. It consists of 1,887 sticker themes totaling 30,739 sticker images. Some commonly used images, such as realistic images and facial expression images, have been well studied in the field of emotion analysis. However, it is still challenging to understand the emotion of sticker images. Since the characteristics of stickers from the same theme are similar, we can only accurately predict the emotion by capturing the local information (e.g., expressions, poses) and understanding the global information (e.g., relations among objects). To tackle this challenge, we propose a LOcal Re-Attention multimodal network (LORA) to learn sticker emotions in an end-to-end manner. Different from previous approaches using convolutional neural networks, LORA employs a vision transformer to extract visual features, which better captures global relations. In addition, we design a local re-attention module to focus on important region information. Then a simple but efficient modal fusion module combines visual and language features. Extensive experiments are performed on SER30K and other emotion recognition datasets, demonstrating the effectiveness of our proposed method. Our code, model and dataset are released on

SESSION: Poster Session I: Engaging Users with Multimedia -- Emotional and Social Signals

Representation Learning through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

  • Jicai Pan
  • Shangfei Wang
  • Lin Fang

Although temporal patterns inherent in visual and audio signals are crucial for affective video content analysis, they have not been thoroughly explored yet. In this paper, we propose a novel Temporal-Aware Multimodal (TAM) method to fully capture the temporal information. Specifically, we design a cross-temporal multimodal fusion module that applies attention-based fusion to different modalities within and across video segments. As a result, it fully captures the temporal relations between different modalities. Furthermore, a single emotion label provides insufficient supervision for learning the representation of each segment, making temporal pattern mining difficult. We leverage time-synchronized comments (TSCs) as auxiliary supervision, since these comments are easily accessible and contain rich emotional cues. Two TSC-based self-supervised tasks are designed: the first aims to predict the emotional words in a TSC from the video representation and the TSC's contextual semantics, and the second predicts the segment in which the TSC appears by calculating the correlation between the video representation and the TSC embedding. These self-supervised tasks are used to pre-train the cross-temporal multimodal fusion module on a large-scale video-TSC dataset crawled from the web without labeling costs. The pre-training tasks prompt the fusion module to perform representation learning on segments that include TSCs, thus capturing more temporal affective patterns. Experimental results on three benchmark datasets show that the proposed fusion module achieves state-of-the-art results in affective video content analysis. Ablation studies verify that after TSC-based pre-training, the fusion module learns the affective patterns of more segments and achieves better performance.
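The second self-supervised task, locating the segment a TSC belongs to via representation correlation, might look like the toy sketch below; cosine similarity stands in for the learned correlation, and the names `segment_feats` and `tsc_embedding` are assumptions for illustration.

```python
def locate_tsc_segment(segment_feats, tsc_embedding):
    """Predict which segment a time-sync comment belongs to by picking the
    segment whose representation correlates most with the TSC embedding."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return num / (na * nb + 1e-8)
    sims = [cos(f, tsc_embedding) for f in segment_feats]
    return max(range(len(sims)), key=sims.__getitem__)
```

During pre-training, a cross-entropy loss over these similarities would push the matching segment's representation toward the TSC embedding.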

TFF-Former: Temporal-Frequency Fusion Transformer for Zero-training Decoding of Two BCI Tasks

  • Xujin Li
  • Wei Wei
  • Shuang Qiu
  • Huiguang He

Brain-computer interface (BCI) systems provide a direct connection between the human brain and external devices. Visual evoked BCI systems including Event-related Potential (ERP) and Steady-state Visual Evoked Potential (SSVEP) have attracted extensive attention because of their strong brain responses and wide applications. Previous studies have made some breakthroughs in within-subject decoding algorithms for specific tasks. However, there are two challenges in current decoding algorithms in BCI systems. Firstly, current decoding algorithms cannot accurately classify EEG signals from a new subject without that subject's data, yet the calibration procedure needed to collect such data is time-consuming. Secondly, algorithms are tailored to extract features for one specific task, which limits their application across tasks. In this study, we propose a Temporal-Frequency Fusion Transformer (TFF-Former) for zero-training decoding across two BCI tasks. EEG data are organized into temporal-spatial and frequency-spatial forms, which can be considered as two views. In the TFF-Former framework, two symmetrical Transformer streams are designed to extract view-specific features. A cross-view module based on the cross-attention mechanism is proposed to guide each stream to strengthen common representations of features across EEG views. Additionally, an attention-based fusion module is built to fuse the representations from the two views effectively. A mean mask mechanism is applied to adaptively reduce the aggregation of redundant EEG tokens for the integration of common representations. We validated our method on a self-collected RSVP dataset and a benchmark SSVEP dataset. Experimental results demonstrate that our TFF-Former model achieves competitive performance compared with models in each of the above paradigms, which can further promote the application of visual evoked EEG-based BCI systems.
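A single-head version of the cross-view attention, where tokens of one view attend to tokens of the other view, can be sketched as below. This is a plain-Python toy without the learned query/key/value projections the TFF-Former would use.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """One head of cross-attention: tokens of one view (queries) attend to
    tokens of the other view (keys/values), mixing in shared information."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

When all keys score equally, the attention weights are uniform and each output token is simply the mean of the other view's values.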

Towards Unbiased Visual Emotion Recognition via Causal Intervention

  • Yuedong Chen
  • Xu Yang
  • Tat-Jen Cham
  • Jianfei Cai

Although much progress has been made in visual emotion recognition, researchers have realized that modern deep networks tend to exploit dataset characteristics to learn spurious statistical associations between the input and the target. Such dataset characteristics are usually treated as dataset bias, which damages the robustness and generalization performance of these recognition systems. In this work, we scrutinize this problem from the perspective of causal inference, where such a dataset characteristic is termed a confounder, which misleads the system into learning the spurious correlation. To alleviate the negative effects brought by the dataset bias, we propose a novel Interventional Emotion Recognition Network (IERN) to achieve the backdoor adjustment, a fundamental deconfounding technique in causal inference. Specifically, IERN starts by disentangling the dataset-related context feature from the actual emotion feature, where the former forms the confounder. The emotion feature is then forced to see each confounder stratum equally before being fed into the classifier. A series of designed tests validate the efficacy of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms state-of-the-art approaches for unbiased visual emotion recognition.
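The backdoor adjustment that IERN approximates, forcing the emotion feature to "see each confounder stratum equally", amounts to averaging the classifier's prediction over confounder strata weighted by their priors: P(Y|do(X)) = Σ_c P(Y|X, c) P(c). A minimal sketch under that reading (the function names are hypothetical):

```python
def backdoor_adjusted_probs(classifier, feat, strata, priors):
    """Backdoor adjustment: average the classifier's prediction over every
    confounder stratum c, weighted by the stratum prior P(c)."""
    out = None
    for c, p_c in zip(strata, priors):
        probs = classifier(feat, c)  # P(Y | X=feat, c)
        if out is None:
            out = [p_c * p for p in probs]
        else:
            out = [o + p_c * p for o, p in zip(out, probs)]
    return out
```

If a classifier's output swings with the stratum, the adjustment cancels the stratum-specific (confounded) part of the prediction.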

Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation

  • Michal Balazia
  • Philipp Müller
  • Ákos Levente Tánczos
  • August von Liechtenstein
  • François Brémond

Body language is an eye-catching social signal and its automatic analysis can significantly advance artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress in low-level tasks like head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper we present BBSI, the first set of annotations of complex Bodily Behaviors embedded in continuous Social Interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatial-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module and Swin Transformer. Results are promising and indicate considerable room for improvement in this difficult task. Representing a key piece in the puzzle towards automatic understanding of social behavior, BBSI is fully available to the research community.

Learning from Label Relationships in Human Affect

  • Niki Maria Foteinopoulou
  • Ioannis Patras

Automated estimation of human affect and mental state faces a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints), and (very) long in-the-wild videos. For these reasons, deep learning methodologies tend to overfit, that is, arrive at latent representations with poor generalisation performance on the final regression task. To overcome this, in this work, we introduce two complementary contributions. First, we introduce a novel relational loss for multilabel regression and ordinal problems that regularises learning and leads to better generalisation. The proposed loss uses label vector inter-relational information to learn better latent representations by aligning batch label distances to the distances in the latent feature space. Second, we utilise a two-stage attention architecture that estimates a target for each clip by using features from the neighbouring clips as temporal context. We evaluate the proposed methodology on both continuous affect and schizophrenia severity estimation problems, as there are methodological and contextual parallels between the two. Experimental results demonstrate that the proposed methodology outperforms baselines trained with the supervised regression loss, as well as pre-training the network architecture with an unsupervised contrastive loss. In the domain of schizophrenia, the proposed methodology outperforms the previous state-of-the-art by a large margin, achieving a PCC of up to 78%, close to that of human experts (85%) and much higher than previous works (an uplift of up to 40%). In the case of affect recognition, we outperform previous vision-based methods in terms of CCC on both the OMG and the AMIGOS datasets. Specifically, for AMIGOS we outperform the previous SoTA CCC for both arousal and valence by 9% and 13% respectively, and on the OMG dataset we outperform previous vision works by up to 5% for both arousal and valence.
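The relational loss, aligning pairwise latent distances with pairwise label distances within a batch, can be sketched as follows. This toy version uses Euclidean distances for both spaces; the paper's exact distance and normalisation may differ.

```python
def relational_loss(features, labels):
    """Penalise the mismatch between pairwise distances in the latent
    feature space and pairwise distances between label vectors."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    loss, n = 0.0, 0
    m = len(features)
    for i in range(m):
        for j in range(i + 1, m):
            loss += (dist(features[i], features[j])
                     - dist(labels[i], labels[j])) ** 2
            n += 1
    return loss / n
```

The loss is zero exactly when the latent geometry mirrors the label geometry, which is the regularisation effect the paper describes.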

Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

  • Ziyi Ye
  • Xiaohui Xie
  • Yiqun Liu
  • Zhihong Wang
  • Xuesong Chen
  • Min Zhang
  • Shaoping Ma

With the growth of information on the Web, most users heavily rely on information access systems (e.g., search engines, recommender systems, etc.) in their daily lives. During this procedure, modeling users' satisfaction status plays an essential part in improving their experiences with the systems. In this paper, we aim to explore the benefits of using Electroencephalography (EEG) signals for satisfaction modeling in interactive information access system design. Different from existing EEG classification tasks, the emergence of satisfaction involves multiple brain functions, such as arousal, prototypicality, and appraisals, which are related to different topographical areas of the brain. Modeling user satisfaction therefore raises great challenges for existing solutions. To address this challenge, we propose BTA, a Brain Topography Adaptive network with a multi-centrality encoding module and a spatial attention mechanism to capture cognitive connectivity at different spatial distances. We explore the effectiveness of BTA for satisfaction modeling in two popular information access scenarios, i.e., search and recommendation. Extensive experiments on two real-world datasets verify the effectiveness of introducing a brain topography adaptive strategy into satisfaction modeling. Furthermore, we also conduct a search result re-ranking task and a video rating prediction task based on the satisfaction inferred from brain signals in the search and recommendation scenarios, respectively. Experimental results show that brain signals extracted with BTA help improve the performance of interactive information access systems significantly.

DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos

  • Yan Wang
  • Yixuan Sun
  • Wei Song
  • Shuyong Gao
  • Yiwen Huang
  • Zhaoyu Chen
  • Weifeng Ge
  • Wenqiang Zhang

Current work on facial expression learning in videos consumes significant computational resources to learn spatial-channel feature representations and temporal relationships. To mitigate this issue, we propose a Dual Path multi-excitation Collaborative Network (DPCNet) to learn the critical information for facial expression representation from fewer keyframes in videos. Specifically, the DPCNet learns the important regions and keyframes from a tuple of four view-grouped frames by multi-excitation modules and produces dual-path representations of one video with consistency under two regularization strategies. A spatial-frame excitation module and a channel-temporal aggregation module are introduced consecutively to learn spatial-frame representation and generate complementary channel-temporal aggregation, respectively. Moreover, we design a multi-frame regularization loss to enforce the representation of multiple frames in the dual view to be semantically coherent. To obtain consistent prediction probabilities from the dual path, we further propose a dual path regularization loss, aiming to minimize the divergence between the distributions of the two-path embeddings. Extensive experiments and ablation studies show that the DPCNet can significantly improve the performance of video-based FER and achieve state-of-the-art results on the large-scale DFEW dataset.
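The dual path regularization loss, minimizing the divergence between the two paths' predicted distributions, could be instantiated with a symmetric KL divergence. This is one plausible choice for illustration; the abstract does not pin down the exact divergence used.

```python
import math

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two predicted distributions,
    usable as a consistency regulariser between two network paths."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))
```

The regulariser vanishes when both paths agree and grows as their predictions diverge, pushing the two views toward consistent outputs.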

Pursuing Knowledge Consistency: Supervised Hierarchical Contrastive Learning for Facial Action Unit Recognition

  • Yingjie Chen
  • Chong Chen
  • Xiao Luo
  • Jianqiang Huang
  • Xian-Sheng Hua
  • Tao Wang
  • Yun Liang

With the increasing need for emotion analysis, facial action unit (AU) recognition has attracted much more attention as a fundamental task for affective computing. Although deep learning has boosted the performance of AU recognition to a new level in recent years, it remains challenging to extract subject-consistent representations since the appearance changes caused by AUs are subtle and ambiguous among subjects. We observe that there are three kinds of inherent relations among AUs, which can be treated as strong prior knowledge, and pursuing the consistency of such knowledge is the key to learning subject-consistent representations. To this end, we propose a supervised hierarchical contrastive learning method (SupHCL) for AU recognition to pursue knowledge consistency among different facial images and different AUs, which is orthogonal to methods focusing on network architecture design. Specifically, SupHCL contains three relation consistency modules, i.e., unary, binary, and multivariate relation consistency modules, which take the corresponding kind of inherent relations as extra supervision to encourage knowledge-consistent distributions of both AU-level and image-level representations. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, demonstrate the effectiveness of each relation consistency module and the superiority of SupHCL.

Unsupervised Domain Adaptation Integrating Transformer and Mutual Information for Cross-Corpus Speech Emotion Recognition

  • Shiqing Zhang
  • Ruixin Liu
  • Yijiao Yang
  • Xiaoming Zhao
  • Jun Yu

This paper focuses on an interesting task, i.e., unsupervised cross-corpus Speech Emotion Recognition (SER), in which the labelled training (source) corpus and the unlabelled testing (target) corpus have different feature distributions, resulting in a discrepancy between the source and target domains. To address this issue, this paper proposes an unsupervised domain adaptation method integrating Transformers and Mutual Information (MI) for cross-corpus SER. Initially, our method employs encoder layers of Transformers to capture long-term temporal dynamics in an utterance from the extracted segment-level log-Mel spectrogram features, thereby producing the corresponding utterance-level features for each utterance in the two domains. Then, we propose an unsupervised feature decomposition method with a hybrid Max-Min MI strategy to separately learn domain-invariant features and domain-specific features from the extracted mixed utterance-level features, in which the discrepancy between the two domains is eliminated as much as possible while their individual characteristics are preserved. Finally, an interactive Multi-Head attention fusion strategy is designed to learn the complementarity between domain-invariant features and domain-specific features so that they can be interactively fused for SER. Extensive experiments on the IEMOCAP and MSP-Improv datasets demonstrate the effectiveness of our proposed method on unsupervised cross-corpus SER tasks, outperforming state-of-the-art unsupervised cross-corpus SER methods.

Co-Completion for Occluded Facial Expression Recognition

  • Zhen Xing
  • Weimin Tan
  • Ruian He
  • Yangle Lin
  • Bo Yan

The existence of occlusions brings in semantically irrelevant visual patterns and leads to the content loss of occluded regions. Although previous works have made improvements on occluded facial expression recognition, they do not explicitly handle the interference factors mentioned above. In this paper, we propose an intuitive and simplified workflow, Co-Completion, which combines occlusion discarding and feature completion to reduce the impact of occlusions on facial expression recognition. To protect key features from being contaminated and reduce the dependency of feature completion on occlusion discarding, guidance from discriminative regions is also introduced for joint feature completion. Moreover, we release the COO-RW database for occlusion simulation and refine the occlusion generation protocol for fair comparison in this field. Experiments on synthetic and realistic databases demonstrate the superiority of our method. The COO-RW database can be downloaded from

Generalized Inter-class Loss for Gait Recognition

  • Weichen Yu
  • Hongyuan Yu
  • Yan Huang
  • Liang Wang

Gait recognition is a unique biometric technique that can be performed at a long distance non-cooperatively and has broad applications in public safety and intelligent traffic systems. Previous gait works focus more on minimizing the intra-class variance while ignoring the significance of constraining inter-class variance. To this end, we propose a generalized inter-class loss that addresses inter-class variance at both the sample-level and class-level feature distributions. Instead of an equal penalty strength on pair scores, the proposed loss optimizes the sample-level inter-class feature distribution by dynamically adjusting the pairwise weight. Further, at the class level, the proposed loss adds a constraint on the uniformity of the inter-class feature distribution, which forces the feature representations to approximate a hypersphere and keep maximal inter-class variance. In addition, the proposed method automatically adjusts the margin between classes, which makes the inter-class feature distribution more flexible. The proposed method can be generalized to different gait recognition networks and achieves significant improvements. We conduct a series of experiments on CASIA-B and OUMVLP, and the experimental results show that the proposed loss significantly improves performance, achieving state-of-the-art results.
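The uniformity constraint on the class-level feature distribution can be illustrated with a Wang-and-Isola-style uniformity term, which is minimized when features spread maximally apart on the hypersphere. This is a sketch under that assumption, not necessarily the paper's exact formulation.

```python
import math

def uniformity_loss(feats, t=2.0):
    """Uniformity term: log of the mean Gaussian-kernel similarity over all
    feature pairs; small when features spread apart, large when they cluster."""
    n = len(feats)
    vals = []
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
            vals.append(math.exp(-t * d2))
    return math.log(sum(vals) / len(vals))
```

Two antipodal unit vectors (distance squared 4) give the loss its minimum for a pair, illustrating the "maximal inter-class variance" behaviour the abstract describes.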

Feeling Without Sharing: A Federated Video Emotion Recognition Framework Via Privacy-Agnostic Hybrid Aggregation

  • Fan Qi
  • Zixin Zhang
  • Xianshan Yang
  • Huaiwen Zhang
  • Changsheng Xu

The explosion of video data brings new opportunities and challenges for emotion recognition. Video emotion applications have great commercial value, but the potential to involve illegal snooping on personal feelings has led to controversy over privacy protection. The federated learning (FL) paradigm can substantially address the growing public concerns about data privacy in video emotion recognition. However, conventional FL methods perform poorly due to the uniqueness of the task: the data are heterogeneous across clients, induced by emotional label skew and cross-culture expression differences. To mitigate the heterogeneous data, we propose EmoFed, a practical framework for federated video-based emotion recognition via multi-group clustering and privacy-agnostic hybrid aggregation. It yields a generically applicable and improved model while protecting privacy, training local models under group-aware personalized aggregation. To further encourage communicating comprehensive and privacy-agnostic information among clients, we upload model parameters of both the global layers and personalization layers to the server. We utilize homomorphic encryption for the personalization layers, which incurs no loss of learning accuracy since no noise is added to the model updates during the encryption/decryption process. The proposed method works on video-based emotion recognition tasks to predict both actors' emotional expressions and the emotions induced in viewers. Extensive experiments and ablation studies on four benchmarks have demonstrated the efficacy and practicability of our method.
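The hybrid aggregation idea, averaging shared global layers across clients while each client keeps its own personalization layers, can be sketched as below. Models are plain dicts of parameter lists; the homomorphic encryption EmoFed applies to personalization layers before upload is omitted from this toy.

```python
def hybrid_aggregate(client_models, global_keys):
    """FedAvg-style aggregation restricted to the shared global layers.
    Personalization layers stay client-specific (in EmoFed they would be
    uploaded under homomorphic encryption; not modeled here)."""
    n = len(client_models)
    avg = {k: [sum(m[k][i] for m in client_models) / n
               for i in range(len(client_models[0][k]))]
           for k in global_keys}
    new_models = []
    for m in client_models:
        updated = dict(m)
        updated.update(avg)  # overwrite only the global layers
        new_models.append(updated)
    return new_models
```

After a round, all clients share identical global layers while their personalization layers remain untouched.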

Self-Paced Label Distribution Learning for In-The-Wild Facial Expression Recognition

  • Jianjian Shao
  • Zhenqian Wu
  • Yuanyan Luo
  • Shudong Huang
  • Xiaorong Pu
  • Yazhou Ren

Label distribution learning (LDL) has achieved great progress in facial expression recognition (FER), where generating the label distribution is a key procedure for LDL-based FER. However, much existing research has shown a common problem with noisy samples in FER, especially on in-the-wild datasets. This issue may lead to generating unreliable label distributions (which can be seen as label noise) and further negatively affect the FER model. To this end, we propose a plug-and-play method of self-paced label distribution learning (SPLDL) for in-the-wild FER. Specifically, a simple yet efficient label distribution generator is adopted to generate label distributions to guide label distribution learning. We then introduce the self-paced learning (SPL) paradigm and develop a novel self-paced label distribution learning strategy, which considers both classification losses and distribution losses. SPLDL first learns easy samples with reliable label distributions and gradually steps to complex ones, effectively suppressing the negative impact introduced by noisy samples and unreliable label distributions. Extensive experiments on in-the-wild FER datasets (i.e., RAF-DB and AffectNet) based on three backbone networks demonstrate the effectiveness of the proposed method.
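The self-paced schedule, learning easy samples first and gradually admitting harder ones, reduces to weighting each sample by its current loss against a growing age parameter. A hard-weighting sketch is below; the paper combines classification and distribution losses, while a single per-sample loss stands in here.

```python
def self_paced_weights(losses, lam):
    """Hard self-paced weighting: keep samples whose loss is below the age
    parameter lam. As lam grows across epochs, harder (noisier) samples
    gradually enter training."""
    return [1.0 if l < lam else 0.0 for l in losses]

def spl_objective(losses, lam):
    """Mean loss over the currently admitted (easy) samples."""
    w = self_paced_weights(losses, lam)
    kept = sum(w)
    return sum(wi * li for wi, li in zip(w, losses)) / max(1.0, kept)
```

With a small `lam`, a high-loss (likely noisy) sample is excluded; raising `lam` later admits it once the model is more robust.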

Uncertainty-Aware Semi-Supervised Learning of 3D Face Rigging from Single Image

  • Yong Zhao
  • Haifeng Chen
  • Hichem Sahli
  • Ke Lu
  • Dongmei Jiang

We present a method to rig 3D faces via Action Units (AUs), viewpoint, and light direction from a single input image. Existing 3D methods for face synthesis and animation rely heavily on the 3D morphable model (3DMM), which is built on 3D data and cannot provide intuitive expression parameters, while AU-driven 2D methods cannot handle head pose and lighting effects. We bridge the gap by integrating a recent 3D reconstruction method with a 2D AU-driven method in a semi-supervised fashion. Built upon an auto-encoding 3D face reconstruction model that decouples depth, albedo, viewpoint, and light without any supervision, we further decouple expression from identity for depth and albedo with a novel conditional feature translation module and pretrained critics for AU intensity estimation and image classification. Novel objective functions are designed using unlabeled in-the-wild images and indoor images with AU labels. We also leverage uncertainty losses to model the potentially changing AU regions of images as input noise for synthesis, and to model the noisy AU intensity labels for the intensity estimation of the AU critic. Experiments with face editing and animation on four datasets show that, compared with six state-of-the-art methods, our proposed method is superior in terms of expression consistency, identity similarity, and pose similarity.

A Unified Framework against Topology and Class Imbalance

  • Junyu Chen
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

The Area Under the ROC Curve (AUC) is widely used as an evaluation metric in various applications. Owing to its insensitivity to class distribution, directly optimizing AUC performs well on class imbalance problems. However, existing AUC optimization methods are limited to regular data such as text, images, and video; AUC optimization on graph data, which is ubiquitous and important, is seldom studied. Unlike regular data, AUC optimization on graphs suffers not only from class imbalance but also from topology imbalance. To solve this complicated imbalance problem, we propose a unified topology-aware AUC optimization (TOPOAUC) framework, which simultaneously deals with the topology and class imbalance problems in graph learning. We develop a multi-class AUC optimization framework to deal with the class imbalance problem. With respect to topology imbalance, we propose a Topology-Aware Importance Learning mechanism (TAIL), which considers the topology of pairwise nodes and the different contributions of topology information to pairwise node neighbors. Extensive experiments on three real-world datasets demonstrate the effectiveness of our proposed method.
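Since AUC counts correctly ordered (positive, negative) pairs, it can be optimized directly through a differentiable pairwise surrogate. The following is a minimal sketch of such a surrogate, not the TOPOAUC formulation itself, which additionally weights pairs by topology via TAIL; the function name and margin are illustrative:

```python
import numpy as np

def pairwise_auc_loss(scores, labels, margin=1.0):
    """Squared-hinge surrogate for AUC: penalize every (positive, negative)
    pair whose score gap falls short of the margin."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    gaps = pos[:, None] - neg[None, :]          # all pos-neg score gaps
    return np.mean(np.maximum(0.0, margin - gaps) ** 2)

# A well-separated ranking incurs zero loss; a poor one is penalized.
print(pairwise_auc_loss([2.0, 1.5, 0.2], [1, 1, 0]))  # large gaps, zero loss
print(pairwise_auc_loss([0.9, 0.4, 0.6], [1, 1, 0]))  # one misordered pair
```

Minimizing this surrogate pushes every positive node's score above every negative node's score, which is exactly what makes the objective insensitive to how imbalanced the two classes are.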

Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning

  • Yang Yu
  • Dong Zhang
  • Shoushan Li

Multi-modal sentiment analysis (MSA) has become increasingly attractive in both academia and industry. Conventional studies normally require massive labeled data to train deep neural models. To alleviate this issue, we conduct few-shot MSA with only a small number of labeled samples. Inspired by the success of textual prompt-based fine-tuning (PF) approaches in few-shot scenarios, we introduce a multi-modal prompt-based fine-tuning (MPF) approach. To narrow the semantic gap between language and vision, we propose unified pre-training for multi-modal prompt-based fine-tuning (UP-MPF) with two stages. First, in the unified pre-training stage, we employ a simple and effective task to obtain coherent vision-language representations from fixed pre-trained language models (PLMs): predicting the rotation direction of the input image, with a prompt phrase provided concurrently as input. Second, in multi-modal prompt-based fine-tuning, we freeze the visual encoder to reduce the number of trainable parameters, which further facilitates few-shot MSA. Extensive experiments and analysis on three coarse-grained and three fine-grained MSA datasets demonstrate that UP-MPF outperforms state-of-the-art PF, MSA, and multi-modal pre-training approaches.

Temporal Sentiment Localization: Listen and Look in Untrimmed Videos

  • Zhicheng Zhang
  • Jufeng Yang

Video sentiment analysis aims to uncover the underlying attitudes of viewers and has a wide range of real-world applications. Existing works simply classify a video into a single sentiment category, ignoring the fact that sentiment in untrimmed videos may appear in multiple segments with varying lengths and unknown locations. To address this, we propose a challenging task, Temporal Sentiment Localization (TSL), which aims to find which parts of a video convey sentiment. To systematically investigate fully- and weakly-supervised settings for TSL, we first build a benchmark dataset named TSL-300, consisting of 300 videos with a total length of 1,291 minutes. Each video is labeled in two ways: frame-by-frame annotation for the fully-supervised setting, and single-frame annotation, i.e., only a single frame with strong sentiment labeled per segment, for the weakly-supervised setting. Due to the high cost of densely annotating a dataset, we propose TSL-Net, which employs single-frame supervision to localize sentiment in videos. In detail, we generate pseudo labels for unlabeled frames using a greedy search strategy, and fuse the affective features of the visual and audio modalities to predict the temporal sentiment distribution. A reverse mapping strategy is designed for feature fusion, and a contrastive loss is utilized to maintain consistency between the original feature and the reverse prediction. Extensive experiments show the superiority of our method over state-of-the-art approaches.

VigilanceNet: Decouple Intra- and Inter-Modality Learning for Multimodal Vigilance Estimation in RSVP-Based BCI

  • Xinyu Cheng
  • Wei Wei
  • Changde Du
  • Shuang Qiu
  • Sanli Tian
  • Xiaojun Ma
  • Huiguang He

Recently, brain-computer interface (BCI) technology has made impressive progress and been developed for many applications. Among these, the BCI system based on rapid serial visual presentation (RSVP) is a promising information detection technology. However, the effectiveness of RSVP is closely tied to the user's performance, which can be influenced by their vigilance level, so it is crucial to detect vigilance levels in RSVP-based BCI. In this paper, we conducted a long-term RSVP target detection experiment to collect electroencephalography (EEG) and electrooculogram (EOG) data at different vigilance levels. To estimate vigilance levels in RSVP-based BCI, we propose a multimodal method named VigilanceNet using EEG and EOG. First, we define multiplicative relationships over conventional EOG features, which better describe the relationships among EOG features, and design an outer-product embedding module to extract them. Second, we propose to decouple the learning of intra- and inter-modality information to improve multimodal learning. Specifically, for intra-modality, we introduce an intra-modality representation learning (intra-RL) method that obtains effective representations of each modality by letting each modality independently predict vigilance levels during multimodal training. For inter-modality, we employ a cross-modal Transformer based on cross-attention to capture the complementary information between EEG and EOG, attending only to inter-modality relations. Extensive experiments and ablation studies are conducted on the RSVP and SEED-VIG public datasets. The results demonstrate the effectiveness of the method in terms of regression error and correlation.

EASE: Robust Facial Expression Recognition via Emotion Ambiguity-SEnsitive Cooperative Networks

  • Lijuan Wang
  • Guoli Jia
  • Ning Jiang
  • Haiying Wu
  • Jufeng Yang

Facial Expression Recognition (FER) plays a crucial role in real-world applications. However, large-scale FER datasets collected in the wild usually contain noise. More importantly, due to the ambiguity of emotion, facial images with multiple emotions are hard to distinguish from those with noisy labels. It is therefore challenging to train a robust model for FER. To address this, we propose Emotion Ambiguity-SEnsitive cooperative networks (EASE), which contain two components. First, the ambiguity-sensitive learning module divides the training samples into three groups: samples with small losses in both networks are considered clean, and those with large losses are considered noisy. For the conflicting samples on which one network disagrees with the other, we distinguish samples conveying ambiguous emotions from those with noisy labels using the polarity cues of emotions. Here, we utilize KL divergence to optimize the networks, enabling them to pay attention to non-dominant emotions. The second part of EASE aims to enhance the diversity of the cooperative networks: as training proceeds, the networks tend to converge to a consensus, so we construct a penalty term based on the correlation between their features, which helps the networks learn diverse representations from the images. Extensive experiments on six popular facial expression datasets demonstrate that EASE outperforms state-of-the-art approaches.

Mimicking the Annotation Process for Recognizing the Micro Expressions

  • Bo-Kai Ruan
  • Ling Lo
  • Hong-Han Shuai
  • Wen-Huang Cheng

Micro-expression recognition (MER) has recently become a popular research topic due to its wide applications, e.g., movie rating and recognizing neurological disorders. By virtue of deep learning techniques, the performance of MER has been significantly improved, reaching unprecedented results. This paper proposes a novel architecture that mimics how expressions are annotated. Specifically, during the annotation process in several datasets, AU labels are first obtained with FACS, and expression labels are then decided based on combinations of the AU labels. Meanwhile, these AU labels describe either eye or mouth movements (mutually exclusive). Following this idea, we design a dual-branch structure with a new augmentation method to separately capture eye and mouth features and teach the model what the general expressions should be. Moreover, to adaptively fuse the area features for different expressions, we propose an Area Weighted Module to assign different weights to each region. Additionally, we set up an auxiliary task that aligns AU similarity scores to help our model further capture facial patterns with AU labels. The proposed approach outperforms other state-of-the-art methods in terms of accuracy on the CASME II and SAMM datasets. Moreover, we provide a new visualization approach to show the relationship between facial regions and AU features.

SESSION: Oral Session II: Engaging User with Multimedia -- Multimedia Search and Recommendation

Machine Unlearning for Image Retrieval: A Generative Scrubbing Approach

  • Peng-Fei Zhang
  • Guangdong Bai
  • Zi Huang
  • Xin-Shun Xu

Data owners have the right to request the deletion of their data from a machine learning (ML) model. In response, a naïve approach is to retrain the model on the original dataset excluding the data to forget, which is unrealistic as the required dataset may no longer be available and the retraining process is usually computationally expensive. To cope with this reality, machine unlearning has recently attracted much attention; it aims to enable data removal from a trained ML model in response to deletion requests, without retraining the model from scratch or requiring full access to the original training dataset. Existing unlearning methods mainly focus on conventional ML models, while unlearning deep neural network (DNN) based models remains underexplored, especially for those trained on large-scale datasets.

In this paper, we make the first attempt to realize data forgetting in deep models for image retrieval. Image retrieval aims to search for data relevant to a query according to similarity measures. Intuitively, unlearning a deep image retrieval model can be achieved by breaking down its ability to model similarity on the data to forget. To this end, we propose a generative scrubbing (GS) method that learns a generator to craft noisy data that manipulate the model weights. A novel framework is designed consisting of the generator and the target retrieval model, where a pair of coupled static and dynamic learning procedures are performed simultaneously. This learning strategy effectively enables the generated noisy data to fade the model's memory of the data to forget while retaining the information of the remaining data. Extensive experiments on three widely-used datasets verify the effectiveness of the proposed method.

Partially Relevant Video Retrieval

  • Jianfeng Dong
  • Xianke Chen
  • Minsong Zhang
  • Xun Yang
  • Shujie Chen
  • Xirong Li
  • Xun Wang

Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT, and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed and short, whilst the provided captions well describe the gist of the video content. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two aim to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames; clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used to improve video corpus moment retrieval.
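The MIL formulation can be made concrete: a video's relevance to a query is driven by its best-matching clip and best-matching frame rather than an average over the whole video. The following is an illustrative sketch with invented names and a simple weighted combination, not the MS-SL network itself:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def partial_relevance(query, clips, frames, alpha=0.5):
    """MIL-style video-query score: take the best-matching instance in
    each bag (clips, frames) instead of averaging over the whole video."""
    clip_sim = max(cosine(query, c) for c in clips)
    frame_sim = max(cosine(query, f) for f in frames)
    return alpha * clip_sim + (1 - alpha) * frame_sim

q = [1.0, 0.0]
clips = [[0.0, 1.0], [1.0, 0.1]]   # only the second clip matches the query
frames = [[0.5, 0.5], [1.0, 0.0]]  # the second frame matches exactly
print(partial_relevance(q, clips, frames))
```

Because the score is driven by the maximum over instances, an untrimmed video scores highly as long as one moment matches the query, which is exactly the partial-relevance behavior PRVR requires.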

From Abstract to Details: A Generative Multimodal Fusion Framework for Recommendation

  • Fangxiong Xiao
  • Lixi Deng
  • Jingjing Chen
  • Houye Ji
  • Xiaorui Yang
  • Zhuoye Ding
  • Bo Long

In E-commerce recommendation, Click-Through Rate (CTR) prediction has been extensively studied in both academia and industry to enhance user experience and platform revenue. At present, the most popular CTR prediction methods are concatenation-based models that represent items by simply merging multiple heterogeneous features, including ID, visual, and text features, into one large vector. As these heterogeneous modalities have rather different properties, directly concatenating them without mining their correlations and reducing redundancy is unlikely to achieve optimal fusion. Besides, concatenation-based models treat all modalities equally for each user, overlooking the fact that users pay unequal attention to different modalities when browsing items in real scenarios. To address these issues, this paper proposes a generative multimodal fusion framework (GMMF) for the CTR prediction task. To eliminate redundancy and strengthen the complementarity of multimodal features, GMMF generates new visual and text representations with a Difference-Set network (DSN); these representations do not overlap with the information conveyed by the ID embedding. Specifically, DSN maps the ID embedding into the visual and text modalities and depicts the differences between the modalities based on their properties. Besides, GMMF assigns unequal weights to the modalities with a Modal-Interest network (MIN) that models users' preferences over heterogeneous modalities; these weights reflect users' habits and interests. Finally, we conduct extensive experiments on both public and collected industrial datasets, and the results show that GMMF greatly improves performance and achieves state-of-the-art results.

Bi-directional Heterogeneous Graph Hashing towards Efficient Outfit Recommendation

  • Weili Guan
  • Xuemeng Song
  • Haoyu Zhang
  • Meng Liu
  • Chung-Hsing Yeh
  • Xiaojun Chang

Personalized outfit recommendation, which aims to recommend outfits to a given user according to his/her preferences, has gained increasing research attention due to its economic value. Nevertheless, the majority of existing methods focus on improving recommendation effectiveness while overlooking recommendation efficiency. Inspired by this, we devise a novel bi-directional heterogeneous graph hashing scheme, called BiHGH, for efficient personalized outfit recommendation. This scheme consists of three key components: heterogeneous graph node initialization, bi-directional sequential graph convolution, and hash code learning. We first unify four types of entities (i.e., users, outfits, items, and attributes) and their relations via a heterogeneous four-partite graph. To perform graph learning, we then devise a bi-directional graph convolution algorithm that sequentially transfers knowledge by repeating upwards and downwards convolutions, whereby we divide the four-partite graph into three subgraphs, each involving only two adjacent entity types. We ultimately adopt the Bayesian personalized ranking loss for user preference learning and design a dual similarity preserving regularization to prevent information loss during hash learning. Extensive experiments on the benchmark dataset demonstrate the superiority of BiHGH.

Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning

  • Meiyu Liang
  • Junping Du
  • Xiaowen Cao
  • Yang Yu
  • Kangkang Lu
  • Zhe Xue
  • Min Zhang

Deep cross-media hashing provides an efficient cross-media representation learning solution for cross-media search. However, existing methods do not consider both fine-grained semantic features and semantic structures when mining implicit cross-media semantic associations, which leads to weaker semantic discrimination and consistency in cross-media representations. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). First, to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, so that the learned salient features of different modalities are more conducive to cross-media semantic alignment and fusion. Second, to further improve the learning of implicit cross-media semantic associations, a semantic label association graph is constructed and a graph convolutional network is utilized to mine implicit semantic structures, thus guiding the learning of discriminative features for the different modalities. Third, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of the modal representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, so as to obtain unified cross-media representations with stronger discriminativeness and semantic consistency. Extensive experiments on several cross-media benchmark datasets demonstrate that SCAHN outperforms state-of-the-art methods.

Cross-Domain 3D Model Retrieval Based On Contrastive Learning And Label Propagation

  • Dan Song
  • Yue Yang
  • Weizhi Nie
  • Xuanya Li
  • An-An Liu

In this work, we aim to tackle the task of unsupervised image-based 3D model retrieval, which seeks to retrieve unlabeled 3D models that are most visually similar to a 2D query image. Due to the challenging modality gap between 2D images and 3D models, existing mainstream methods adopt domain-adversarial techniques to eliminate the gap, which cannot guarantee the category-level alignment that is important for retrieval performance. Recent methods align the class centers of 2D images and 3D models to address category-level alignment. However, two main issues remain: 1) the category-level alignment is too coarse, and 2) the category prediction of unlabeled 3D models is inaccurate. To overcome the first problem, we utilize contrastive learning for fine-grained category-level alignment across domains, which pulls both prototypes and samples with the same semantic information closer and pushes those with different semantic information apart. To provide reliable semantic predictions for contrastive learning and address the second issue, we propose a consistent decision scheme for the pseudo labels of 3D models based on both the trained image classifier and label propagation. Experiments on the MI3DOR and MI3DOR-2 datasets demonstrate the effectiveness of our proposed method.

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

  • Zhixin Ma
  • Chong Wah Ngo

Known-item video search is effective with a human in the loop who interactively investigates the search results and refines the initial query. Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is hidden deep in the ranked list, finding the known-item target usually requires a long duration of browsing and result inspection. This paper tackles the problem with reinforcement learning, aiming to reach a search target within a few rounds of interaction by long-term learning from user feedback. Specifically, the system interactively plans a navigation path based on feedback and recommends a potential target, maximizing the long-term reward, for the user to comment on. We conduct experiments on the challenging task of video corpus moment retrieval (VCMR), which localizes moments in a large video corpus. Experimental results on the TVR and DiDeMo datasets verify that our proposed method effectively retrieves moments hidden deep inside the ranked lists of CONQUER and HERO, the state-of-the-art auto-search engines for VCMR.

Hierarchical Graph Embedded Pose Regularity Learning via Spatio-Temporal Transformer for Abnormal Behavior Detection

  • Chao Huang
  • Yabo Liu
  • Zheng Zhang
  • Chengliang Liu
  • Jie Wen
  • Yong Xu
  • Yaowei Wang

Abnormal behavior detection in surveillance video is a fundamental task in modern public security. Unlike typical pixel-based solutions, pose-based approaches leverage low-dimensional and strongly-structured skeleton features, which makes the anomaly detector immune to complex background noise and more efficient. However, existing pose-based methods only utilize the pose of each individual independently while ignoring the important interactions between individuals. In this paper, we present a hierarchical graph embedded pose regularity learning framework via a spatio-temporal transformer, which leverages the strength of graph representations in encoding strongly-structured skeleton features. Specifically, skeleton features are encoded as a hierarchical graph representation, which jointly models the interactions among multiple individuals and the correlations among body joints within the same individual. Furthermore, a novel task-specific spatio-temporal graph transformer is designed to encode the hierarchical spatio-temporal graph embeddings of human skeletons and learn the regular patterns within normal training videos. Experimental results indicate that our method obtains superior performance over state-of-the-art methods on several challenging datasets.

HMTN: Hierarchical Multi-scale Transformer Network for 3D Shape Recognition

  • Yue Zhao
  • Weizhi Nie
  • Zan Gao
  • An-an Liu

As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. Various approaches have been proposed, among which multiview-based methods show promising performance. In general, an effective 3D shape recognition algorithm should take both multiview local and global visual information into consideration, and explore the inherent properties of the generated 3D descriptors to guarantee the performance of feature alignment in the common space. To tackle these issues, we propose a novel Hierarchical Multi-scale Transformer Network (HMTN) for the 3D shape recognition task. In HMTN, we propose a multi-level regional transformer (MLRT) module for shape descriptor generation. MLRT includes two branches that extract intra-view local characteristics by modeling region-wise dependencies and provide supervision from multiview global information at different granularities. Specifically, MLRT can comprehensively consider the relations among different regions and focus on the discriminative parts, which improves the effectiveness of the learned descriptors. Finally, we adopt a cross-granularity contrastive learning (CCL) mechanism for shape descriptor alignment in the common space, which explores and utilizes cross-granularity semantic correlations to guide the descriptor extraction process while performing instance alignment based on category information. We evaluate the proposed network on several public benchmarks, and HMTN achieves competitive performance compared with state-of-the-art (SOTA) methods.

IDEAL: High-Order-Ensemble Adaptation Network for Learning with Noisy Labels

  • Peng-Fei Zhang
  • Zi Huang
  • Guangdong Bai
  • Xin-Shun Xu

Data annotations obtained for supervised learning often suffer from label noise, which inevitably yields unreliable deep neural networks. Existing solutions typically limit their scope to instance-independent label noise. Due to the high illegibility of data and the inexperience of annotators, instance-dependent noise has also been widely observed but rarely investigated. In this paper, we propose a novel IDEntify-and-ALign (IDEAL) methodology, which aims to eliminate the feature distribution shift raised by a broad spectrum of noise patterns. The proposed model is capable of learning noise-resilient feature representations and thereby correctly predicting data instances. More specifically, we formulate robust learning against noisy labels as a domain adaptation problem by identifying noisy data (i.e., samples with incorrect labels) and clean data in the dataset as two domains and minimizing their domain discrepancy in the feature space. In this framework, a high-order-ensemble adaptation network is devised to provide high-confidence predictions, based on which a specific criterion is defined for differentiating clean and noisy data. A new metric based on data augmentation is designed to measure the discrepancy between the clean and noisy domains. Along with a min-max learning strategy between the feature encoder and the classifier over this discrepancy, the domain gap is bridged, encouraging a noise-resilient model. In-depth theoretical analysis and extensive experiments on widely-used benchmark datasets demonstrate the effectiveness of the proposed method.

DVR: Micro-Video Recommendation Optimizing Watch-Time-Gain under Duration Bias

  • Yu Zheng
  • Chen Gao
  • Jingtao Ding
  • Lingling Yi
  • Depeng Jin
  • Yong Li
  • Meng Wang

Recommender systems are prone to be misled by biases in the data. Models trained with biased data fail to capture users' real interests, so it is critical to alleviate the impact of bias to achieve unbiased recommendation. In this work, we focus on an essential bias in micro-video recommendation: duration bias. Specifically, existing micro-video recommender systems usually consider watch time, which measures how long a user watches a video, as the most critical metric. Since videos with longer duration tend to have longer watch time, duration bias arises, causing longer videos to be recommended over short ones. In this paper, we empirically show that commonly-used metrics are vulnerable to duration bias, making them unsuitable for evaluating micro-video recommendation. To address this, we propose an unbiased evaluation metric called WTG (short for Watch Time Gain). Empirical results reveal that WTG can alleviate duration bias and better measure recommendation performance. Moreover, we design a simple yet effective model named DVR (short for Debiased Video Recommendation) that provides unbiased recommendation of micro-videos with varying duration and learns unbiased user preferences via adversarial learning. Extensive experiments on two real-world datasets demonstrate that DVR successfully eliminates duration bias and significantly improves recommendation performance, with over 30% relative improvement. Codes and datasets are released at
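As a rough illustration of duration debiasing, watch time can be standardized within duration buckets so that long and short videos are scored on a comparable scale. This is only a sketch of the group-wise normalization idea; the exact definition of WTG is given in the paper, and the function name and bucketing below are invented:

```python
import numpy as np

def duration_debiased_gain(watch_times, duration_buckets):
    """Standardize watch time within each duration bucket, so engagement
    on long and short videos becomes comparable."""
    wt = np.asarray(watch_times, float)
    buckets = np.asarray(duration_buckets)
    gain = np.empty_like(wt)
    for b in np.unique(buckets):
        grp = wt[buckets == b]
        std = grp.std() if grp.std() > 0 else 1.0
        gain[buckets == b] = (grp - grp.mean()) / std
    return gain

# Two short videos (bucket 0) and two long ones (bucket 1): raw watch time
# favors the long videos, but within-bucket gains are comparable.
print(duration_debiased_gain([10, 30, 60, 120], [0, 0, 1, 1]))  # -> [-1.  1. -1.  1.]
```

After this normalization, a short video watched to completion can outscore a long video that was abandoned early, which is the behavior an unbiased engagement metric should exhibit.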

SESSION: Poster Session II: Engaging User with Multimedia -- Multimedia Search and Recommendation

Video Moment Retrieval with Hierarchical Contrastive Learning

  • Bolin Zhang
  • Chao Yang
  • Bin Jiang
  • Xiaokang Zhou

This paper explores the task of video moment retrieval (VMR), which aims to localize the temporal boundary of a specific moment in an untrimmed video given a sentence query. Previous methods either extract pre-defined candidate moment features and rank the moments to select the one that best matches the query, or directly align the boundary clips of a target moment with the query and predict matching scores. Despite their effectiveness, these methods mostly focus on aligning the query with single-level clip or moment features, ignoring the different granularities within the video itself, such as clip, moment, and video, which results in insufficient cross-modal interaction. To this end, we propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. Specifically, we introduce a hierarchical contrastive learning method that better aligns the query and the video by maximizing the mutual information (MI) between the query and three granularities of the video to learn informative representations. Meanwhile, we introduce a self-supervised cycle-consistency loss to enforce further semantic alignment between fine-grained video clips and query words. Experiments on three standard benchmarks show the effectiveness of our proposed method.
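Maximizing mutual information between query and video representations is typically approximated with a contrastive bound such as InfoNCE: the positive pair is scored against a set of negatives through a temperature-scaled softmax. The following is a generic sketch of that bound, not HCLNet's exact loss; names and values are illustrative:

```python
import numpy as np

def info_nce(query, candidates, pos_index=0, tau=0.1):
    """InfoNCE: cross-entropy of picking the positive candidate among all
    candidates by cosine similarity (a lower bound on mutual information)."""
    q = np.asarray(query, float)
    cands = np.asarray(candidates, float)
    sims = cands @ q / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q))
    logits = sims / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_index])

q = [1.0, 0.0]
cands = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]   # first candidate is the positive
print(info_nce(q, cands))                        # small loss: positive ranks first
```

In a hierarchical setting the same loss is applied at each granularity (clip, moment, video), with the matching query-video pair as the positive and other videos in the batch as negatives.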

Learning to Retrieve Videos by Asking Questions

  • Avinash Madasu
  • Junier Oliva
  • Gedas Bertasius

The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be suboptimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog. The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize subsequent video retrieval performance. Our multimodal question generator uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our approach generalizes to real-world settings involving interactions with real humans, demonstrating the robustness and generality of our framework.

HEART: Towards Effective Hash Codes under Label Noise

  • Jinan Sun
  • Haixin Wang
  • Xiao Luo
  • Shikun Zhang
  • Wei Xiang
  • Chong Chen
  • Xian-Sheng Hua

Hashing, which encodes raw data into compact binary codes, has grown in popularity for large-scale image retrieval due to its storage and computation efficiency. Although deep supervised hashing methods have lately shown promising performance, they mostly assume that the semantic labels of training data are ideally noise-free, which is often unrealistic in real-world applications. In this paper, motivated by practical applications, we focus on the problem of learning to hash with label noise and propose a novel method called HEART to address it. HEART is a holistic framework which explores latent semantic distributions to select both clean samples and pairs of high confidence for mitigating the impact of label noise. From a statistical perspective, HEART characterizes each image by its multiple augmented views, which can be considered as examples from its latent distribution, and then calculates semantic distances between images using energy distances between their latent distributions. With semantic distances, we can select confident similar pairs to guide hashing contrastive learning toward high-quality hash codes. Moreover, to prevent the memorization of noisy examples, we propose a novel strategy to identify clean samples, which have small variations of losses on the latent distributions, and train the network on clean samples using a pointwise loss. Experimental results on several popular benchmark datasets demonstrate the effectiveness of HEART compared with a wide range of baselines.

Learning Hybrid Behavior Patterns for Multimedia Recommendation

  • Zongshen Mu
  • Yueting Zhuang
  • Jie Tan
  • Jun Xiao
  • Siliang Tang

Multimedia recommendation aims to predict user preferences where users interact with multimodal items. Collaborative filtering based on graph convolutional networks manifests impressive performance gains in multimedia recommendation. This is attributed to the capability of learning good user and item embeddings by aggregating the collaborative signals from high-order neighbors. However, previous studies [37,38] fail to explicitly mine different behavior patterns (i.e., item categories, common user interests) by exploiting user-item and item-item graphs simultaneously, which play an important role in modeling user preferences. This lack of behavior pattern constraints and multimodal feature reconciliation results in performance degradation. Towards this end, we propose a Hybrid Clustering Graph Convolutional Network (HCGCN) for multimedia recommendation. We perform high-order graph convolutions inside user-item clusters and item-item clusters to capture various user behavior patterns. Meanwhile, we design corresponding clustering losses to enhance user-item preference feedback and a multimodal representation learning constraint to adjust the modality importance, making recommendations more accurate. Experimental results on three real-world multimedia datasets not only demonstrate the significant improvement of our model over state-of-the-art methods, but also validate the effectiveness of integrating hybrid user behavior patterns for multimedia recommendation.

Breaking Isolation: Multimodal Graph Fusion for Multimedia Recommendation by Edge-wise Modulation

  • Feiyu Chen
  • Junjie Wang
  • Yinwei Wei
  • Hai-Tao Zheng
  • Jie Shao

In a multimedia recommender system, the rich multimodal dynamics of user-item interactions are worth exploiting and have been facilitated by Graph Convolutional Networks (GCNs). Yet, the typical way of conducting multimodal fusion with GCN-based models is either through graph mergence fusion, which delivers insufficient inter-modal dynamics, or through node alignment fusion, which brings in noise that potentially harms multimodal modelling. Unlike existing works, we propose EgoGCN, a structure that seeks to enhance multimodal learning of user-item interactions. At its core is a simple yet effective fusion operation dubbed EdGe-wise mOdulation (EGO) fusion. EGO fusion adaptively distils edge-wise multimodal information and learns to modulate each unimodal node under the supervision of other modalities. It breaks isolated unimodal propagations and allows the most informative inter-modal messages to spread, whilst preserving intra-modal processing. We present a hard modulation and a soft modulation to fully investigate the underlying multimodal dynamics. Experiments on two real-world datasets show that EgoGCN comfortably beats prior methods.

Image-Text Matching with Fine-Grained Relational Dependency and Bidirectional Attention-Based Generative Networks

  • Jianwei Zhu
  • Zhixin Li
  • Yufei Zeng
  • Jiahui Wei
  • Huifang Ma

Generally, most existing cross-modal retrieval methods only consider global or local semantic embeddings, lacking fine-grained dependencies between objects. At the same time, they usually ignore that the mutual transformation between modalities also facilitates modality embedding. To address these problems, we propose a method called BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation). The model uses a bidirectional graph convolutional neural network to establish dependencies between objects, and employs a bidirectional attention-based generative network to achieve the mutual transformation between modalities. Specifically, a knowledge graph is used for local matching to constrain the local expression of the modalities, while the generative network is used for mutual transformation to constrain their global expression. We also propose a new position relation embedding network to encode positional relations between objects. Experiments on two public datasets show that the performance of our method is dramatically improved compared to many state-of-the-art models.

Visual Grounding in Remote Sensing Images

  • Yuxi Sun
  • Shanshan Feng
  • Xutao Li
  • Yunming Ye
  • Jian Kang
  • Xu Huang

Ground object retrieval from large-scale remote sensing images is important for many applications. We present a novel problem of visual grounding in remote sensing images. Visual grounding aims to locate particular objects (in the form of a bounding box or segmentation mask) in an image by a natural language expression. The task already exists in the computer vision community; however, existing benchmark datasets and methods mainly focus on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude, latitude). Existing methods cannot deal with these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. In particular, the proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses text and image features for visual grounding. We evaluate the proposed method by comparing it to state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset.

Prompt-based Zero-shot Video Moment Retrieval

  • Guolong Wang
  • Xun Wu
  • Zhaoyuan Liu
  • Junchi Yan

Video moment retrieval (VMR) aims at localizing a specific moment in an untrimmed video given a sentence query. Most methods rely on heavy annotations of video moment-query pairs. Recent zero-shot methods reduce annotation cost, yet they neglect the global visual feature due to the separation of the video and text learning processes. To avoid this lack of visual features, we propose a Prompt-based Zero-shot Video Moment Retrieval (PZVMR) method. Motivated by the framework of prompt learning, we design two modules: 1) Proposal Prompt (PP): we randomly mask sequential frames to build a prompt that generates proposals; 2) Verb Prompt (VP): we provide patterns of nouns and the masked verb to build a prompt that generates pseudo queries with verbs. Our PZVMR utilizes task-relevant knowledge distilled from pre-trained CLIP and adapts this knowledge to VMR. Unlike the pioneering work, we introduce visual features into each module. Extensive experiments show that our PZVMR not only outperforms the existing zero-shot method (PSVL) on two public datasets (Charades-STA and ActivityNet-Captions) by 4.4% and 2.5% respectively in mIoU, but also outperforms several methods using stronger supervision.

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

  • Yabing Wang
  • Jianfeng Dong
  • Tianxiang Liang
  • Minsong Zhang
  • Rui Cai
  • Xun Wang

Despite recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. Besides, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision and language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at

Learn to Understand Negation in Video Retrieval

  • Ziyue Wang
  • Aozhu Chen
  • Fan Hu
  • Xirong Li

Negation is a common linguistic skill that allows humans to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with a dog. However, state-of-the-art deep learning based video retrieval models lack this ability, as they are typically trained on video description datasets such as MSR-VTT and VATEX that lack negated descriptions. Their retrieved results basically ignore the negator in the sample query, incorrectly returning videos showing kids playing with a dog. This paper presents the first study on learning to understand negation in video retrieval and makes the following contributions. By re-purposing two existing datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video retrieval with negation. We propose a learning based method for training a negation-aware video retrieval model. The key idea is to first construct a soft negative caption for a specific training video by partially negating its original caption, and then compute a bidirectionally constrained loss on the triplet. This auxiliary loss is added, with a weight, to a standard retrieval loss. Experiments on the re-purposed benchmarks show that re-training the CLIP (Contrastive Language-Image Pre-Training) model by the proposed method clearly improves its ability to handle queries with negation. In addition, the model performance on the original benchmarks is also improved.

AdsCVLR: Commercial Visual-Linguistic Representation Modeling in Sponsored Search

  • Yongjie Zhu
  • Chunhui Han
  • Yuefeng Zhan
  • Bochen Pang
  • Zhaoju Li
  • Hao Sun
  • Si Li
  • Boxin Shi
  • Nan Duan
  • Weiwei Deng
  • Ruofei Zhang
  • Liangjie Zhang
  • Qi Zhang

Sponsored search advertisements (ads) appear next to search results when consumers look for products and services on search engines. As the fundamental basis of search ads, relevance modeling has attracted increasing attention due to its significant research challenges and tremendous practical value. In this paper, we address the problem of multi-modal modeling in sponsored search, which models the relevance between a user query and commercial ads with multi-modal structured information. To solve this problem, we propose AdsCVLR, a transformer architecture for Commercial Visual-Linguistic Representation of ads data with contrastive learning, which naturally extends the transformer encoder with complementary multi-modal inputs, serving as a strong aggregator of image-text features. We also release a public advertising dataset, which includes 480K labeled query-ad pairs with structured information of image, title, seller, description, and so on. Empirically, we evaluate the AdsCVLR model on a large industry dataset, and the experimental results of online/offline tests show the superiority of our method.

Differentiable Cross-modal Hashing via Multimodal Transformers

  • Junfeng Tu
  • Xueliang Liu
  • Zongxiang Lin
  • Richang Hong
  • Meng Wang

Cross-modal hashing aims at projecting cross-modal content into a common Hamming space for efficient search. Most existing works first encode the samples with a deep network and then binarize the encoded features into hash codes. However, the relative location information in an image may be lost when the image is encoded by a convolutional network, which makes it challenging to model the relationship between different modalities. Moreover, it is NP-hard to optimize the model with the discrete sign function popularly used in existing solutions. To address these issues, we propose a differentiable cross-modal hashing method that utilizes a multimodal transformer as the backbone to capture the location information in an image when encoding the visual content. In addition, a novel differentiable hashing mechanism is proposed to generate the binary code by a selecting mechanism, which can be formulated as a continuous and easily optimized problem. We perform extensive experiments on several cross-modal datasets and the results show that the proposed method outperforms many existing solutions.
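The optimization difficulty mentioned above, that the discrete sign function has zero gradient almost everywhere, is often circumvented with a continuous surrogate. The sketch below shows a generic tanh relaxation, which is one common workaround rather than the selection mechanism proposed in this paper; the function names are illustrative.

```python
import numpy as np

def soft_binarize(z, beta):
    """Continuous surrogate for sign(z).

    tanh(beta * z) approaches sign(z) as beta grows, but unlike sign it
    has a non-zero gradient everywhere, so the encoder producing z can
    be optimized end to end with gradient descent.
    """
    return np.tanh(beta * z)

def soft_binarize_grad(z, beta):
    """d/dz tanh(beta * z) = beta * (1 - tanh(beta * z)^2), never zero."""
    t = np.tanh(beta * z)
    return beta * (1.0 - t * t)
```

In practice beta is typically annealed upward during training, so the relaxed codes gradually converge to binary values while gradients remain usable.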

Multi-Level Region Matching for Fine-Grained Sketch-Based Image Retrieval

  • Zhixin Ling
  • Zhen Xing
  • Jiangtong Li
  • Li Niu

Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) uses free-hand sketches as queries to perform instance-level retrieval in an image gallery. Existing works usually leverage only high-level information and perform matching in a single region. However, both low-level and high-level information are helpful for establishing fine-grained correspondence. Besides, we argue that matching different regions between each sketch-image pair can further boost model robustness. Therefore, we propose Multi-Level Region Matching (MLRM) for FG-SBIR, which consists of two modules: a Discriminative Region Extraction module (DRE) and a Region and Level Attention module (RLA). In DRE, we propose Light-weighted Attention Map Augmentation (LAMA) to extract local features from different regions. In RLA, we propose a transformer-based attentive matching module that learns attention weights to explore the varying importance of different image/sketch regions and feature levels. Furthermore, to ensure that geometrical and semantic distinctiveness is well modeled, we also explore a novel LAMA overlapping penalty and a local region-negative triplet loss in our proposed MLRM method. Comprehensive experiments conducted on five datasets (i.e., Sketchy, QMUL-ChairV2, QMUL-ShoeV2, QMUL-Chair, QMUL-Shoe) demonstrate the effectiveness of our method.

DDGHM: Dual Dynamic Graph with Hybrid Metric Training for Cross-Domain Sequential Recommendation

  • Xiaolin Zheng
  • Jiajie Su
  • Weiming Liu
  • Chaochao Chen

Sequential Recommendation (SR) characterizes evolving patterns of user behaviors by modeling how users transit among items. However, short interaction sequences limit the performance of existing SR methods. To solve this problem, we focus on Cross-Domain Sequential Recommendation (CDSR) in this paper, which aims to leverage information from other domains to improve the sequential recommendation performance of a single domain. Solving CDSR is challenging. On the one hand, how to retain single-domain preferences while integrating cross-domain influence remains an essential problem. On the other hand, the data sparsity problem cannot be totally solved by simply utilizing knowledge from other domains, due to the limited length of the merged sequences. To address these challenges, we propose DDGHM, a novel framework for the CDSR problem, which includes two main modules, i.e., dual dynamic graph modeling and hybrid metric training. The former captures intra-domain and inter-domain sequential transitions by dynamically constructing two-level graphs, i.e., local graphs and global graphs, and incorporating them with a fuse attentive gating mechanism. The latter enhances user and item representations by employing hybrid metric learning, including a collaborative metric for achieving alignment and a contrastive metric for preserving uniformity, to further alleviate the data sparsity issue and improve prediction accuracy. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness of DDGHM.

Spatial-Temporal Aligned Multi-Agent Learning for Visual Dialog Systems

  • Yong Zhuang
  • Tong Yu
  • Junda Wu
  • Shiqu Wu
  • Shuai Li

Existing interactive learning systems usually train models on simulators as surrogates for real users. Due to the limited amount of user data, trained simulators may lead to biased results as they fail to represent real users well. One solution is to model users as agents, and then simultaneously train the interactive system and user agents with multi-agent reinforcement learning (MARL) frameworks. However, developing efficient MARL frameworks for modern interactive multimodal systems is still challenging. First, given the existence of multimodal data, how to develop accurate multimodal fusion within and between agents in each interaction is challenging and unclear. Second, interactions between users and systems are complex, and it is challenging to track and synchronize the interactions over time. The above multimodal fusion between agents and synchronization over time become even more challenging when the amount of user data is limited. To jointly address these challenges and achieve more sample-efficient learning, we propose a novel spatial-temporal aligned (STA) multi-agent reinforcement learning framework to better align the multimodal data within and between agents over time. Based on our framework, we develop sample-efficient visual dialog systems. Through extensive experiments and analysis, we validate the effectiveness of our STA multi-agent reinforcement learning framework in visual dialog systems.

Learning Intrinsic and Extrinsic Intentions for Cold-start Recommendation with Neural Stochastic Processes

  • Huafeng Liu
  • Liping Jing
  • Dahai Yu
  • Mingjie Zhou
  • Michael Ng

User behavior data in recommendation are driven by the complex interactions of the many intentions behind the user's decision making process. However, user behavior data tend to be sparse because of the limited user response and the vast combinations of users and items, which results in unclear user intentions and the cold-start problem. The intentions are highly compound, and may range from high-level ones that govern the user's intrinsic interests and capture the underlying reasons behind the user's decision making processes, to low-level ones that characterize a user's extrinsic preference when executing an intention on specific items. In this paper, we propose an intention neural process model (INP) for user cold-start recommendation (i.e., users with very few historical interactions), a novel extension of the neural stochastic process family using a general meta-learning strategy with intrinsic and extrinsic intention learning for robust user preference learning. By regarding the recommendation process for each user as a stochastic process, INP defines distributions over functions and is capable of rapid adaptation to new users. Our approach learns intrinsic intentions by inferring the high-level concepts associated with user interests or purposes, while capturing the target preference of a user by performing self-supervised intention matching between historical items and target items in a disentangled latent space. Extrinsic intentions are learned by simultaneously generating the point-wise implicit feedback data and creating the pair-wise ranking list, sufficiently exploiting both interacted and non-interacted items for each user. Empirical results show that our approach achieves substantial improvement over state-of-the-art baselines on cold-start recommendation.

Camera-specific Informative Data Augmentation Module for Unbalanced Person Re-identification

  • Pingting Hong
  • Dayan Wu
  • Bo Li
  • Weiping Wang

Person re-identification (Re-ID) aims at retrieving the same person across non-overlapping camera networks. Recent works have achieved impressive performance due to the rapid development of deep learning techniques. However, most existing methods have ignored the practical unbalanced property of real-world Re-ID scenarios. In fact, the number of pedestrian images in different cameras varies a lot: some cameras cover thousands of images while others only have a few. As a result, this camera-unbalanced problem reduces intra-camera diversity, so the model cannot learn camera-invariant features to distinguish pedestrians from "poor" cameras. In this paper, we design a novel camera-specific informative data augmentation module (CIDAM) to alleviate the camera-unbalanced problem. Specifically, we first calculate the camera-specific distribution online, then refine a "poor" camera's covariance matrix with similar cameras defined in a prototype-based similarity matrix. Consequently, informative augmented samples are generated by combining original samples with sampled random vectors in feature space. To ensure these augmented samples better benefit model training, we further propose a dynamic-threshold-based contrastive loss. Since augmented samples may not be as real as original ones, we calculate a threshold for each original sample dynamically and only push hard negative augmented samples away. Moreover, our CIDAM is compatible with a variety of existing Re-ID methods. Extensive experiments prove the effectiveness of our method.
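The feature-space augmentation idea, combining original samples with random vectors drawn from a camera-specific distribution, can be sketched generically as sampling Gaussian offsets from a per-camera covariance. This is a simplified stand-in for the module, with hypothetical function names and shapes, not the paper's exact procedure.

```python
import numpy as np

def augment_features(feats, camera_cov, n_aug, seed=0):
    """Generate augmented samples in feature space for one camera.

    feats: (n, d) original features of the camera; camera_cov: (d, d)
    camera-specific covariance (in the paper, refined using statistics
    from similar cameras). Each augmented sample is an original feature
    plus a random offset drawn from N(0, camera_cov).
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(feats), size=n_aug)      # pick originals
    noise = rng.multivariate_normal(np.zeros(feats.shape[1]),
                                    camera_cov, size=n_aug)
    return feats[idx] + noise
```

Because the offsets are zero-mean, the augmented cloud stays centered on the original samples while restoring intra-camera diversity for under-represented cameras.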

TopicVAE: Topic-aware Disentanglement Representation Learning for Enhanced Recommendation

  • Zhiqiang Guo
  • Guohui Li
  • Jianjun Li
  • Huaicong Chen

Learning disentangled representations that reflect user preference based on user behavior (implicit feedback, such as clicks and purchases) and content information (e.g., plot descriptions, posters) has become a hot research topic in modern recommender systems. However, most existing methods considering content information are not well designed to disentangle user preference features, as they neglect the diversity of user preferences over different semantic topics of items, resulting in sub-optimal performance and low interpretability. To address this problem, we propose a novel Topic-aware Disentangled Variational AutoEncoder (TopicVAE) to learn disentangled representations for enhanced recommendation. Specifically, we first utilize an attention-based topic extraction module to extract topic-level item representations and a topic-item probability distribution from item content, and then introduce a variational autoencoder to infer topic-level disentangled user representations. To guide the learning of topic-level disentanglement, we present a topic-guided self-supervised contrastive loss to promote the otherness of different topics by introducing a neighborhood-based user representation as guidance. Besides, a heuristic regularization is designed to force each dimension of the disentangled representations to independently reflect a fine-grained factor of a specific topic (e.g., red or blue for color) for feature-level disentanglement. Extensive experimental studies on three public datasets show that TopicVAE significantly outperforms several state-of-the-art baselines. Further empirical experiments also illustrate the interpretability of the disentangled representations learned by TopicVAE.

Pixel-Level Anomaly Detection via Uncertainty-aware Prototypical Transformer

  • Chao Huang
  • Chengliang Liu
  • Zheng Zhang
  • Zhihao Wu
  • Jie Wen
  • Qiuping Jiang
  • Yong Xu

Pixel-level visual anomaly detection, which aims to recognize abnormal areas in images, plays an important role in industrial fault detection and medical diagnosis. However, it is a challenging task for the following reasons: i) the large variation of anomalies; and ii) the ambiguous boundary between anomalies and their normal surroundings. In this work, we present an uncertainty-aware prototypical transformer (UPformer), which takes into account both the diversity and the uncertainty of anomalies to achieve accurate pixel-level visual anomaly detection. To this end, we first design a memory-guided prototype learning transformer encoder to learn and memorize the prototypical representations of anomalies, enabling the model to capture their diversity. Additionally, an anomaly detection uncertainty quantizer is designed to learn the distributions of anomaly detection for measuring detection uncertainty. Furthermore, an uncertainty-aware transformer decoder is proposed to leverage the detection uncertainties to guide the model to focus on uncertain areas and generate the final detection results. As a result, our method achieves more accurate anomaly detection by combining the benefits of prototype learning and uncertainty estimation. Experimental results on five datasets indicate that our method achieves state-of-the-art anomaly detection performance.

Dynamic Prototype Mask for Occluded Person Re-Identification

  • Lei Tan
  • Pingyang Dai
  • Rongrong Ji
  • Yongjian Wu

Although person re-identification has achieved impressive improvements in recent years, the common occlusion case caused by different obstacles is still an unsettled issue in real application scenarios. Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible parts. Nevertheless, the inevitable domain gap between the assistant model and the ReID datasets greatly increases the difficulty of obtaining an effective and efficient model. To avoid extra pre-trained networks and achieve automatic alignment in an end-to-end trainable network, we propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge. Specifically, we first devise a Hierarchical Mask Generator which utilizes hierarchical semantics to select the visible pattern space between the high-quality holistic prototype and the feature representation of the occluded input image. Under this condition, the occluded representation can be well aligned in a selected subspace spontaneously. Then, to enrich the feature representation of the high-quality holistic prototype and provide a more complete feature space, we introduce a Head Enrich Module to encourage different heads to aggregate different pattern representations across the whole image. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate the superior performance of DPM over state-of-the-art methods.

Meta Reconciliation Normalization for Lifelong Person Re-Identification

  • Nan Pu
  • Yu Liu
  • Wei Chen
  • Erwin M. Bakker
  • Michael S. Lew

Lifelong person re-identification (LReID) is a challenging and emerging task, which concerns ReID capability on both seen and unseen domains after learning across different domains continually. Existing works on LReID are devoted to introducing commonly-used lifelong learning approaches, while neglecting a serious side effect caused by using normalization layers in the context of domain-incremental learning. In this work, we aim to raise awareness of the importance of training proper batch normalization layers by proposing a new meta reconciliation normalization (MRN) method specifically designed for tackling LReID. Our MRN consists of grouped mixture standardization and additive rectified rescaling components, which automatically maintain an optimal balance between domain-dependent and domain-independent statistics, and even adapt MRN to different testing instances. Furthermore, inspired by synaptic plasticity in the human brain, we present an MRN-based meta-learning framework for mining the meta-knowledge shared across different domains, even without replaying any previous data, and further improve the model's LReID ability with theoretical analyses. Our method achieves new state-of-the-art performance on both balanced and imbalanced LReID benchmarks.

Attack is the Best Defense: Towards Preemptive-Protection Person Re-Identification

  • Lin Wang
  • Wanqian Zhang
  • Dayan Wu
  • Fei Zhu
  • Bo Li

Person Re-IDentification (ReID) aims at retrieving images of the same person across multiple camera views. Despite its popularity in surveillance and public safety, the leakage of identity information is still a risk. For example, once obtaining illegal access to a ReID system, a malicious user can accurately retrieve a target person, leading to the exposure of private information. Recently, some pioneering works protect private images with adversarial examples by adding imperceptible perturbations to target images. However, in this paper, we argue that directly applying adversary-based methods to protect the ReID system is sub-optimal due to the 'overlap identity' issue. Specifically, merely pushing the adversarial image away from its original label would probably move it into the vicinity of other identities. This leads to the potential risk of being retrieved when querying with all the other identities exhaustively. We thus propose a novel Preemptive-protection person Re-IDentification (PRIDE) method. By explicitly constraining the adversarial image to an isolated location, the target person stays far away from both the original identity and all other identities, which protects them from being retrieved by illegal queries. Moreover, we further propose two crucial attack scenarios (Random Attack and Order Attack) and a novel Success Protection Rate (SPR) metric to quantify the protection ability. Experiments show consistent outperformance of our method over other baselines across different ReID models, datasets and attack scenarios.

TAGPerson: A Target-Aware Generation Pipeline for Person Re-identification

  • Kai Chen
  • Weihua Chen
  • Tao He
  • Rong Du
  • Fan Wang
  • Xiuyu Sun
  • Yuchen Guo
  • Guiguang Ding

Nowadays, real data in the person re-identification (ReID) task faces privacy issues, e.g., the banned dataset DukeMTMC-ReID. Thus it becomes much harder to collect real data for the ReID task. Meanwhile, the labor cost of labeling ReID data is still very high and further hinders the development of ReID research. Therefore, many methods turn to generating synthetic images for ReID algorithms as alternatives to real images. However, there is an inevitable domain gap between synthetic and real images. In previous methods, the generation process is based on virtual scenes, and the synthetic training data cannot be changed automatically according to different target real scenes. To handle this problem, we propose a novel Target-Aware Generation pipeline to produce synthetic person images, called TAGPerson. Specifically, it involves a parameterized rendering method, where the parameters are controllable and can be adjusted according to the target scenes. In TAGPerson, we extract information from target scenes and use it to control our parameterized rendering process to generate target-aware synthetic images, which hold a smaller gap to the real images in the specific target domain. In our experiments, our target-aware synthetic images achieve much higher performance than generalized synthetic images on MSMT17, i.e., 47.5% vs. 40.9% rank-1 accuracy. We will release this toolkit for the ReID community to generate synthetic images with any desired configuration. The code is available at:

Efficient Hash Code Expansion by Recycling Old Bits

  • Dayan Wu
  • Qinghang Su
  • Bo Li
  • Weiping Wang

Deep hashing methods have been intensively studied and successfully applied in large-scale multimedia retrieval. In real-world scenarios, code length cannot be set once and for all if retrieval accuracy is not satisfying. However, when code length increases, conventional deep hashing methods have to retrain their models and regenerate all database codes, which is impractical for large-scale retrieval systems. In this paper, we propose a deep hashing method from a new perspective, called Code Expansion oriented Deep Hashing (CEDH). Different from conventional deep hashing methods, our CEDH focuses on the fast expansion of existing hash codes. Instead of regenerating all bits from raw images, the new bits in CEDH can be incrementally learned by recycling the old ones. Specifically, we elaborately design an end-to-end asymmetric framework to simultaneously optimize a CNN model for query images and a code projection matrix for database images. With the learned code projection matrix, hash codes can be expanded quickly through simple matrix multiplication. Subsequently, a novel code expansion hashing loss is proposed to preserve the similarities between query codes and expanded database codes. Due to the loose coupling in our framework, CEDH is compatible with a variety of deep hashing methods. Moreover, we propose to adopt a smooth similarity matrix to solve the "similarity contradiction" problem in multi-label image datasets, further improving performance on multi-label datasets. Extensive experiments on three widely used image retrieval benchmarks demonstrate that CEDH can significantly reduce the cost of expanding database codes (about 100,000x faster with GPU and 1,000,000x faster with CPU) when code length increases, while keeping state-of-the-art retrieval accuracy. Our code is available at
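The expansion step described in the abstract (deriving new bits from old ones via a learned projection, then concatenating) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the random matrix below stands in for the learned code projection matrix, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 1000, 32, 16                          # database size, old bits, new bits

B_old = np.sign(rng.standard_normal((n, k)))    # existing {-1, +1} database codes
W = rng.standard_normal((k, m))                 # stand-in for the learned projection

# new bits are computed from the old codes alone -- no raw images needed
B_new = np.sign(B_old @ W)
B_exp = np.concatenate([B_old, B_new], axis=1)  # expanded 48-bit codes
```

Because the expansion is a single matrix product over compact binary codes rather than a forward pass over raw images, it scales to very large databases, which is the efficiency argument the abstract makes.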

Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for Personalized Micro-video Recommendation

  • Desheng Cai
  • Shengsheng Qian
  • Quan Fang
  • Jun Hu
  • Changsheng Xu

Micro-video recommendation has attracted extensive research attention with the increasing popularity of micro-video sharing platforms, and substantial effort has been devoted to the task. Recently, homogeneous (or heterogeneous) GNN-based approaches utilize graph convolutional operators (or meta-path-based similarity measures) to learn meaningful representations for users and micro-videos and show promising performance on the micro-video recommendation task. However, these methods may suffer from the following problems: (1) they fail to aggregate information from distant or long-range nodes; (2) they ignore the varying intensity of users' preferences for different items; (3) they neglect the similarities of the multi-modal contents of micro-videos. In this paper, we propose a novel Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for personalized micro-video recommendation. Specifically, we design a collaborative representation learning module and a semantic representation learning module to fully exploit user-video interaction information and the similarities of micro-videos, respectively. Furthermore, we utilize an anti-bottleneck module to automatically learn the importance weights of short-range and long-range neighboring nodes to obtain more expressive representations of users and micro-videos. Finally, to account for the varying intensity of users' preferences for different micro-videos, we design and optimize an adaptive recommendation loss to train our model in an end-to-end manner. We evaluate our method on three real-world datasets, and the results demonstrate that the proposed model outperforms the baselines.

Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

  • Uttaran Bhattacharya
  • Gang Wu
  • Stefano Petrangeli
  • Viswanathan Swaminathan
  • Dinesh Manocha

We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
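The matching pipeline this abstract describes (fuse a user's preferred-clip features into one representation, then score target-video frames by similarity) can be sketched roughly as follows. The single-head attention and top-k selection are deliberate simplifications of the paper's multi-head, content-based design, and every name, dimension, and scoring choice here is hypothetical.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(1)
clips = rng.standard_normal((5, 64))     # features of a user's preferred clips
frames = rng.standard_normal((100, 64))  # per-frame features of the target video

# toy attention: weigh each preferred clip by similarity to their mean
query = clips.mean(axis=0)
scores = clips @ query
w = np.exp(scores - scores.max())
w /= w.sum()
user_repr = (w[:, None] * clips).sum(axis=0)  # single per-user representation

# cosine similarity between the user representation and each frame
sim = l2norm(frames) @ l2norm(user_repr)
highlight_idx = np.argsort(sim)[-10:]         # top-10 frames as candidate highlights
```

In the actual method the clip weights come from object- and human-activity-based content features rather than a mean query, but the overall flow (weighted fusion, then per-frame similarity) is the same.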

Prototype-based Selective Knowledge Distillation for Zero-Shot Sketch Based Image Retrieval

  • Kai Wang
  • Yifan Wang
  • Xing Xu
  • Xin Liu
  • Weihua Ou
  • Huimin Lu

Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is an emerging research task that aims to retrieve data of new classes across sketches and images. It is challenging due to the heterogeneous distributions and inconsistent semantics across the seen and unseen classes of the cross-modal data of sketches and images. To realize knowledge transfer, the latest approaches introduce knowledge distillation, which optimizes the student network through a teacher signal distilled from a teacher network pre-trained on large-scale datasets. However, these methods often ignore the mispredictions in the teacher signal, which may make the model vulnerable when disturbed by the wrong outputs of the teacher network. To tackle the above issues, we propose a novel method termed Prototype-based Selective Knowledge Distillation (PSKD) for ZS-SBIR. Our PSKD method first learns a set of prototypes to represent categories and then utilizes an instance-level adaptive learning strategy to strengthen semantic relations between categories. Afterwards, a correlation matrix targeted at the downstream task is established through the prototypes. With the learned correlation matrix, the teacher signal, given by transformers pre-trained on ImageNet and fine-tuned on the downstream dataset, can be reconstructed to weaken the impact of mispredictions and to selectively distill knowledge into the student network. Extensive experiments conducted on three widely-used datasets demonstrate that the proposed PSKD method establishes the new state-of-the-art performance on all datasets for ZS-SBIR.

ARRA: Absolute-Relative Ranking Attack against Image Retrieval

  • Siyuan Li
  • Xing Xu
  • Zailei Zhou
  • Yang Yang
  • Guoqing Wang
  • Heng Tao Shen

With the extensive application of deep learning, adversarial attacks, especially query-based attacks, have received more attention than ever before. However, the scenarios assumed by existing query-based attacks against image retrieval are usually too simple to satisfy the attack demand. In this paper, we propose a novel method termed Absolute-Relative Ranking Attack (ARRA) that considers a more practical attack scenario. Specifically, we propose two compatible goals for the query-based attack, i.e., absolute ranking attack and relative ranking attack, which aim to assign specific ranks to chosen candidates and to change the relative order of chosen candidates in the retrieval list, respectively. We further devise the Absolute Ranking Loss (ARL) and Relative Ranking Loss (RRL) for the above goals, implement our ARRA by minimizing their combination with black-box optimizers, and evaluate the attack performance by attack success rate and normalized ranking correlation. Extensive experiments conducted on the widely-used SOP and CUB-200 datasets demonstrate the superiority of the proposed approach over the baselines. Moreover, the attack results on a real-world image retrieval system, i.e., Huawei Cloud Image Search, also prove the practicability of our ARRA approach.

Invariant Representation Learning for Multimedia Recommendation

  • Xiaoyu Du
  • Zike Wu
  • Fuli Feng
  • Xiangnan He
  • Jinhui Tang

Multimedia recommendation forms a personalized ranking task with multimedia content representations, which are mostly extracted via generic encoders. However, the generic representations introduce spurious correlations --- correlations that are meaningless from the recommendation perspective. For example, suppose a user bought two dresses shown on the same model; this co-occurrence would produce a correlation between the model and the purchases, but the correlation is spurious from the view of fashion recommendation. Existing work alleviates this issue by customizing preference-aware representations, which requires high-cost analysis and design.

In this paper, we propose an Invariant Representation Learning Framework (InvRL) to alleviate the impact of spurious correlations. We utilize environments to reflect the spurious correlations and determine each environment with a set of interactions. We then learn invariant representations --- the inherent factors attracting user attention --- to make consistent predictions of user-item interactions across the various environments. In this light, InvRL comprises two iteratively executed modules to cluster user-item interactions and learn invariant representations. With them, InvRL trains a final recommender model, thus mitigating the spurious correlations. We demonstrate InvRL on a cutting-edge recommender model, UltraGCN, and conduct extensive experiments on three public multimedia recommendation datasets: Movielens, Tiktok, and Kwai. The experimental results validate the rationality and effectiveness of InvRL. Codes are released at

Early-Learning regularized Contrastive Learning for Cross-Modal Retrieval with Noisy Labels

  • Tianyuan Xu
  • Xueliang Liu
  • Zhen Huang
  • Dan Guo
  • Richang Hong
  • Meng Wang

Cross-modal retrieval has received intensive attention for enabling flexible queries between different modalities. However, in practice it is challenging to retrieve cross-modal content with noisy labels. The latest research on machine learning shows that a model tends to fit cleanly labeled data in the early learning stage and then memorize the data with noisy labels. Although the clustering strategy in cross-modal retrieval can be utilized to alleviate outliers, the networks rapidly overfit once the clean data is fitted well, and the noisy labels then begin to force the cluster centers to drift. Motivated by these fundamental phenomena, we propose an Early Learning regularized Contrastive Learning method for Cross Modal Retrieval with Noisy Labels (ELRCMR). In the solution, we propose to project the multi-modal data into a shared feature space by contrastive learning, in which early learning regularization is employed to prevent the memorization of noisy labels when training the model, and a dynamic weight balance strategy is employed to alleviate clustering drift. We evaluated the method with extensive experiments; the results show that the proposed method resolves the cluster drift of conventional solutions and achieves promising performance on widely used benchmark datasets.

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

  • Yiwei Ma
  • Guohai Xu
  • Xiaoshuai Sun
  • Ming Yan
  • Ji Zhang
  • Rongrong Ji

Video-text retrieval has been a crucial and fundamental task in multi-modal research. Its development has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between the coarse-grained feature and each fine-grained feature, and is able to filter out unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. A further challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices into an instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1).
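Attention-weighted pooling of a frame-word similarity matrix into a single instance-level score, of the general kind the AOSM module performs, can be illustrated with a small NumPy sketch. This is an illustrative simplification, not the paper's exact formulation; all names and the temperature value are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_similarity(S, tau=0.1):
    """Collapse a frames-by-words similarity matrix S into one score by
    softmax-weighted pooling over both axes, so high-similarity frames
    and words dominate and irrelevant ones are down-weighted."""
    per_frame = (softmax(S / tau, axis=1) * S).sum(axis=1)  # pool words per frame
    per_word = (softmax(S / tau, axis=0) * S).sum(axis=0)   # pool frames per word
    w_f = softmax(per_frame / tau)                          # weigh essential frames
    w_w = softmax(per_word / tau)                           # weigh essential words
    return 0.5 * (w_f @ per_frame + w_w @ per_word)

rng = np.random.default_rng(0)
S = rng.standard_normal((12, 7))  # e.g. 12 frames x 7 words
score = attention_over_similarity(S)
```

Since every pooling step is a convex combination, the resulting score always lies between the smallest and largest entry of S, while being pulled toward the strongest frame-word matches.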

Mixed Supervision for Instance Learning in Object Detection with Few-shot Annotation

  • Yi Zhong
  • Chengyao Wang
  • Shiyong Li
  • Zhu Zhou
  • Yaowei Wang
  • Wei-Shi Zheng

Mixed supervision for object detection (MSOD), which utilizes image-level annotations together with a small number of instance-level annotations, has emerged as an efficient tool: it alleviates the requirement for a large amount of costly instance-level annotations while providing effective instance supervision missing from previous methods that use only image-level annotations. In this work, we introduce mixed supervision instance learning (MSIL), a novel MSOD framework that leverages a handful of instance-level annotations to provide more explicit and implicit supervision. Rather than just adding instance-level annotations directly to the detection loss functions, we aim to dig out more effective explicit and implicit relations between these two levels of annotation. In particular, we first propose the Instance-Annotation Guided Image Classification strategy, which provides explicit guidance from instance-level annotations by using positional relations to force the image classifier to focus on the proposals that contain the correct object. Then, to exploit more implicit interaction between the mixed annotations, an instance reproduction strategy guided by the extra instance-level annotations is developed to generate more accurate pseudo ground truth, yielding a more discriminative detector. Finally, a false target instance mining strategy refines the above processing by enriching the number and diversity of training instances with position and score information. Our experiments show that the proposed MSIL framework outperforms recent state-of-the-art mixed supervised detectors by a large margin on both the Pascal VOC2007 and MS-COCO datasets.

Improved Deep Unsupervised Hashing via Prototypical Learning

  • Zeyu Ma
  • Wei Ju
  • Xiao Luo
  • Chong Chen
  • Xian-Sheng Hua
  • Guangming Lu

Hashing has become increasingly popular in approximate nearest neighbor search in recent years due to its storage and computational efficiency. While deep unsupervised hashing has shown encouraging performance recently, its efficacy in the more realistic unsupervised situation is far from satisfactory due to two limitations. On one hand, existing methods usually neglect the underlying global semantic structure in the deep feature space. On the other hand, they also ignore reconstructing the global structure in the hash code space. In this research, we develop a simple yet effective approach named deeP UnsupeRvised hashing via Prototypical LEarning (PURPLE). Specifically, PURPLE introduces both feature prototypes and hashing prototypes to model the underlying semantic structures of the images in both the deep feature space and the hash code space. Then we impose a smoothness constraint to regularize the consistency of the global structures in the two spaces through semantic prototypical consistency learning. Moreover, our method encourages prototypical consistency across different augmentations of each image via contrastive prototypical consistency learning. Comprehensive experiments on three benchmark datasets demonstrate that PURPLE performs better than a variety of state-of-the-art retrieval methods.

Adaptive Camera Margin for Mask-guided Domain Adaptive Person Re-identification

  • Rui Wang
  • Feng Chen
  • Jun Tang
  • Pu Yan

Research on transferring a person re-identification (ReID) model learned in the source domain to other domains is of great importance, since deploying a ReID model to a new scenario is common in practical applications. Most existing unsupervised domain adaptation methods for person ReID employ the framework of pre-training in the source domain followed by clustering and fine-tuning in the target domain. However, how to reduce the intra-domain variations and narrow the inter-domain gaps is far from solved and remains a challenging problem under this framework. In this paper, we address these issues from two aspects. Firstly, a voted-mask-guided image channel shuffling strategy for data augmentation is proposed to enhance visual diversity, where image channel shuffling is used as an efficient tool to bridge the inter-domain gap, and voted masks are employed to extract the foregrounds of pedestrian images to relieve the negative effects of varied backgrounds and thereby reduce the intra-domain variations. Secondly, a novel plug-and-play metric named adaptive camera margin is proposed to fully exploit the low-cost camera tags for producing high-quality pseudo labels, which can significantly reduce the intra-domain variations without extra training cost. Specifically, the proposed network consists of a sensitive branch and an adaptive branch accompanied by our data augmentation strategy, embedded into a joint learning framework to decouple visual representations for better capturing transferable features across different domains in both stages. The adaptive camera margin is employed to pull samples with different camera IDs closer during DBSCAN clustering, which effectively and efficiently reduces the influence of intra-domain variations caused by camera shift. Comprehensive experiments show that the proposed method achieves competitive performance compared with state-of-the-art methods on the benchmark datasets.
Source code will be released at:

BadHash: Invisible Backdoor Attacks against Deep Hashing with Clean Label

  • Shengshan Hu
  • Ziqi Zhou
  • Yechao Zhang
  • Leo Yu Zhang
  • Yifeng Zheng
  • Yuanyuan He
  • Hai Jin

Due to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive works have demonstrated that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attacks against deep hashing has attracted much research effort. Nevertheless, the backdoor attack, another famous threat to DNNs, has not yet been studied for deep hashing. Although various backdoor attacks have been proposed in the field of image classification, existing approaches fail to realize a truly imperceptible backdoor attack that enjoys invisible triggers and a clean-label setting simultaneously, and they cannot meet the intrinsic demands of backdoor attacks on image retrieval.

In this paper, we propose BadHash, the first imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean label. We first propose a new conditional generative adversarial network (cGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. In order to improve the attack effectiveness, we introduce a label-based contrastive learning network LabCLN to exploit the semantic characteristics of different labels, which are subsequently used for confusing and misleading the target model to learn the embedded trigger. We finally explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that BadHash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes.

EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation

  • Xiaohao Liu
  • Zhulin Tao
  • Jiahong Shao
  • Lifang Yang
  • Xianglin Huang

The main idea of multimedia recommendation is to introduce the profile content of multimedia documents as an auxiliary signal, so as to endow recommenders with generalization ability and gain better performance. However, recent studies using non-uniform datasets roughly fuse single-modal features into multi-modal features and adopt the strategy of directly maximizing the likelihood of user preference scores, leading to single-modal bias. Owing to this architectural defect, there is still room for improvement in recent multimedia recommendation.

In this paper, we propose EliMRec, a generic and modality-agnostic framework to eliminate single-modal bias in multimedia recommendation. From our observation, biased predictive reasoning is influenced directly by a single modality rather than considering all the given views of the item. Through the novel perspective of causal inference, we manage to explain the single-modal issue and exploit the inner workings of multi-modal fusion. To eliminate single-modal bias, we enhance the bias-capture ability of a general multimedia recommendation framework and imagine several counterfactual worlds that vary one modality with the other modalities fixed or blank. Counterfactual analysis enables us to identify and eliminate the bias lying in the direct effect from single-modal features to the preference score. Extensive experiments on real-world datasets demonstrate that our method significantly improves over several state-of-the-art baselines such as LightGCN and MMGCN. Codes are available at

Patch-based Knowledge Distillation for Lifelong Person Re-Identification

  • Zhicheng Sun
  • Yadong Mu

The task of lifelong person re-identification aims to match a person across multiple cameras given continuous data streams. Like other lifelong learning tasks, it severely suffers from the so-called catastrophic forgetting problem, which refers to the notable performance degradation on previously-seen data after adapting the model to newly incoming data. To alleviate it, a few existing methods have utilized knowledge distillation to enforce consistency between the original and adapted models. However, the effectiveness of such a strategy can be largely reduced when facing the data distribution discrepancy between seen and new data. The hallmark of our work is using adaptively-chosen patches (rather than whole images as in other works) to pilot the forgetting-resistant distillation. Specifically, the technical contributions of our patch-based solution are two-fold: first, a novel patch sampler is proposed. It is fully differentiable and trained to select a diverse set of image patches that stay crucial and discriminative under streaming data. Second, with those patches we curate a novel knowledge distillation framework. Valuable patch-level knowledge, within individual patch features and their mutual relations, is well preserved by the two newly introduced distillation modules, further mitigating catastrophic forgetting. Extensive experiments on twelve person re-identification datasets clearly validate the superiority of our method over state-of-the-art competitors by large performance margins.

SESSION: Oral Session III: Engaging User with Multimedia -- Summarization, Analytics, and Storytelling

MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition

  • Xiaodong Chen
  • Wu Liu
  • Xinchen Liu
  • Yongdong Zhang
  • Jungong Han
  • Tao Mei

Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to its wide applications, such as autonomous driving and robotics. However, current methods for point cloud action recognition usually require a huge amount of data with manual annotations and a complex backbone network with high computation cost, which makes them impractical for real-world applications. Therefore, this paper considers the task of semi-supervised point cloud action recognition. We propose a Masked Pseudo-Labeling autoEncoder (MAPLE) framework to learn effective representations with much fewer annotations for point cloud action recognition. In particular, we design a novel and efficient Decoupled spatial-temporal TransFormer (DestFormer) as the backbone of MAPLE. In DestFormer, the spatial and temporal dimensions of the 4D point cloud videos are decoupled to achieve efficient self-attention for learning both long-term and short-term features. Moreover, to learn discriminative features from fewer annotations, we design a masked pseudo-labeling autoencoder structure that guides the DestFormer to reconstruct the features of masked frames from the available frames. More importantly, for unlabeled data, we exploit the pseudo-labels from the classification head as the supervision signal for reconstructing the features of the masked frames. Finally, comprehensive experiments demonstrate that MAPLE achieves superior results on three public benchmarks and outperforms the state-of-the-art method by 8.08% accuracy on the MSR-Action3D dataset.

DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing

  • Xun Jiang
  • Xing Xu
  • Zhiguo Chen
  • Jingran Zhang
  • Jingkuan Song
  • Fumin Shen
  • Huimin Lu
  • Heng Tao Shen

The Weakly-Supervised Audio-Visual Video Parsing (AVVP) task aims to parse a video into temporal segments and predict their event categories in terms of modalities, labeling them as either audible, visible, or both. Since temporal boundary and modality annotations are not provided and only video-level event labels are available, this task is more challenging than conventional video understanding tasks. Most previous works attempt to analyze videos by jointly modeling the audio and video data and then learning information from segment-level features with fixed lengths. However, such a design has two defects: 1) the varied semantic information hidden in different temporal lengths is neglected, which may lead the models to learn incorrect information; 2) due to the joint context modeling, the unique features of the different modalities are not fully explored. In this paper, we propose a novel AVVP framework termed Dual Hierarchical Hybrid Network (DHHN) to tackle the above two problems. Our DHHN method consists of three components: 1) a hierarchical context modeling network for extracting different semantics over multiple temporal lengths; 2) a modality-wise guiding network for learning unique information from the different modalities; 3) a dual-stream framework generating audio and visual predictions separately. It maintains the best adaptations to the different modalities, further boosting video parsing performance. Extensive quantitative and qualitative experiments demonstrate that our proposed method establishes the new state-of-the-art performance on the AVVP task.

SESSION: Poster Session III: Engaging User with Multimedia -- Summarization, Analytics, and Storytelling

Weakly-Supervised Temporal Action Alignment Driven by Unbalanced Spectral Fused Gromov-Wasserstein Distance

  • Dixin Luo
  • Yutong Wang
  • Angxiao Yue
  • Hongteng Xu

Temporal action alignment aims at segmenting videos into clips and tagging each clip with a textual description, which is an important task in video semantic analysis. Most existing methods, however, rely on supervised learning to train their alignment models, whose applications are limited because of the common insufficiency of labeled videos. To mitigate this issue, we propose a weakly-supervised temporal action alignment method based on a novel computational optimal transport technique called the unbalanced spectral fused Gromov-Wasserstein (US-FGW) distance. Instead of using videos with known clips and corresponding textual tags, our method just needs each training video to be associated with a set of (unsorted) texts, without requiring fine-grained correspondence between the frames and the texts. Given such weakly-supervised video-text pairs, our method trains the representation models of the video frames and the texts jointly in a probabilistic or deterministic autoencoding architecture and penalizes the US-FGW distance between the distribution of visual latent codes and that of textual latent codes. We compute the US-FGW distance efficiently by leveraging the Bregman ADMM algorithm. Furthermore, we generalize the classic contrastive learning framework and reformulate it based on the proposed US-FGW distance, which provides a new viewpoint on contrastive learning for our problem. Experimental results show that our method and its variants outperform state-of-the-art weakly-supervised temporal action alignment methods, with results even comparable to those derived by supervised learning methods on some specific evaluation measurements. The code is available at
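As background for the optimal-transport machinery above: entropic-regularized optimal transport, the standard building block on which fused Gromov-Wasserstein variants are constructed, can be computed with Sinkhorn iterations. The sketch below is a generic toy implementation of that building block only, not the US-FGW distance or the Bregman ADMM solver the paper uses; all names and hyperparameters are illustrative.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Entropic-regularized optimal transport between histograms a and b
    under cost matrix C, via standard Sinkhorn scaling iterations."""
    K = np.exp(-C / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)          # scale columns to match marginal b
        u = a / (K @ v)            # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]      # transport plan
    return P, float(np.sum(P * C))       # plan and its transport cost

# toy example: two uniform 4-bin histograms, zero cost on the diagonal,
# so the optimal plan is (nearly) diagonal and the cost is near zero
a = np.ones(4) / 4
b = np.ones(4) / 4
C = 1.0 - np.eye(4)
P, cost = sinkhorn(a, b, C)
```

Gromov-Wasserstein-style distances replace the fixed cost matrix with a comparison of intra-domain structures, which is what lets the paper align frame distributions with text distributions without a shared metric space.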

A Knowledge Augmented and Multimodal-Based Framework for Video Summarization

  • Jiehang Xie
  • Xuanbai Chen
  • Shao-Ping Lu
  • Yulu Yang

Video summarization aims to generate a compact version of a lengthy video that retains its primary content. In general, humans are adept at producing a high-quality video summary, because they acquire the crucial content through multiple dimensions of information and possess abundant background knowledge about the original video. However, existing methods rarely consider multichannel information and ignore the impact of external knowledge, limiting the quality of the generated summaries. This paper proposes a knowledge-augmented and multimodal-based video summarization method, termed KAMV, to address this problem. Specifically, we design a knowledge encoder with a hybrid method consisting of generation and retrieval, to capture descriptive content and latent connections between events and entities based on an external knowledge base, which provides rich implicit knowledge for better comprehending the video being viewed. Furthermore, to explore the interactions among visual, audio, and implicit knowledge features and to emphasize the content most relevant to the desired summary, we present a fusion module under the supervision of this multimodal information. Extensive experiments on four public datasets demonstrate the superior performance of the proposed KAMV compared to state-of-the-art video summarization approaches.

MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer

  • Dizhan Xue
  • Shengsheng Qian
  • Quan Fang
  • Changsheng Xu

As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task of generating a story ending for a given multi-sentence story plot and an ending-related image. Unlike existing image captioning or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. To date, existing methods for IgSEG ignore the relationships between the modalities and do not integrate multimodal features appropriately. Therefore, in this work, we propose the Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses both contextual and visual information to effectively capture the multimodal dependencies in IgSEG. Firstly, we extract textual and visual features separately by employing modality-specific large-scale pretrained encoders. Secondly, we utilize a memory-augmented cross-modal attention network to learn cross-modal relationships and conduct fine-grained feature fusion effectively. Finally, a multimodal transformer decoder constructs attention among the multimodal features to learn the story dependencies and generates informative, reasonable, and coherent story endings. In experiments, extensive automatic and human evaluation results indicate the significant performance boost of our proposed MMT over state-of-the-art methods on two benchmark datasets.

An End-to-End Conditional Generative Adversarial Network Based on Depth Map for 3D Craniofacial Reconstruction

  • Niankai Zhang
  • Junli Zhao
  • Fuqing Duan
  • Zhenkuan Pan
  • Zhongke Wu
  • Mingquan Zhou
  • Xianfeng Gu

Craniofacial reconstruction is fundamental in resolving forensic cases. It is rather challenging due to the complex topology of the craniofacial model and the ambiguous relationship between a skull and the corresponding face. In this paper, we propose a novel approach for 3D craniofacial reconstruction that utilizes Conditional Generative Adversarial Networks (CGAN) based on craniofacial depth maps. More specifically, we treat craniofacial reconstruction as a mapping problem from skull to face. We represent 3D craniofacial shapes with depth maps, which include most craniofacial features needed for identification purposes and are easy to generate and feed to neural networks. We design an end-to-end neural network model based on CGAN, then train the model with paired craniofacial data to automatically learn the complex nonlinear relationship between skull and face. By introducing body mass index classes (BMIC) into the CGAN, we can achieve objective reconstruction of the 3D facial geometry corresponding to a given skull, which is a complicated 3D shape generation task involving different topologies. Comparative experiments show that our method produces accurate and verisimilar craniofacial reconstruction results.

Clustering Generative Adversarial Networks for Story Visualization

  • Bowen Li
  • Philip H. S. Torr
  • Thomas Lukasiewicz

Story visualization aims to generate a series of images, one for each sentence, that semantically match a given sequence of sentences, and the output images within a story should be consistent with each other. Current methods generate story images using a heavy architecture with two generative adversarial networks (GANs), one for image quality and one for story consistency, and also rely on additional segmentation masks or auxiliary captioning networks. In this paper, we aim to build a concise, single-GAN-based network that depends on neither additional semantic information nor captioning networks. To achieve this, we propose a contrastive-learning- and clustering-learning-based approach for story visualization. Our network utilizes contrastive losses between language and visual information to maximize the mutual information between them, and further extends this with clustering learning during training to capture semantic similarity across modalities. The discriminator in our approach thus provides the generator with comprehensive feedback regarding both image quality and story consistency at the same time, allowing a single-GAN-based network to produce high-quality synthetic results. Extensive experiments on two datasets demonstrate that our single-GAN-based network has fewer total parameters, yet achieves a major step up from previous methods: it improves FID from 78.64 to 39.17 and FSD from 94.53 to 41.18 on Pororo-SV, and establishes a strong benchmark FID of 76.51 and FSD of 19.74 on Abstract Scenes.

DeViT: Deformed Vision Transformers in Video Inpainting

  • Jiayin Cai
  • Changlin Li
  • Xin Tao
  • Chun Yuan
  • Yu-Wing Tai

This paper presents a novel video inpainting architecture named Deformed Vision Transformers (DeViT). We make three significant contributions to this task. First, we extend previous Transformers with patch alignment by introducing the Deformed Patch-based Homography Estimator (DePtH), which enriches the patch-level feature alignments in key and query with additional offsets learned from patch pairs, without additional supervision. DePtH enables our method to handle challenging scenes and agile motion with in-plane or out-of-plane deformation, on which previous methods usually fail. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve standard patch-wise feature matching by pruning out less essential features and considering the saliency map. MPPA enhances the matching accuracy between warped tokens with invalid pixels. Third, we introduce the Spatial-Temporal weighting Adaptor (STA) module to assign more accurate attention to spatial-temporal tokens under the guidance of the Deformation Factor learned from DePtH, especially for videos with agile motion. Experimental results demonstrate that our method outperforms previous methods both qualitatively and quantitatively, and achieves a new state of the art for video inpainting.

Multi-Level Spatiotemporal Network for Video Summarization

  • Ming Yao
  • Yu Bai
  • Wei Du
  • Xuejun Zhang
  • Heng Quan
  • Fuli Cai
  • Hongwei Kang

With the increasing ubiquity of camera-equipped devices, video content is widely produced in industry. Automatic video summarization allows content consumers to effectively retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. As a natural phenomenon, video fragments in different shots are richer in semantics than frames. We leverage this as a free latent supervision signal and introduce a novel model named the multi-level spatiotemporal network (MLSN). Our approach contains Multi-Level Feature Representations (MLFR) and a Local Relative Loss (LRL). The MLFR module consists of frame-level features, fragment-level features, and shot-level features with relative position encoding. For videos of different shot durations, it can flexibly capture and accommodate semantic information of different spatiotemporal granularities. LRL utilizes the partial ordering relations among the frames of each fragment to capture highly discriminative features and improve the sensitivity of the model. Our method improves on the best previously published method by 7% on our industrial products dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our method outperforms most state-of-the-art ones.

SESSION: Oral Session IV: Experience -- Interactions and Quality of Experience

TVFormer: Trajectory-guided Visual Quality Assessment on 360° Images with Transformers

  • Li Yang
  • Mai Xu
  • Tie Liu
  • Liangyu Huo
  • Xinbo Gao

Visual quality assessment (VQA) on 360° images plays an important role in optimizing immersive multimedia systems. Due to the absence of pristine 360° images in the real world, blind VQA (BVQA) on 360° images has drawn much research attention. In subjective VQA on 360° images, humans intuitively make quality-scoring decisions based on the quality degradation of each viewport observed along their head trajectories. Unfortunately, existing BVQA works for 360° images neglect the dynamic property of head trajectories with viewport interactions, thus failing to obtain human-like quality scores. In this paper, we propose a novel Transformer-based approach for trajectory-guided VQA on 360° images (named TVFormer), in which both head trajectory prediction and BVQA are accomplished for 360° images. For the first task, we develop a trajectory-aware memory updater (TMU) module to maintain the coherence and accuracy of predicted head trajectories. To capture the long-range quality dependency across time-ordered viewports, we propose a spatio-temporal factorized self-attention (STF) module in the encoder of TVFormer for the BVQA task. By implanting the predicted head trajectories into the BVQA task, we obtain human-like quality scores. Extensive experiments demonstrate the superior BVQA performance of TVFormer over state-of-the-art approaches on three benchmark datasets.

KnifeCut: Refining Thin Part Segmentation with Cutting Lines

  • Zheng Lin
  • Zheng-Peng Duan
  • Zhao Zhang
  • Chun-Le Guo
  • Ming-Ming Cheng

Objects with thin structures remain challenging for current image segmentation techniques. Their outputs often do well on the main body but handle thin parts unsatisfactorily. In practical use, these outputs inevitably require post-processing. However, repairing them is time-consuming and laborious, whether in professional editing applications (e.g., Photoshop) or with current interactive image segmentation methods (e.g., via clicks, scribbles, or polygons). To refine the thin parts of an unsatisfactory pre-segmentation, we propose an efficient interaction mode in which users only need to draw a line across the mislabeled thin part, like cutting with a knife. This low-stress and intuitive action does not require the user to aim deliberately, and is friendly to the mouse, touchpad, and mobile devices. Additionally, the line segment provides a contrasting prior, because it passes through both foreground and background regions and must contain thin-part pixels. Based on this interaction idea, we propose KnifeCut, which offers users two results: one focuses only on the target thin part, and the other refines all thin parts that share similar features with the target. To the best of our knowledge, KnifeCut is the first method to specifically address interactive thin structure refinement. Extensive experiments and visualized results further demonstrate its friendliness, convenience, and effectiveness. The project page is available on

Multi-view Layout Design for VR Concert Experience

  • Minju Kim
  • Yuhyun Lee
  • Jungjin Lee

Owing to the COVID-19 pandemic, concerts are increasingly being held online. Beyond live-streaming, it has recently become popular to utilize various realistic video technologies to add entertainment value and immersion to online concerts. We conducted a multi-view layout design study in a virtual reality environment with a head-mounted display to help users effectively explore and immerse themselves in multiple videos from various angles. Based on an analysis of existing user interfaces for multi-view navigation and the characteristics of virtual reality, we propose four layouts: 1) an evenly divided space, 2) an evenly divided designated space, 3) a widget type, and 4) an avatar type. We implemented a prototype using Korean pop concerts, where multi-view videos are most actively utilized, and then conducted a user study to evaluate the usability of and preferences among the proposed layouts. The results show that it is adequate to arrange the multi-view videos within a 60° to 110° space to the left and right of the main view, a range that users can comfortably access. In addition, when placing multiple videos in a designated space, it is helpful to use visual effects or simple avatars to avoid placing a visual burden on the users.

Magic ELF: Image Deraining Meets Association Learning and Transformer

  • Kui Jiang
  • Zhongyuan Wang
  • Chen Chen
  • Zheng Wang
  • Laizhong Cui
  • Chia-Wen Lin

Convolutional neural networks (CNNs) and Transformers have achieved great success in multimedia applications. However, little effort has been made to effectively and efficiently harmonize these two architectures for image deraining. This paper aims to unify the two architectures to take advantage of their respective learning merits for image deraining. In particular, the local connectivity and translation equivariance of the CNN and the global aggregation ability of self-attention (SA) in the Transformer are fully exploited for specific local-context and global-structure representations. Based on the observation that the rain distribution reveals the degradation location and degree, we introduce a degradation prior to help background recovery and accordingly present an association refinement deraining scheme. A novel multi-input attention module (MAM) is proposed to associate rain perturbation removal with background recovery. Moreover, we equip our model with effective depth-wise separable convolutions to learn specific feature representations and trade off computational complexity. Extensive experiments show that our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average, while requiring only 11.7% of its computational cost and 42.1% of its parameters.

Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality Assessment

  • Liang Liao
  • Kangmin Xu
  • Haoning Wu
  • Chaofeng Chen
  • Wenxiu Sun
  • Qiong Yan
  • Weisi Lin

With the rapid growth of in-the-wild videos taken by non-specialists, blind video quality assessment (VQA) has become a challenging and demanding problem. Although many efforts have been made to solve this problem, it remains unclear how the human visual system (HVS) relates to the temporal quality of videos. Meanwhile, recent work has found that the frames of natural videos, when transformed into the perceptual domain of the HVS, tend to form a straight trajectory of representations. With the insight that distortion impairs perceived video quality and results in a curved trajectory of the perceptual representation, we propose a temporal perceptual quality index (TPQI) that measures temporal distortion by describing the graphic morphology of the representation. Specifically, we first extract video perceptual representations from the lateral geniculate nucleus (LGN) and primary visual area (V1) of the HVS, and then measure the straightness and compactness of their trajectories to quantify the degradation in the naturalness and content continuity of the video. Experiments show that the perceptual representation in the HVS is an effective way of predicting subjective temporal quality, and thus TPQI can, for the first time, achieve performance comparable to spatial quality metrics and be even more effective in assessing videos with large temporal variations. We further demonstrate that by combining with NIQE, a spatial quality metric, TPQI can achieve top performance on popular in-the-wild video datasets. More importantly, TPQI does not require any additional information beyond the video being evaluated and thus can be applied to any dataset without parameter tuning. Source code is available at
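The abstract does not spell out how trajectory straightness is computed; as a rough, generic illustration of the idea (our own sketch, with a function name of our choosing, not the authors' implementation), one can score a sequence of per-frame representations by the mean cosine between successive displacement vectors:

```python
import numpy as np

def trajectory_straightness(reps):
    """Mean cosine of the turn angles between successive displacement
    vectors of a representation sequence.

    reps: (T, D) array of per-frame representations. Returns a value in
    [-1, 1]; 1.0 means the trajectory is perfectly straight.
    """
    reps = np.asarray(reps, dtype=float)
    diffs = np.diff(reps, axis=0)                     # displacement vectors
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    units = diffs / np.clip(norms, 1e-12, None)       # unit directions
    cosines = np.sum(units[:-1] * units[1:], axis=1)  # cos of turn angles
    return float(np.mean(cosines))

# A straight trajectory scores 1.0; a zig-zag scores lower.
line = np.outer(np.arange(5), np.ones(3))
print(trajectory_straightness(line))  # → 1.0
```

A curved (distorted) trajectory yields smaller cosines, consistent with the paper's premise that distortion bends the perceptual trajectory.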

You Only Align Once: Bidirectional Interaction for Spatial-Temporal Video Super-Resolution

  • Mengshun Hu
  • Kui Jiang
  • Zhixiang Nie
  • Zheng Wang

Spatial-Temporal Video Super-Resolution (ST-VSR) technology generates high-quality videos with higher resolution and higher frame rates. Existing advanced methods accomplish ST-VSR tasks through the association of Spatial and Temporal video super-resolution (S-VSR and T-VSR). These methods require two alignments and fusions in S-VSR and T-VSR, which is obviously redundant and fails to sufficiently explore the information flow of consecutive spatial LR frames. Although bidirectional learning (future-to-past and past-to-future) was introduced to cover all input frames, the direct fusion of final predictions fails to sufficiently exploit intrinsic correlations of bidirectional motion learning and spatial information from all frames. We propose an effective yet efficient recurrent network with bidirectional interaction for ST-VSR, where only one alignment and fusion is needed. Specifically, it first performs backward inference from future to past, and then follows forward inference to super-resolve intermediate frames. The backward and forward inferences are assigned to learn structures and details to simplify the learning task with joint optimizations. Furthermore, a Hybrid Fusion Module (HFM) is designed to aggregate and distill information to refine spatial information and reconstruct high-quality video frames. Extensive experiments on two public datasets demonstrate that our method outperforms state-of-the-art methods in efficiency, and reduces calculation cost by about 22%.

A Deep Learning based No-reference Quality Assessment Model for UGC Videos

  • Wei Sun
  • Xiongkuo Min
  • Wei Lu
  • Guangtao Zhai

Quality assessment for User Generated Content (UGC) videos plays an important role in ensuring the viewing experience of end-users. Previous UGC video quality assessment (VQA) studies either use image recognition models or image quality assessment (IQA) models to extract frame-level features of UGC videos for quality regression, which are sub-optimal solutions because of the domain shifts between those tasks and the UGC VQA task. In this paper, we propose a very simple but effective UGC VQA model, which addresses this problem by training an end-to-end spatial feature extraction network to directly learn the quality-aware spatial feature representation from the raw pixels of video frames. We also extract motion features to measure temporal distortions that the spatial features cannot model. The proposed model utilizes very sparse frames to extract spatial features and dense frames (i.e., the video chunk) at a very low spatial resolution to extract motion features, and thereby has low computational complexity. With these better quality-aware features, we only use a simple multilayer perceptron (MLP) network to regress them into chunk-level quality scores, and then adopt a temporal average pooling strategy to obtain the video-level quality score. We further introduce a multi-scale quality fusion strategy to solve the problem of VQA across different spatial resolutions, where the multi-scale weights are obtained from the contrast sensitivity function of the human visual system. The experimental results show that the proposed model achieves the best performance on five popular UGC VQA databases, which demonstrates its effectiveness.

SESSION: Poster Session IV: Experience - Interactions and Quality of Experience

Improving Meeting Inclusiveness using Speech Interruption Analysis

  • Szu-Wei Fu
  • Yaran Fan
  • Yasaman Hosseinkashi
  • Jayant Gupchup
  • Ross Cutler

Meetings are a pervasive method of communication within all types of companies and organizations, and the use of remote collaboration systems to conduct meetings has increased dramatically since the COVID-19 pandemic. However, not all meetings are inclusive, especially in terms of the participation rates among attendees. In a recent large-scale survey conducted at Microsoft, the top suggestion given by meeting participants for improving inclusiveness was to improve the ability of remote participants to interrupt and acquire the floor during meetings. We show that the use of the virtual raise hand (VRH) feature can lead to an increase in predicted meeting inclusiveness at Microsoft. One challenge is that VRH is used in less than 1% of all meetings. In order to drive adoption of its usage to improve inclusiveness (and participation), we present a machine learning-based system that predicts when a meeting participant attempts to obtain the floor but fails to interrupt (termed a 'failed interruption'). This prediction can be used to nudge the user to raise their virtual hand within the meeting. We believe this is the first failed speech interruption detector; its performance on a realistic test set has an area under the curve (AUC) of 0.95 with a true positive rate (TPR) of 50% at a false positive rate (FPR) of 1%. To our knowledge, this is also the first dataset of interruption categories (including the failed interruption category) for remote meetings. Finally, we believe this is the first such system designed to improve meeting inclusiveness through speech interruption analysis and active intervention.
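The reported operating point (TPR at a fixed FPR) can be computed from raw detector scores as follows; this is a generic evaluation sketch with names of our own choosing, not the authors' code:

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """True-positive rate at the threshold whose false-positive rate
    does not exceed target_fpr (assuming distinct negative scores).

    scores: detector scores (higher = more likely a failed interruption);
    labels: 1 for failed interruptions, 0 for everything else.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    neg = np.sort(scores[labels == 0])[::-1]      # negatives, descending
    k = int(np.floor(target_fpr * neg.size))      # allowed false positives
    thr = neg[k] if k < neg.size else -np.inf     # k negatives lie above thr
    return float(np.mean(scores[labels == 1] > thr))
```

For example, with 100 negative scores and `target_fpr=0.01`, the threshold is set so that exactly one negative is (falsely) flagged, and the function returns the fraction of positives scoring above it.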

Transductive Aesthetic Preference Propagation for Personalized Image Aesthetics Assessment

  • Yaohui Li
  • Yuzhe Yang
  • Huaxiong Li
  • Haoxing Chen
  • Liwu Xu
  • Leida Li
  • Yaqian Li
  • Yandong Guo

Personalized image aesthetics assessment (PIAA) aims at capturing individual aesthetic preference. Fine-tuning on personalized data has been proven to be effective for the PIAA task. However, a fixed fine-tuning strategy may cause under/over-fitting on limited personal data, and it also brings additional training cost. To alleviate these issues, we employ a meta learning-based Transductive Aesthetic Preference Propagation (TAPP-PIAA) algorithm in a regression manner to substitute the fine-tuning strategy. Specifically, each user's data is regarded as a meta-task and split into a support set and a query set. Then, we extract deep aesthetic features with a pre-trained generic image aesthetics assessment (GIAA) model. Next, we treat image features as graph nodes and their similarities as edge weights to construct an undirected nearest-neighbor graph for inference. Instead of fine-tuning on the support set, TAPP-PIAA propagates aesthetic preference from the support set to the query set with a predefined propagation formula. Finally, to learn a generalizable aesthetic representation for various users, we optimize TAPP-PIAA across different users within a meta-learning framework. Experimental results indicate that our TAPP-PIAA surpasses state-of-the-art methods on benchmark databases.

Multi-Mode Interactive Image Segmentation

  • Zheng Lin
  • Zhao Zhang
  • Ling-Hao Han
  • Shao-Ping Lu

Large-scale pixel-level annotations are scarce for current data-hungry medical image analysis models. For the fast acquisition of annotations, an economical and efficient interactive medical image segmentation method is urgently needed. However, current techniques often fail in many cases, as their interaction styles cannot cope with the various inherent ambiguities of medical images, such as irregular shapes and fuzzy boundaries. To address this problem, we propose a multi-mode interactive segmentation framework for medical images, in which diverse interaction modes can be chosen and allowed to cooperate with each other. In our framework, users can encircle the target regions with various initial interaction modes according to the structural complexity. Then, based on the initial segmentation, users can jointly utilize region and boundary interactions to refine mislabeled regions caused by different ambiguities. We evaluate our framework on a wide range of medical images, including X-ray, CT, MRI, ultrasound, endoscopy, and photographic images. Extensive experimental results and a user study show that our framework is a reliable choice for image annotation in various real scenes.

Deep-BVQM: A Deep-learning Bitstream-based Video Quality Model

  • Nasim Jamshidi Avanaki
  • Steven Schmidt
  • Thilo Michael
  • Saman Zadtootaghaj
  • Sebastian Möller

With the rapid increase of video streaming content, high-quality video quality metrics, mainly signal-based ones such as VMAF, SSIMPLUS, and AVQM, are emerging. Besides signal-based metrics, the standardization body ITU-T Study Group 12 has developed two well-known bitstream-based video quality metrics, named P.1203 and P.1204.3. Due to their low complexity and the low level of access to bitstream data they require, these models have gained attention from network providers and service providers. In this paper, we propose a new bitstream-based model named Deep-BVQM, which outperforms the standard models on the tested datasets. While the model comes with slightly higher computational complexity, it offers frame-level quality prediction, which is essential diagnostic information for some video streaming services such as cloud gaming. Deep-BVQM is developed in two layers: first, frame quality is predicted using a lightweight CNN model; next, the latent features of the CNN are used to train an LSTM network to predict video quality over a short-term duration.

MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes

  • Anton Ratnarajah
  • Zhenyu Tang
  • Rohith Aralikatti
  • Dinesh Manocha

We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.

Quality Assessment of Image Super-Resolution: Balancing Deterministic and Statistical Fidelity

  • Wei Zhou
  • Zhou Wang

There has been a growing interest in developing image super-resolution (SR) algorithms that convert low-resolution (LR) to higher resolution images, but automatically evaluating the visual quality of super-resolved images remains a challenging problem. Here we look at the problem of SR image quality assessment (SR IQA) in a two-dimensional (2D) space of deterministic fidelity (DF) versus statistical fidelity (SF). This allows us to better understand the advantages and disadvantages of existing SR algorithms, which produce images at different clusters in the 2D space of (DF, SF). Specifically, we observe an interesting trend from more traditional SR algorithms that are typically inclined to optimize for DF while losing SF, to more recent generative adversarial network (GAN) based approaches that by contrast exhibit strong advantages in achieving high SF but sometimes appear weak at maintaining DF. Furthermore, we propose an uncertainty weighting scheme based on content-dependent sharpness and texture assessment that merges the two fidelity measures into an overall quality prediction named the Super Resolution Image Fidelity (SRIF) index, which demonstrates superior performance against state-of-the-art IQA models when tested on subject-rated datasets.

No-reference Omnidirectional Image Quality Assessment Based on Joint Network

  • Chaofan Zhang
  • Shiguang Liu

In panoramic multimedia applications, the perception quality of the omnidirectional content often comes from the observer's perception of the viewports and the overall impression after browsing. Starting from this hypothesis, this paper proposes a deep-learning based joint network to model the no-reference quality assessment of omnidirectional images. On the one hand, motivated by different scenarios that lead to different human understandings, a convolutional neural network (CNN) is devised to simultaneously encode the local quality features and the latent perception rules of different viewports, which are more likely to be noticed by the viewers. On the other hand, a recurrent neural network (RNN) is designed to capture the interdependence between viewports from their sequence representation, and then predict the impact of each viewport on the observer's overall perception. Experiments on two popular omnidirectional image quality databases demonstrate that the proposed method outperforms the state-of-the-art omnidirectional image quality metrics.

PassWalk: Spatial Authentication Leveraging Lateral Shift and Gaze on Mobile Headsets

  • Abhishek Kumar
  • Lik-Hang Lee
  • Jagmohan Chauhan
  • Xiang Su
  • Mohammad A. Hoque
  • Susanna Pirttikangas
  • Sasu Tarkoma
  • Pan Hui

Secure and usable user authentication on mobile headsets is a challenging problem. The miniature-sized touchpad on such devices hinders user interactions and impacts usability. Moreover, the most common authentication methods, i.e., entering passwords via the standard QWERTY virtual keyboard or mid-air input, are highly vulnerable to shoulder-surfing attacks. In this paper, we present PassWalk, a keyboard-less authentication system leveraging multi-modal inputs on mobile headsets. PassWalk demonstrates the feasibility of user authentication driven simultaneously by the user's gaze and lateral shifts (i.e., footsteps). The keyboard-less authentication interface in PassWalk enables users to accomplish highly mobile input of graphical passwords, composed of digital overlays and physical objects. We conduct an evaluation with 22 recruited participants (15 legitimate users and 7 attackers). Our results show that PassWalk provides high security (only 1.1% of observation attacks were successful) with a mean authentication time of 8.028 s, outperforming the commercial method of using the QWERTY virtual keyboard (21.5% successful attacks) and the research prototype LookUnLock (5.5% successful attacks). Additionally, PassWalk entails a significantly smaller user workload than current commercial methods.

Adaptive Hypergraph Convolutional Network for No-Reference 360-degree Image Quality Assessment

  • Jun Fu
  • Chen Hou
  • Wei Zhou
  • Jiahua Xu
  • Zhibo Chen

In no-reference 360-degree image quality assessment (NR 360IQA), graph convolutional networks (GCNs), which model interactions between viewports through graphs, have achieved impressive performance. However, prevailing GCN-based NR 360IQA methods suffer from three main limitations. First, they only use high-level features of the distorted image to regress the quality score, while the human visual system scores the image based on hierarchical features. Second, they simplify complex high-order interactions between viewports in a pairwise fashion through graphs. Third, in the graph construction, they only consider the spatial location of the viewport, ignoring its content characteristics. Accordingly, to address these issues, we propose an adaptive hypergraph convolutional network for NR 360IQA, denoted as AHGCN. Specifically, we first design a multi-level viewport descriptor for extracting hierarchical representations from viewports. Then, we model interactions between viewports through hypergraphs, where each hyperedge connects two or more viewports. In the hypergraph construction, we build a location-based hyperedge and a content-based hyperedge for each viewport. Experimental results on two public 360IQA databases demonstrate that our proposed approach has a clear advantage over state-of-the-art full-reference and no-reference IQA models.

DeepWSD: Projecting Degradations in Perceptual Space to Wasserstein Distance in Deep Feature Space

  • Xingran Liao
  • Baoliang Chen
  • Hanwei Zhu
  • Shiqi Wang
  • Mingliang Zhou
  • Sam Kwong

Existing deep learning-based full-reference IQA (FR-IQA) models usually predict image quality in a deterministic way by explicitly comparing features, gauging how severely distorted an image is by how far the corresponding feature lies from the space of the reference images. Herein, we look at this problem from a different viewpoint and propose to model the quality degradation in perceptual space from a statistical distribution perspective. As such, quality is measured based upon the Wasserstein distance in the deep feature domain. More specifically, the 1D Wasserstein distance is measured at each stage of the pre-trained VGG network, based on which the final quality score is computed. The deep Wasserstein distance (DeepWSD), computed on features from neural networks, enjoys better interpretability of the quality contamination caused by various types of distortion and presents an advanced quality prediction capability. Extensive experiments and theoretical analysis show the superiority of the proposed DeepWSD in terms of both quality prediction and optimization. The implementation of our method is publicly available at
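For intuition only (a toy sketch under our own assumptions, not the paper's released code), the empirical 1D Wasserstein-1 distance between two equal-size samples of feature activations reduces to the mean absolute difference of their sorted values, because the optimal transport plan in 1D matches order statistics:

```python
import numpy as np

def wasserstein1d(a, b):
    """Empirical 1D Wasserstein-1 distance between two equal-size samples.

    For equal-size samples, the optimal 1D transport plan pairs the i-th
    smallest value of a with the i-th smallest value of b, so W1 is just
    the mean absolute difference of the sorted samples.
    """
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    assert a.size == b.size, "toy version assumes equal sample sizes"
    return float(np.mean(np.abs(a - b)))

# Identical distributions give 0; a constant shift gives the shift size.
ref = np.array([0.0, 1.0, 2.0, 3.0])
print(wasserstein1d(ref, ref))        # → 0.0
print(wasserstein1d(ref, ref + 0.5))  # → 0.5
```

In a DeepWSD-style setting, `a` and `b` would be (flattened) reference and distorted feature maps from the same VGG stage, with per-stage distances combined into the final score.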

Angular Gap: Reducing the Uncertainty of Image Difficulty through Model Calibration

  • Bohua Peng
  • Mobarakol Islam
  • Mei Tu

Curriculum learning needs example difficulty to proceed from easy to hard. However, the credibility of image difficulty is rarely investigated, which can seriously affect the effectiveness of curricula. In this work, we propose Angular Gap, a measure of difficulty based on the difference in angular distance between feature embeddings and class-weight embeddings built by hyperspherical learning. To obtain trustworthy difficulty estimates, we introduce class-wise model calibration, as a post-training technique, to the learnt hyperspherical space. This bridges the gap between probabilistic model calibration and angular distance estimation of hyperspherical learning. We show the superiority of our calibrated Angular Gap over recent difficulty metrics on CIFAR10-H and ImageNetV2. We further propose a curriculum based on Angular Gap for unsupervised domain adaptation that transitions from learning easy samples to mining hard samples. We combine this curriculum with a state-of-the-art self-training method, Cycle Self Training (CST). The proposed Curricular CST learns robust representations and outperforms recent baselines on Office31 and VisDA 2017.
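One plausible formalization of the angular difficulty measure described above can be sketched as follows. This is our own hypothetical reading, not the paper's exact definition: we take difficulty as the angle between a sample's feature and its true class weight, minus the smallest angle to any other class weight, so larger values indicate harder examples.

```python
import numpy as np

def angular_gap(feature, class_weights, true_label):
    """Hypothetical sketch: difficulty as the angular distance to the
    true class weight minus the smallest angular distance to any other
    class weight (larger value = harder example)."""
    f = feature / np.linalg.norm(feature)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    angles = np.arccos(np.clip(W @ f, -1.0, 1.0))  # angle to each class weight
    true_angle = angles[true_label]
    other = np.delete(angles, true_label)
    return float(true_angle - other.min())
```

An easy, well-separated sample sits close to its own class weight and far from all others, giving a negative gap; a sample that leans toward a wrong class gets a positive gap and would be scheduled late in the curriculum.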

GCL: Graph Calibration Loss for Trustworthy Graph Neural Network

  • Min Wang
  • Hao Yang
  • Qing Cheng

Despite the great success of Graph Neural Networks (GNNs), their trustworthiness is still under-explored. A very recent study suggests that GNNs are under-confident in their predictions, which is the opposite of deep neural networks. In this paper, we investigate why this is the case. We discover that the "shallow" network of GNNs is the central cause. To address this challenge, we propose a novel Graph Calibration Loss (GCL), the first end-to-end calibration method for GNNs, which reshapes the standard cross entropy loss so as to up-weight the loss of high-confidence examples. Through empirical observation and theoretical justification, we discover that GCL's calibration mechanism adds a minimal-entropy regulariser to the KL-divergence, bringing down the entropy of correctly classified samples. To evaluate the effectiveness of GCL, we train several representative GNN models using GCL as the loss function on various citation network datasets, and further apply GCL to a self-training framework. Compared to existing methods, the proposed method achieves state-of-the-art calibration performance on the node classification task and even improves the standard classification accuracy in almost all cases.
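The loss reshaping described above can be sketched in a few lines. The modulating factor `(1 + p_t)**gamma` is our own illustrative choice, not the paper's exact formulation: any factor that grows with the true-class probability would up-weight confident examples, which is the inverse of focal loss's down-weighting.

```python
import math

def graph_calibration_loss(probs, label, gamma=1.0):
    """Sketch of a confidence-up-weighted cross entropy: the standard CE
    term -log(p_t) is scaled by a factor that grows with the predicted
    probability p_t of the true class, so high-confidence (usually
    correctly classified) examples contribute more to the gradient.
    This is an illustrative form, not the paper's exact GCL."""
    p_t = probs[label]
    return (1.0 + p_t) ** gamma * (-math.log(p_t))
```

Pushing confident predictions to be even more confident lowers the entropy of correctly classified samples, which is the mechanism the abstract attributes to the minimal-entropy regulariser.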

Image Quality Assessment: From Mean Opinion Score to Opinion Score Distribution

  • Yixuan Gao
  • Xiongkuo Min
  • Yucheng Zhu
  • Jing Li
  • Xiao-Ping Zhang
  • Guangtao Zhai

Recently, many methods have been proposed to predict the image quality which is generally described by the mean opinion score (MOS) of all subjective ratings given to an image. However, few efforts focus on predicting the opinion score distribution of the image quality ratings. In fact, the opinion score distribution reflecting subjective diversity, uncertainty, etc., can provide more subjective information about the image quality than a single MOS, which is worthy of in-depth study. In this paper, we propose a convolutional neural network based on fuzzy theory to predict the opinion score distribution of image quality. The proposed method consists of three main steps: feature extraction, feature fuzzification and fuzzy transfer. Specifically, we first use the pre-trained VGG16 without fully-connected layers to extract image features. Then, the extracted features are fuzzified by fuzzy theory, which is used to model epistemic uncertainty in the process of feature extraction. Finally, a fuzzy transfer network is used to predict the opinion score distribution of image quality by learning the mapping from epistemic uncertainty to the uncertainty existing in the image quality ratings. In addition, a new loss function is designed based on the subjective uncertainty of the opinion score distribution. Extensive experimental results prove the superior prediction performance of our proposed method.

No-Reference Image Quality Assessment Using Dynamic Complex-Valued Neural Model

  • Zihan Zhou
  • Yong Xu
  • Ruotao Xu
  • Yuhui Quan

Deep convolutional neural networks (CNNs) have become a promising approach to no-reference image quality assessment (NR-IQA). This paper aims at improving the power of CNNs for NR-IQA in two aspects. Firstly, motivated by the deep connection between complex-valued transforms and human visual perception, we introduce complex-valued convolutions and phase-aware activations beyond traditional real-valued CNNs, which improves the accuracy of NR-IQA without bringing noticeable additional computational costs. Secondly, considering the content-awareness of visual quality perception, we include a dynamic filtering module for better extracting content-aware features, which predicts features based on both local content and global semantics. These two improvements lead to a complex-valued content-aware neural NR-IQA model with good generalization. Extensive experiments on both synthetically and authentically distorted data have demonstrated the state-of-the-art performance of the proposed approach.

Hybrid Conditional Deep Inverse Tone Mapping

  • Tong Shao
  • Deming Zhai
  • Junjun Jiang
  • Xianming Liu

Emerging modern displays are capable of rendering ultra-high definition (UHD) media contents with high dynamic range (HDR) and wide color gamut (WCG). Although more and more such native contents are being produced, the total amount remains severely lacking. Considering the massive amount of exploitable legacy contents with standard dynamic range (SDR), there is an urgent demand for proper conversion techniques. In this paper, we tackle the conversion task from SDR to HDR-WCG for media contents and consumer displays. We propose a deep learning based SDR-to-HDR solution, Hybrid Conditional Deep Inverse Tone Mapping (HyCondITM), which is an end-to-end trainable framework including global transform, local adjustment, and detail refinement in a single unified pipeline. We present a hybrid condition network that can simultaneously extract both global and local priors for guidance to achieve scene-adaptive and spatially-variant manipulations. Experiments show that our method achieves state-of-the-art performance in both quantitative comparisons and visual quality, outperforming the previous methods.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study

  • Yili Jin
  • Junhua Liu
  • Fangxin Wang
  • Shuguang Cui

360° videos have experienced booming development in recent years. Compared to traditional videos, 360° videos are featured with uncertain user behaviors, bringing opportunities as well as challenges. Datasets are necessary for researchers and developers to explore new ideas and conduct reproducible analyses for fair comparisons among different solutions. However, existing related datasets mostly focused on users' field of view (FoV), ignoring the more important eye gaze information, not to mention the integrated extraction and analysis of both FoV and eye gaze. Besides, users' behavior patterns are highly related to videos, yet most existing datasets only contained videos with subjective and qualitative classification by video genre, which lacks quantitative analysis and fails to characterize the intrinsic properties of a video scene. To this end, we first propose a quantitative taxonomy for 360° videos that contains three objective technical metrics. Based on this taxonomy, we collect a dataset containing users' head and gaze behaviors simultaneously, which outperforms existing datasets with rich dimensions, large scale, strong diversity, and high frequency. Then we conduct a pilot study on users' behaviors and obtain some interesting findings, for example that a user's head direction tends to follow their gaze direction after a characteristic time interval. An application case study on tile-based 360° video streaming based on our dataset is then conducted, demonstrating a great performance improvement of existing works by leveraging our provided gaze information. Our dataset is available at

SESSION: Oral Session V: Experience -- Art and Culture

Im2Oil: Stroke-Based Oil Painting Rendering with Linearly Controllable Fineness Via Adaptive Sampling

  • Zhengyan Tong
  • Xiaohang Wang
  • Shengchao Yuan
  • Xuanhong Chen
  • Junjie Wang
  • Xiangzhong Fang

This paper proposes a novel stroke-based rendering (SBR) method that translates images into vivid oil paintings. Previous SBR techniques usually formulate the oil painting problem as pixel-wise approximation. Different from this technical route, we treat oil painting creation as an adaptive sampling problem. Firstly, we compute a probability density map based on the texture complexity of the input image. Then we use the Voronoi algorithm to sample a set of pixels as the stroke anchors. Next, we search and generate an individual oil stroke at each anchor. Finally, we place all the strokes on the canvas to obtain the oil painting. By adjusting the hyper-parameter maximum sampling probability, we can control the oil painting fineness in a linear manner. Comparison with existing state-of-the-art oil painting techniques shows that our results have higher fidelity and more realistic textures. A user opinion test demonstrates that people prefer our oil paintings to the results of other methods. More interesting results and the code are in
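The density-driven anchor sampling described above can be sketched as follows. This is a hypothetical simplification, not the paper's code: we use local gradient magnitude as a stand-in for "texture complexity", clip the density at a maximum sampling probability (the fineness knob the abstract mentions), and draw anchors by direct weighted sampling instead of the Voronoi step.

```python
import numpy as np

def sample_stroke_anchors(gray, n_anchors, p_max=0.05, rng=None):
    """Illustrative sketch: build a density map from local gradient
    magnitude as a proxy for texture complexity, clip it at p_max,
    normalize, and draw anchor pixels from the resulting distribution."""
    rng = np.random.default_rng(rng)
    gy, gx = np.gradient(gray.astype(float))
    density = np.hypot(gx, gy) + 1e-6          # avoid zero-probability pixels
    density = np.minimum(density / density.max(), p_max)
    p = (density / density.sum()).ravel()
    idx = rng.choice(gray.size, size=n_anchors, replace=False, p=p)
    return np.stack(np.unravel_index(idx, gray.shape), axis=1)  # (row, col) pairs
```

Raising `p_max` flattens the density toward uniform sampling, so textured regions no longer dominate and strokes spread more evenly, which matches the abstract's claim that this one hyper-parameter controls fineness.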

ReLyMe: Improving Lyric-to-Melody Generation by Incorporating Lyric-Melody Relationships

  • Chen Zhang
  • Luchin Chang
  • Songruoyao Wu
  • Xu Tan
  • Tao Qin
  • Tie-Yan Liu
  • Kejun Zhang

Lyric-to-melody generation, which generates melody according to given lyrics, is one of the most important automatic music composition tasks. With the rapid development of deep learning, previous works address this task with end-to-end neural network models. However, deep learning models cannot well capture the strict but subtle relationships between lyrics and melodies, which compromises the harmony between lyrics and generated melodies. In this paper, we propose ReLyMe, a method that incorporates Relationships between Lyrics and Melodies from music theory to ensure the harmony between lyrics and melodies. Specifically, we first introduce several principles that lyrics and melodies should follow in terms of tone, rhythm, and structure relationships. These principles are then integrated into neural network lyric-to-melody models by adding corresponding constraints during the decoding process to improve the harmony between lyrics and melodies. We use a series of objective and subjective metrics to evaluate the generated melodies. Experiments on both English and Chinese song datasets show the effectiveness of ReLyMe, demonstrating the superiority of incorporating lyric-melody relationships from the music domain into neural lyric-to-melody generation.

SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

  • Zihao Wang
  • Kejun Zhang
  • Yuxing Wang
  • Chen Zhang
  • Qihao Liang
  • Pengfei Yu
  • Yongsheng Feng
  • Wenbo Liu
  • Yikai Wang
  • Yuntao Bao
  • Yiheng Yang

Real-time music accompaniment generation has a wide range of applications in the music industry, such as music education and live performances. However, automatic real-time music accompaniment generation is still understudied and often faces a trade-off between logical latency and exposure bias. In this paper, we propose SongDriver, a real-time music accompaniment generation system with neither logical latency nor exposure bias. Specifically, SongDriver divides one accompaniment generation task into two phases: 1) The arrangement phase, where a Transformer model first arranges chords for input melodies in real-time, and caches the chords for the next phase instead of playing them out. 2) The prediction phase, where a CRF model generates playable multi-track accompaniments for the coming melodies based on previously cached chords. With this two-phase strategy, SongDriver directly generates the accompaniment for the upcoming melody, achieving zero logical latency. Furthermore, when predicting chords for a timestep, SongDriver refers to the cached chords from the first phase rather than its previous predictions, which avoids the exposure bias problem. Since the input length is often constrained under real-time conditions, another potential problem is the loss of long-term sequential information. To make up for this disadvantage, we extract four musical features from a long-term music piece before the current time step as global information. In the experiment, we train SongDriver on some open-source datasets and an original àiMusic Dataset built from Chinese-style modern pop music sheets. The results show that SongDriver outperforms existing SOTA (state-of-the-art) models on both objective and subjective metrics, while significantly reducing the physical latency.

CACOLIT: Cross-domain Adaptive Co-learning for Imbalanced Image-to-Image Translation

  • Yijun Wang
  • Tao Liang
  • Jianxin Lin

State-of-the-art unsupervised image-to-image translation (I2I) methods have made great progress on transferring images from a source domain X to a target domain Y. However, training these unsupervised I2I models on imbalanced target domain (e.g., Y with limited samples) usually causes mode collapse, which has not been well solved in current literature. In this work, we propose a new Cross-domain Adaptive Co-learning paradigm, CACOLIT, to alleviate the imbalanced unsupervised I2I training problem. Concretely, CACOLIT first constructs a teacher translation model by introducing an auxiliary domain along with source domain as well as two complementary student translation models formulating an I2I closed loop. Then, the two student models are simultaneously learned by transferring correspondence knowledge from teacher model in an interactive way. With extensive experiments on both human face style transfer and animal face translation tasks, we demonstrate that our adaptive co-learning model effectively transfers correspondence knowledge from teacher model to student models and generates more diverse and realistic images than existing I2I methods both qualitatively and quantitatively.

EuglPollock: Rethinking Interspecies Collaboration through Art Making

  • Kyungwon Lee
  • Yu-Kyung Jang
  • Jaewoo Jung
  • Dong Hwan Kim
  • Hyun Jean Lee
  • Seung Ah Lee

Humans are no longer the exclusive creators of art; art can now be produced by non-human actors such as artificial intelligence, machines, or animals. This paper presents EuglPollock, a platform for creating artwork through interactions between humans and algae called Euglena gracilis. Through light-mediated interactions between human users and microorganisms under a microscope, EuglPollock generates various versions of artworks, each of them unique and non-repeatable. This paper proposes a new method to create art while simultaneously raising interest in microorganisms and microbiology through interspecies collaborations.

SESSION: Poster Session V: Experience -- Art and Culture

Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

  • Nisha Huang
  • Fan Tang
  • Weiming Dong
  • Changsheng Xu

Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, thereby limiting the expressiveness of the model and the diversity of generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, which is a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the effectiveness of the combination of the diffusion model and multimodal guidance. Code is available at

AesUST: Towards Aesthetic-Enhanced Universal Style Transfer

  • Zhizhong Wang
  • Zhanjie Zhang
  • Lei Zhao
  • Zhiwen Zuo
  • Ailin Li
  • Wei Xing
  • Dongming Lu

Recent studies have shown remarkable success in universal style transfer which transfers arbitrary visual styles to content images. However, existing approaches suffer from the aesthetic-unrealistic problem that introduces disharmonious patterns and evident artifacts, making the results easy to spot from real paintings. To address this limitation, we propose AesUST, a novel Aesthetic-enhanced Universal Style Transfer approach that can generate aesthetically more realistic and pleasing results for arbitrary styles. Specifically, our approach introduces an aesthetic discriminator to learn the universal human-delightful aesthetic features from a large corpus of artist-created paintings. Then, the aesthetic features are incorporated to enhance the style transfer process via a novel Aesthetic-aware Style-Attention (AesSA) module. Such an AesSA module enables our AesUST to efficiently and flexibly integrate the style patterns according to the global aesthetic channel distribution of the style image and the local semantic spatial distribution of the content image. Moreover, we also develop a new two-stage transfer training strategy with two aesthetic regularizations to train our model more effectively, further improving stylization performance. Extensive experiments and user studies demonstrate that our approach synthesizes aesthetically more harmonious and realistic results than the state of the art, greatly narrowing the disparity with real artist-created paintings. Our code is available at

Semi-supervised Human Pose Estimation in Art-historical Images

  • Matthias Springstein
  • Stefanie Schneider
  • Christian Althaus
  • Ralph Ewerth

Gesture as language of non-verbal communication has been theoretically established since the 17th century. However, its relevance for the visual arts has been expressed only sporadically. This may be primarily due to the sheer overwhelming amount of data that traditionally had to be processed by hand. With the steady progress of digitization, though, a growing number of historical artifacts have been indexed and made available to the public, creating a need for automatic retrieval of art-historical motifs with similar body constellations or poses. Since the domain of art differs significantly from existing real-world data sets for human pose estimation due to its style variance, this presents new challenges. In this paper, we propose a novel approach to estimate human poses in art-historical images. In contrast to previous work that attempts to bridge the domain gap with pre-trained models or through style transfer, we suggest semi-supervised learning for both object and keypoint detection. Furthermore, we introduce a novel domain-specific art data set that includes both bounding box and keypoint annotations of human figures. Our approach achieves significantly better results than methods that use pre-trained models or style transfer.

Understanding and Identifying Artwork Plagiarism with the Wisdom of Designers: A Case Study on Poster Artworks

  • Shenglan Cui
  • Fang Liu
  • Tongqing Zhou
  • Mohan Zhang

The wide sharing and rapid dissemination of digital artworks have aggravated the issue of plagiarism, raising significant concerns in cultural preservation and copyright protection. Yet, modes of plagiarism are formally uncharted, so plagiarism detection in practice falls back on rough duplicate checking. This work is thus devoted to understanding artwork plagiarism, with poster design as the running case, for building more dedicated detection techniques. As the first study of its kind, we elaborate on 8 elements that form unique posters and 6 judgement criteria for plagiarism using an exploratory study with designers. Second, we build a novel poster dataset with plagiarism annotations according to the criteria. Third, we propose models, leveraging the combination of primary elements and criteria of plagiarism, to find suspect instances in a retrieval process. The models are trained under the context of modern artwork and evaluated on the poster plagiarism dataset. The proposal is shown to outperform the baseline with superior Top-K accuracy (~33%) and retrieval performance (~42%).

REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

  • Quanwei Yang
  • Xinchen Liu
  • Wu Liu
  • Hongtao Xie
  • Xiaoyan Gu
  • Lingyun Yu
  • Yongdong Zhang

Human Video Motion Transfer (HVMT) aims to, given an image of a source person, generate his/her video that imitates the motion of the driving person. Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation based on the flow estimated from the source person image and each driving video frame. However, these methods always generate obvious artifacts due to the dramatic differences in poses, scales, and shifts between the source person and the driving person. To overcome these challenges, this paper presents a novel REgion-to-whole human MOtion Transfer (REMOT) framework based on GANs. To generate realistic motions, the REMOT adopts a progressive generation paradigm: it first generates each body part in the driving pose without flow-based warping, then composites all parts into a complete person of the driving motion. Moreover, to preserve the natural global appearance, we design a Global Alignment Module to align the scale and position of the source person with those of the driving person based on their layouts. Furthermore, we propose a Texture Alignment Module to keep each part of the person aligned according to the similarity of the texture. Finally, through extensive quantitative and qualitative experiments, our REMOT achieves state-of-the-art results on two public benchmarks.

GroupDancer: Music to Multi-People Dance Synthesis with Style Collaboration

  • Zixuan Wang
  • Jia Jia
  • Haozhe Wu
  • Junliang Xing
  • Jinghe Cai
  • Fanbo Meng
  • Guowen Chen
  • Yanfeng Wang

Different people dance in different styles. So when multiple people dance together, the phenomenon of style collaboration occurs: people need to seek common points while reserving differences in various dancing periods. Thus, we introduce a novel Music-driven Group Dance Synthesis task. Compared with single-person dance synthesis explored by most previous works, modeling the style collaboration phenomenon and choreographing for multiple people are more complicated and challenging. Moreover, the lack of sufficient records for conducting multi-people choreography in prior datasets further aggravates this problem. To address these issues, we construct a rich-annotated 3D Multi-Dancer Choreography dataset (MDC) and newly devise a metric SCEU for style collaboration evaluation. To the best of our knowledge, MDC is the first 3D dance dataset that collects both individual and collaborated music-dance pairs. Based on MDC, we present a novel framework, GroupDancer, consisting of three stages: Dancer Collaboration, Motion Choreography and Motion Transition. The Dancer Collaboration stage determines when and which dancers should collaborate their dancing styles from music. Afterward, the Motion Choreography stage produces a motion sequence for each dancer. Finally, the Motion Transition stage fills the gaps between the motions to achieve fluent and natural group dance. To make GroupDancer trainable from end to end and able to synthesize group dance with style collaboration, we propose mixed training and selective updating strategies. Comprehensive evaluations on the MDC dataset demonstrate that the proposed GroupDancer model synthesizes satisfactory group dance results with style collaboration.

CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising

  • Daqian Shi
  • Xiaolei Diao
  • Lida Shi
  • Hao Tang
  • Yang Chi
  • Chuntao Li
  • Hao Xu

Degraded images commonly exist in the general sources of character images, leading to unsatisfactory character recognition results. Existing methods have dedicated efforts to restoring degraded character images. However, the denoising results obtained by these methods do not appear to improve character recognition performance. This is mainly because current methods only focus on pixel-level information and ignore critical features of a character, such as its glyph, resulting in character-glyph damage during the denoising process. In this paper, we introduce a novel generic framework based on glyph fusion and attention mechanisms, i.e., CharFormer, for precisely recovering character images without changing their inherent glyphs. Unlike existing frameworks, CharFormer introduces a parallel target task for capturing additional information and injecting it into the image denoising backbone, which will maintain the consistency of character glyphs during character image denoising. Moreover, we utilize attention-based networks for global-local feature interaction, which will help to deal with blind denoising and enhance denoising performance. We compare CharFormer with state-of-the-art methods on multiple datasets. The experimental results show the superiority of CharFormer quantitatively and qualitatively.

Delving into the Frequency: Temporally Consistent Human Motion Transfer in the Fourier Space

  • Guang Yang
  • Wu Liu
  • Xinchen Liu
  • Xiaoyan Gu
  • Juan Cao
  • Jintao Li

Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos that enable one person to imitate the motion of others. However, current synthetic videos suffer from temporal inconsistency in sequential frames that significantly degrades the video quality, yet is far from solved by existing methods in the pixel domain. Recently, some works on DeepFake detection try to distinguish natural and synthetic images in the frequency domain because of the frequency insufficiency of image synthesizing methods. Nonetheless, no existing work studies the temporal inconsistency of synthetic videos from the perspective of the frequency-domain gap between natural and synthetic videos. Therefore, in this paper, we propose to delve into the frequency space for temporally consistent human motion transfer. First of all, we make the first comprehensive analysis of natural and synthetic videos in the frequency domain to reveal the frequency gap in both the spatial dimension of individual frames and the temporal dimension of the video. To close the frequency gap between the natural and synthetic videos, we propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of the synthesized videos. FreMOTR explores two novel frequency-based regularization modules: 1) the Frequency-domain Appearance Regularization (FAR) to improve the appearance of the person in individual frames and 2) the Temporal Frequency Regularization (TFR) to guarantee the temporal consistency between adjacent frames. Finally, comprehensive experiments demonstrate that FreMOTR not only yields superior performance in temporal consistency metrics but also improves the frame-level visual quality of synthetic videos. In particular, the temporal consistency metrics are improved by nearly 30% compared to the state-of-the-art model.

Adaptive Affine Transformation: A Simple and Effective Operation for Spatial Misaligned Image Generation

  • Zhimeng Zhang
  • Yu Ding

One challenging problem, spatially misaligned image generation, i.e., translation between two face/pose images with large spatial deformation, is widely faced in face/pose reenactment tasks. Advanced researchers use dense flow to solve this problem. However, under a complex spatial deformation, even with carefully designed networks, intrinsic complexities make it difficult to compute an accurate dense flow, leading to distorted results. Different from those dense flow based methods, we propose one simple but effective operator named AdaAT (Adaptive Affine Transformation) to realize misaligned image generation. AdaAT simulates spatial deformation by computing hundreds of affine transformations, resulting in fewer distortions. Without computing any dense flow, AdaAT directly carries out affine transformations in feature channel spaces. Furthermore, we package several AdaAT operators into one universal AdaAT module that is used for different face/pose generation tasks. To validate the effectiveness of our AdaAT, we conduct qualitative and quantitative experiments on four common datasets in the tasks of talking face generation, face reenactment, pose transfer and person image generation. We achieve state-of-the-art results on three of them.

RCRN: Real-world Character Image Restoration Network via Skeleton Extraction

  • Daqian Shi
  • Xiaolei Diao
  • Hao Tang
  • Xiaomin Li
  • Hao Xing
  • Hao Xu

Constructing high-quality character image datasets is challenging because real-world images are often affected by image degradation. There are limitations when applying current image restoration methods to such real-world character images, since (i) the categories of noise in character images are different from those in general images; (ii) real-world character images usually contain more complex image degradation, e.g., mixed noise at different noise levels. To address these problems, we propose a real-world character restoration network (RCRN) to effectively restore degraded character images, where character skeleton information and scale-ensemble feature extraction are utilized to obtain better restoration performance. The proposed method consists of a skeleton extractor (SENet) and a character image restorer (CiRNet). SENet aims to preserve the structural consistency of the character and normalize complex noise. Then, CiRNet reconstructs clean images from degraded character images and their skeletons. Due to the lack of benchmarks for real-world character image restoration, we constructed a dataset containing 1,606 character images with real-world degradation to evaluate the validity of the proposed method. The experimental results demonstrate that RCRN outperforms state-of-the-art methods quantitatively and qualitatively.

Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation

  • Yupei Lin
  • Sen Zhang
  • Tianshui Chen
  • Yongyi Lu
  • Guangping Li
  • Yukai Shi

Unpaired image-to-image translation aims to find a mapping between the source domain and the target domain. To alleviate the problem of the lack of supervised labels for the source images, cycle-consistency based methods have been proposed for image structure preservation by assuming a reversible relationship between unpaired images. However, this assumption only uses limited correspondence between image pairs. Recently, contrastive learning (CL) has been used to further investigate the image correspondence in unpaired image translation by using patch-based positive/negative learning. Patch-based contrastive routines obtain the positives by self-similarity computation and recognize the rest of the patches as negatives. This flexible learning paradigm obtains auxiliary contextualized information at a low cost. Since the negatives are abundant in number, we investigate a natural question: are all negatives necessary for feature contrastive learning? Unlike previous CL approaches that use as many negatives as possible, in this paper, we study the negatives from an information-theoretic perspective and introduce a new negative Pruning technology for Unpaired image-to-image Translation (PUT) by sparsifying and ranking the patches. The proposed algorithm is efficient, flexible and enables the model to stably learn essential information between corresponding patches. By putting quality over quantity, only a few negative patches are required to achieve better results. Lastly, we validate the superiority, stability, and versatility of our model through comparative experiments.

Sundial-GAN: A Cascade Generative Adversarial Networks Framework for Deciphering Oracle Bone Inscriptions

  • Xiang Chang
  • Fei Chao
  • Changjing Shang
  • Qiang Shen

Oracle Bone Inscription (OBI) is an early hieroglyphic script from China and one of the most famous ancient writing systems in the world. However, only a small number of OBI characters have been fully deciphered to date. Chinese characters have taken different forms at different historical stages; it is therefore very difficult to translate OBI characters directly into modern Chinese characters, given the long evolutionary process. In this paper, we propose a cascade generative adversarial network (GAN) framework for deciphering OBI characters, named "Sundial-GAN", a cascaded structure that simulates the evolutionary process of Chinese characters from an OBI character to its potential modern Chinese character. We select four representative stages in the evolutionary process of OBI, each of which is implemented by an individual GAN structure based on the characteristics of that stage. These structures are cascaded in sequence to accurately simulate the evolutionary process of Chinese characters. For each input OBI character, Sundial-GAN can successfully generate the input's different forms at the four historical stages. Extensive experiments and comparisons demonstrate that the generated characters at each stage have high similarity to real existing characters; the proposed method can therefore significantly improve the efficiency and accuracy of OBI deciphering for archaeological researchers. Compared to direct image-to-image translation methods, our approach allows for a smoother translation process, a better grasp of details, and more effective avoidance of the random mappings in GANs.

Structure-Enhanced Pop Music Generation via Harmony-Aware Learning

  • Xueyao Zhang
  • Jinchao Zhang
  • Yao Qiu
  • Li Wang
  • Jie Zhou

Pop music generation has long been an attractive topic for both musicians and scientists. However, automatically composing pop music with a satisfactory structure remains a challenging issue. In this paper, we propose to leverage harmony-aware learning for structure-enhanced pop music generation. On the one hand, one participant of harmony, the chord, represents a harmonic set of multiple notes, which is integrated closely with the spatial structure of music, the texture. On the other hand, the other participant, the chord progression, usually accompanies the development of the music and promotes the temporal structure of music, the form. Moreover, when chords evolve into a chord progression, texture and form are naturally bridged by harmony, which contributes to the joint learning of the two structures. Furthermore, we propose the Harmony-Aware Hierarchical Music Transformer (HAT), which adaptively exploits structure from music and makes musical tokens interact hierarchically to enhance structure in multi-level musical elements. Experimental results reveal that, compared to existing methods, HAT exhibits a much better grasp of structure and also improves the quality of generated music, especially in form and texture.

Dynamic Weighted Semantic Correspondence for Few-Shot Image Generative Adaptation

  • Xingzhong Hou
  • Boxiao Liu
  • Shuai Zhang
  • Lulin Shi
  • Zite Jiang
  • Haihang You

Few-shot image generative adaptation, which finetunes well-trained generative models on limited examples, is of practical importance. The main challenge is that the few-shot model easily overfits, which can be attributed to two aspects: the lack of sample diversity for the generator and the failure of fidelity discrimination for the discriminator. In this paper, we introduce two novel methods to address diversity and fidelity respectively. Concretely, we propose dynamic weighted semantic correspondence to preserve diversity for the generator, which benefits from the richness of samples generated by source models. To prevent discriminator overfitting, we propose a coupled training paradigm across the source and target domains to preserve the feature extraction capability of the discriminator backbone. Extensive experiments show that our method significantly outperforms previous methods in both image quality and diversity.
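A minimal sketch of similarity-weighted correspondence preservation, assuming cosine similarities to a single anchor sample and a softmax over source similarities as the "dynamic weights". This is a simplification of the paper's loss, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_correspondence_loss(src_feats, tgt_feats, anchor=0, tau=0.1):
    """Preserve the source generator's pairwise similarity structure
    around one anchor sample, weighting each pair by its source-domain
    similarity (the dynamic weights)."""
    def sims(feats):
        a = feats[anchor] / np.linalg.norm(feats[anchor])
        others = np.delete(feats, anchor, axis=0)
        others = others / np.linalg.norm(others, axis=1, keepdims=True)
        return others @ a
    s_src, s_tgt = sims(src_feats), sims(tgt_feats)
    weights = softmax(s_src / tau)          # emphasise semantically close pairs
    return float(np.sum(weights * (s_src - s_tgt) ** 2))

rng = np.random.default_rng(1)
F_src = rng.normal(size=(6, 16))            # features from the source generator
loss_same = weighted_correspondence_loss(F_src, F_src.copy())
loss_diff = weighted_correspondence_loss(F_src, rng.normal(size=(6, 16)))
```

The loss vanishes when the adapted generator reproduces the source similarity structure exactly and grows as the structure drifts, which is the sense in which it keeps diversity.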

The Beauty of Repetition in Machine Composition Scenarios

  • Zhejing Hu
  • Xiao Ma
  • Yan Liu
  • Gong Chen
  • Yongxu Liu

Repetition, a basic form of artistic creation, appears in most musical works and delivers enthralling aesthetic experiences. However, repetition remains underexplored in automatic music composition. As an initial effort in repetition modelling, this paper focuses on generating motif-level repetitions via domain knowledge-based and example-based learning techniques. A novel repetition transformer (R-Transformer), which combines a Transformer encoder and a repetition-aware learner, is trained on a new repetition dataset with 584,329 samples from different categories of motif repetition. The Transformer encoder learns the representation among music notes from the repetition dataset; the novel repetition-aware learner exploits repetitions' unique characteristics based on music theory. Experiments show that, given any motif, R-Transformer can generate a large number of varied and beautiful repetitions. By ingeniously fusing these high-quality pieces, the musicality and appeal of machine-composed music are greatly improved.

CariPainter: Sketch Guided Interactive Caricature Generation

  • Xin Huang
  • Dong Liang
  • Hongrui Cai
  • Juyong Zhang
  • Jinyuan Jia

In this paper, we propose CariPainter, the first interactive caricature generation and editing method. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We solve this challenging problem by utilizing semantic segmentation maps as an intermediary domain, removing the influence of photo texture while preserving person-specific geometry features. Specifically, our proposed method consists of two main components: CariSketchNet and CariMaskGAN. CariSketchNet exaggerates the photo segmentation map to construct CariMask. Then, CariMask is converted into a caricature by CariMaskGAN. In this step, users can freely edit and adjust the geometry of the caricatures. Additionally, we propose a semantic detail pre-processing approach, which considerably increases the detail of generated images and allows modification of hair strands, wrinkles, and beards. Extensive experimental results show that our method produces higher-quality caricatures and supports easy interactive modification.

Cartoon-Flow: A Flow-Based Generative Adversarial Network for Arbitrary-Style Photo Cartoonization

  • Jieun Lee
  • Hyeonwoo Kim
  • Jonghwa Shim
  • Eenjun Hwang

Photo cartoonization aims to convert photos of real-world scenes into cartoon-style images. Recently, generative adversarial network (GAN)-based methods for photo cartoonization have been proposed to generate pleasing cartoonized images. However, as these methods can transfer only learned cartoon styles to photos, they are of limited use in general-purpose applications where unlearned styles are often required. To address this limitation, an arbitrary style transfer (AST) method that transfers arbitrary artistic styles onto content images can be used. However, conventional AST methods do not perform satisfactorily in cartoonization for two reasons. First, they cannot capture the unique characteristics of cartoons that differ from common artistic styles. Second, they suffer from content leaks, in which the semantic structure of the content is distorted. In this paper, to solve these problems, we propose a novel arbitrary-style photo cartoonization method, Cartoon-Flow. More specifically, we construct a new hybrid GAN with an invertible neural flow generator to effectively preserve content information. In addition, we introduce two new losses for cartoonization: (1) an edge-promoting smooth loss to learn the unique characteristics of cartoons with smooth surfaces and clear edges, and (2) a line loss to mimic the line drawing of cartoons. Extensive experiments demonstrate that the proposed method outperforms previous methods both quantitatively and qualitatively.

SESSION: Oral Session VI: Experience -- Multimedia Applications

Span-based Audio-Visual Localization

  • Yiling Wu
  • Xinfeng Zhang
  • Yaowei Wang
  • Qingming Huang

This paper focuses on the audio-visual event localization task, which aims to match the visible and audible components of a video to identify the event of interest. Existing methods largely ignore the continuity of audio-visual events and treat each segment independently, either classifying its event category or computing its event-relevance score in isolation. However, events in video are often continuous and span several segments. Motivated by this, we propose a span-based framework that considers consecutive segments jointly. The span-based framework handles the audio-visual localization task by predicting the event class and extracting the event span. Specifically, a [CLS] token collects global information via self-attention to predict the event class, while relevance scores and positional embeddings are fed into a span predictor that estimates the start and end boundaries of the event. Multi-modal Mixup is further used to improve the robustness and generalization of the model. Experiments conducted on the AVE dataset demonstrate that the proposed method outperforms state-of-the-art methods.
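The span-prediction step can be illustrated with a brute-force search over start/end boundary scores. The logits below are made up for illustration; the sketch only shows how a contiguous event span is read off two boundary distributions:

```python
import numpy as np

def best_span(start_logits, end_logits):
    """Return the (start, end) segment pair with the highest combined
    boundary score, subject to start <= end."""
    n = len(start_logits)
    best, best_score = (0, 0), -np.inf
    for s in range(n):
        for e in range(s, n):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Ten one-second segments; hypothetical logits peak at segments 3 and 6.
start = np.array([-2, -1, 0, 4, 1, 0, -1, -2, -2, -3], dtype=float)
end = np.array([-3, -2, -1, 0, 1, 2, 5, 0, -1, -2], dtype=float)
print(best_span(start, end))   # -> (3, 6)
```

Predicting one span per video is what enforces the continuity that per-segment classification cannot.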

PC-Dance: Posture-controllable Music-driven Dance Synthesis

  • Jibin Gao
  • Junfu Pu
  • Honglun Zhang
  • Ying Shan
  • Wei-Shi Zheng

Music-driven dance synthesis is the task of generating high-quality dance for user-supplied music, with promising entertainment applications. However, most existing methods provide no efficient and effective way for user intervention in dance generation, e.g., posture control. In this work, we propose a powerful framework named PC-Dance for adaptive posture-controllable music-driven dance synthesis. Consisting of a music-to-dance alignment embedding network (M2D-Align) and a posture-controllable dance synthesis module (PC-Syn), PC-Dance efficiently supports fine-grained control through input anchor poses without artist participation. Specifically, to reduce the cost of artist participation while still generating high-quality dance efficiently, a self-supervised rhythm alignment module is designed to learn the music-to-dance alignment embedding. For PC-Syn, we introduce an efficient scheme for adaptive motion graph construction (AMGC), which improves the efficiency of graph-based optimization while preserving the diversity of motions. Since few related public datasets exist, we collect an MMD-ARC dataset for music-driven dance synthesis. Experimental results on MMD-ARC demonstrate the effectiveness of our framework and the feasibility of dance synthesis with adaptive posture control.

Delving Globally into Texture and Structure for Image Inpainting

  • Haipeng Liu
  • Yang Wang
  • Meng Wang
  • Yong Rui

Image inpainting has achieved remarkable progress and inspired abundant methods, where the critical bottleneck is how to fill the masked regions with semantically consistent structure and texture information. Deep models are powerful at capturing such information, yet are typically constrained to local spatial regions. In this paper, we delve globally into texture and structure information to better capture semantics for image inpainting. Unlike existing methods confined to independent local patches, the texture information of each patch is reconstructed from all other patches across the whole image, so as to match the coarsely filled information, especially the structure information over the masked regions. Unlike current decoder-only, pixel-level transformers for image inpainting, our model adopts a transformer pipeline with both an encoder and a decoder. On one hand, the encoder captures the texture semantic correlations of all patches across the image via a self-attention module. On the other hand, an adaptive patch vocabulary is dynamically established in the decoder for the filled patches over the masked regions. Building on this, a structure-texture matching attention module anchored on the known regions marries the best of these two worlds for progressive inpainting via a probabilistic diffusion process. From the perspective of texture and structure information, our model is orthogonal to popular architectures for image inpainting such as Convolutional Neural Networks (CNNs), attention, and transformer models. Extensive experiments on the benchmarks validate its superiority. Our code is available here
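In its simplest form, reconstructing a masked patch from all other patches, rather than a local neighbourhood, reduces to an attention-weighted sum. The sketch below is an illustrative simplification (a single query patch, cosine attention, made-up features), not the paper's full encoder-decoder:

```python
import numpy as np

def reconstruct_patch(masked_q, known_feats, tau=0.5):
    """Fill one masked patch as the attention-weighted sum of ALL known
    patch features across the image (global, not local, context)."""
    k = known_feats / np.linalg.norm(known_feats, axis=1, keepdims=True)
    q = masked_q / np.linalg.norm(masked_q)
    attn = np.exp((k @ q) / tau)
    attn /= attn.sum()                      # softmax over every known patch
    return attn @ known_feats

# Three known patch features; the query resembles the first two.
known = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
filled = reconstruct_patch(np.array([1.0, 0.0]), known)
```

Because the softmax runs over every known patch, distant but semantically similar regions contribute to the fill, which is the "global" aspect the abstract emphasizes.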

Rethinking Open-World Object Detection in Autonomous Driving Scenarios

  • Zeyu Ma
  • Yang Yang
  • Guoqing Wang
  • Xing Xu
  • Heng Tao Shen
  • Mingxing Zhang

Existing object detection models can successfully discriminate and localize predefined object categories under seen or similar conditions. However, open-world object detection, as required by autonomous driving perception systems, refers to recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen categories. On the other hand, domain differences across scenarios make it necessary to also account for the domain gap by aligning the sample or label distribution. To resolve these two challenges simultaneously, we first design a pre-training model that formulates mappings between visual images and semantic embeddings from extra annotations, linking the seen and unseen object categories in a self-supervised manner. Within this formulation, domain adaptation is then utilized to extract domain-agnostic feature representations and alleviate the misdetection of unseen objects caused by domain appearance changes. As a result, our formulation addresses the more realistic and practical open-world object detection problem: it detects unseen categories from unseen domains without any bounding box annotations, with no obvious performance drop in detecting the seen categories. We are the first to formulate a unified model for this open-world task, and we establish new state-of-the-art performance for this challenge.

MVLayoutNet: 3D Layout Reconstruction with Multi-view Panoramas

  • Zhihua Hu
  • Bo Duan
  • Yanfeng Zhang
  • Mingwei Sun
  • Jingwei Huang

We present MVLayoutNet, a network for holistic 3D reconstruction from multi-view panoramas. Our core contribution is to seamlessly combine learned monocular layout estimation and multi-view stereo (MVS) for accurate layout reconstruction in both 3D and image space. We jointly train a layout module to produce an initial layout and a novel MVS module to obtain accurate layout geometry. Unlike standard MVSNet, our MVS module takes a newly-proposed layout cost volume, which aggregates multi-view costs at the same depth layer into corresponding layout elements. We additionally provide an attention-based scheme that guides the MVS module to focus on structural regions. Such a design considers both local pixel-level costs and global holistic information for better reconstruction. Experiments show that our method outperforms the state of the art in depth RMSE by 21.7% and 41.2% on the 2D-3D-S [1] and ZInD [4] datasets, respectively. For complex scenes with multiple rooms, our method can be applied to each layout element of a precomputed topology to accurately reconstruct a globally coherent layout geometry.

Wavelet-enhanced Weakly Supervised Local Feature Learning for Face Forgery Detection

  • Jiaming Li
  • Hongtao Xie
  • Lingyun Yu
  • Yongdong Zhang

Face forgery detection is receiving increasing attention due to the security threats posed by forged faces. Recently, local patch-based approaches have achieved sound results thanks to their effective attention to local details. However, notable problems remain: a) local feature learning requires patch-level labels to circumvent label noise, which is impractical in real-world scenarios; b) the commonly used DCT (FFT) transform discards all spatial information, which makes handling local details difficult. To overcome these limitations, a novel wavelet-enhanced weakly supervised local feature learning framework is proposed in this paper. Specifically, to supervise the learning of local features with only image-level labels, two modules are devised based on the idea of multi-instance learning: a local relation constraint module (LRCM) and a category knowledge-guided local feature aggregation module (CKLFA). LRCM constrains the maximum distance between local features of forged face images to be greater than that of real face images. CKLFA adaptively aggregates local features based on their correlation with a global embedding containing global category information. Combining these two modules, the network is encouraged to learn discriminative local features supervised only by image-level labels. Besides, a multi-level wavelet-powered feature enhancement module is developed to help the network mine local forgery artifacts from the spatio-frequency domain, which benefits the learning of discriminative local features. Extensive experiments show that our approach outperforms previous state-of-the-art methods when only image-level labels are available, and achieves comparable or even better performance than counterparts using patch-level labels.

ADGNet: Attention Discrepancy Guided Deep Neural Network for Blind Image Quality Assessment

  • Xiaoyu Ma
  • Yaqi Wang
  • Chang Liu
  • Suiyu Zhang
  • Dingguo Yu

This work explores how to efficiently incorporate semantic knowledge into blind image quality assessment and proposes an end-to-end attention discrepancy guided deep neural network for perceptual quality assessment. Our method is built on a multi-task learning framework in which two sub-tasks, semantic recognition and image quality prediction, are jointly optimized with a shared feature-extracting branch and independent spatial-attention branches. The discrepancy between semantic-aware attention and quality-aware attention is leveraged to refine the quality predictions. The proposed ADGNet is based on the observation that the human visual system exhibits different mechanisms when viewing images with different amounts of distortion. This manifests as variation in the attention discrepancy between the quality branch and the semantic branch, which is therefore employed to enhance the accuracy and generalization ability of our method. We systematically study the major components of our framework, and experimental results on both authentically and synthetically distorted image quality datasets demonstrate the superiority of our model compared to state-of-the-art approaches.

Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

  • Jingjing Wu
  • Pengyuan Lyu
  • Guangming Lu
  • Chengquan Zhang
  • Kun Yao
  • Wenjie Pei

Typical text spotters follow a two-stage spotting strategy: first detect the precise boundary of a text instance, then perform text recognition within the located text region. While this strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, allowing errors to propagate from detection to recognition. 2) The RoI cropping which bridges detection and recognition introduces background noise and causes information loss when pooling or interpolating from feature maps. In this work we propose the single-shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them via shared positive anchor points. Consequently, our method can recognize text instances correctly even when the precise text boundaries are challenging to detect. Additionally, our method substantially reduces the annotation cost for text detection. Extensive experiments on regular-shaped and arbitrary-shaped benchmarks demonstrate that SRSTS compares favorably to previous state-of-the-art spotters in both accuracy and efficiency.

Real-World Blind Super-Resolution via Feature Matching with Implicit High-Resolution Priors

  • Chaofeng Chen
  • Xinyu Shi
  • Yipeng Qin
  • Xiaoming Li
  • Xiaoguang Han
  • Tao Yang
  • Shihui Guo

A key challenge of real-world image super-resolution (SR) is to recover the missing details in low-resolution (LR) images with complex unknown degradations (e.g., downsampling, noise and compression). Most previous works restore such missing details in the image space. To cope with the high diversity of natural images, they either rely on unstable GANs that are difficult to train and prone to artifacts, or resort to explicit references from high-resolution (HR) images that are usually unavailable. In this work, we propose Feature Matching SR (FeMaSR), which restores realistic HR images in a much more compact feature space. Unlike image-space methods, our FeMaSR restores HR images by matching distorted LR image features to their distortion-free HR counterparts in our pretrained HR priors, and decoding the matched features to obtain realistic HR images. Specifically, our HR priors contain a discrete feature codebook and its associated decoder, which are pretrained on HR images with a Vector Quantized Generative Adversarial Network (VQGAN). Notably, we incorporate a novel semantic regularization in VQGAN to improve the quality of reconstructed images. For the feature matching, we first extract LR features with an LR encoder consisting of several Swin Transformer blocks and then follow a simple nearest neighbour strategy to match them with the pretrained codebook. In particular, we equip the LR encoder with residual shortcut connections to the decoder, which is critical to the optimization of the feature matching loss and also helps to compensate for possible feature matching errors. Experimental results show that our approach produces more realistic HR images than previous methods. Code will be made publicly available.
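The nearest-neighbour matching step can be sketched as plain vector quantization against a feature codebook; the tiny 2-D codebook and features below are made up purely for illustration:

```python
import numpy as np

def match_to_codebook(features, codebook):
    """Replace each feature vector by its nearest codebook entry
    (L2 distance): the quantization step before decoding."""
    # (N, 1, D) - (1, K, D) -> pairwise distances of shape (N, K)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # K=3 toy entries
feats = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])      # "LR" features
quantized, idx = match_to_codebook(feats, codebook)
print(idx)   # -> [1 2 0]
```

Because the decoder only ever sees codebook entries learned from clean HR images, degradations in the LR features are snapped away at this step, which is the intuition behind restoring in feature space.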

Leveraging GAN Priors for Few-Shot Part Segmentation

  • Mengya Han
  • Heliang Zheng
  • Chaoyue Wang
  • Yong Luo
  • Han Hu
  • Bo Du

Few-shot part segmentation aims to separate different parts of an object given only a few annotated samples. Due to the challenge of limited data, existing works mainly focus on learning classifiers over pre-trained features, failing to learn task-specific features for part segmentation. In this paper, we propose to learn task-specific features in a "pre-training"-"fine-tuning" paradigm. We conduct prompt designing to reduce the gap between the pre-training task (i.e., image generation) and the downstream task (i.e., part segmentation), so that the GAN priors for generation can be leveraged for segmentation. This is achieved by projecting part segmentation maps into the RGB space and interpolating between the RGB segmentation maps and the original images. Specifically, we design a fine-tuning strategy that progressively tunes an image generator into a segmentation generator, where the supervision of the generator varies from images to segmentation maps via interpolation. Moreover, we propose a two-stream architecture, i.e., a segmentation stream to generate task-specific features, and an image stream to provide spatial constraints. The image stream can be regarded as a self-supervised auto-encoder, which enables our model to benefit from large-scale support images. Overall, this work is an attempt to explore, through prompt designing, the internal relevance between generation tasks and perception tasks. Extensive experiments show that our model achieves state-of-the-art performance on several part segmentation datasets.

MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning

  • Bo Fang
  • Wenhao Wu
  • Chang Liu
  • Yu Zhou
  • Dongliang He
  • Weipinng Wang

Contrastive self-supervised learning (CSL) has remarkably promoted the progress of visual representation learning. However, existing video CSL methods mainly focus on clip-level temporal semantic consistency; the temporal and spatial semantic correspondence across different granularities, i.e., the video, clip, and frame levels, is typically overlooked. To tackle this issue, we propose a self-supervised Macro-to-Micro Semantic Correspondence (MaMiCo) learning framework that pursues fine-grained spatiotemporal representations from a macro-to-micro perspective. Specifically, MaMiCo constructs a multi-branch architecture of T-MaMiCo and S-MaMiCo on a temporally-nested clip pyramid (video-to-frame). On the pyramid, T-MaMiCo targets temporal correspondence by simultaneously assimilating semantically invariant representations and retaining appearance dynamics over long temporal ranges. For spatial correspondence, S-MaMiCo perceives subtle motion cues by ameliorating dense CSL for videos, where stationary clips serve as a stable reference for dense contrasting to alleviate the semantic inconsistency caused by "mismatching". Extensive experiments justify that MaMiCo learns rich general video representations and works well on various downstream tasks, e.g., (fine-grained) action recognition, action localization, and video retrieval.

ChebyLighter: Optimal Curve Estimation for Low-light Image Enhancement

  • Jinwang Pan
  • Deming Zhai
  • Yuanchao Bai
  • Junjun Jiang
  • Debin Zhao
  • Xianming Liu

Low-light enhancement aims to recover a high-contrast normal-light image from a low-light image with bad exposure and low contrast. Inspired by curve adjustment in photo editing software and Chebyshev approximation, this paper presents a novel model for brightening low-light images. The proposed model, ChebyLighter, learns to recurrently estimate pixel-wise adjustment curves for a low-light image to reconstruct an enhanced output. In ChebyLighter, Chebyshev image series are first generated. Then pixel-wise coefficient matrices are estimated with Triple Coefficient Estimation (TCE) modules, and the final enhanced image is recurrently reconstructed by Chebyshev Attention Weighted Summation (CAWS). The TCE module is specifically designed based on a dual attention mechanism with three necessary inputs. Our method can achieve ideal performance because the adjustment curves are obtained by numerical approximation. With extensive quantitative and qualitative experiments on diverse test images, we demonstrate that the proposed method performs favorably against state-of-the-art low-light image enhancement algorithms.
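The curve-estimation idea, pixel-wise coefficients weighting a Chebyshev series of the input, can be sketched as follows. This is a simplified stand-in: the real model predicts the coefficient maps with TCE modules, whereas here they are hand-set:

```python
import numpy as np

def chebyshev_enhance(img, coeffs):
    """Enhance an image as a pixel-wise weighted sum of Chebyshev
    polynomials T_k(img); coeffs[k] is the per-pixel weight map for T_k."""
    x = 2.0 * img - 1.0                  # map [0, 1] intensities to [-1, 1]
    t_prev, t_cur = np.ones_like(x), x
    out = coeffs[0] * t_prev + coeffs[1] * t_cur
    for k in range(2, len(coeffs)):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev   # T_k recurrence
        out = out + coeffs[k] * t_cur
    return np.clip(out, 0.0, 1.0)

img = np.full((2, 2), 0.2)                       # a uniformly dark toy image
coeffs = np.stack([np.full((2, 2), 0.6),         # constant term lifts brightness
                   np.full((2, 2), 0.3),
                   np.full((2, 2), 0.0)])
enhanced = chebyshev_enhance(img, coeffs)
```

With these hand-set coefficients, every 0.2-valued pixel maps to 0.42, i.e. the per-pixel polynomial acts as a brightening curve; learning the coefficient maps lets each pixel receive its own curve.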

Bayesian based Re-parameterization for DNN Model Pruning

  • Xiaotong Lu
  • Teng Xi
  • Baopu Li
  • Gang Zhang
  • Weisheng Dong
  • Guangming Shi

Filter pruning, an effective strategy for obtaining efficient compact structures from over-parameterized deep neural networks (DNNs), has attracted much attention. Previous pruning methods select channels for pruning by developing various criteria, yet little attention has been devoted to whether these criteria capture the correlations between channels. Meanwhile, most existing methods simply discard the pruned parameters and only perform additional training on the retained network to reduce the accuracy loss. In this paper, we present a novel perspective of re-parametric pruning via Bayesian estimation. First, we estimate the probability distribution of different channels based on Bayesian estimation and measure the importance of each channel by the discrepancy between the distributions before and after pruning it. Second, to minimize the change in distribution after pruning, we re-parameterize the pruned network based on the probability distribution to pursue optimal pruning. We evaluate our approach on popular datasets with several typical network architectures, and comprehensive experimental results validate that our method performs better than state-of-the-art approaches.
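The notion of scoring a channel by the distribution discrepancy its removal causes can be illustrated with 1-D Gaussians fitted to a layer's summed response. This KL-divergence toy is an assumption-laden simplification of the paper's Bayesian estimation, and all names are hypothetical:

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for 1-D Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def channel_importance(acts):
    """Score each channel by how far removing it shifts the Gaussian
    fitted to the layer's summed response."""
    total = acts.sum(axis=1)
    mu0, var0 = total.mean(), total.var() + 1e-8
    scores = []
    for c in range(acts.shape[1]):
        pruned = total - acts[:, c]          # response without channel c
        mu1, var1 = pruned.mean(), pruned.var() + 1e-8
        scores.append(gaussian_kl(mu0, var0, mu1, var1))
    return np.array(scores)

rng = np.random.default_rng(0)
acts = np.stack([3.0 * rng.normal(size=1000),   # high-variance channel
                 0.1 * rng.normal(size=1000),   # near-constant channel
                 1.0 * rng.normal(size=1000)], axis=1)
scores = channel_importance(acts)
```

Removing the high-variance channel perturbs the fitted distribution most, so it receives the highest score; the near-constant channel is the safest to prune.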

ReCoRo: Region-Controllable Robust Light Enhancement with User-Specified Imprecise Masks

  • Dejia Xu
  • Hayk Poghosyan
  • Shant Navasardyan
  • Yifan Jiang
  • Humphrey Shi
  • Zhangyang Wang

Low-light enhancement is an increasingly important function in image editing and visual creation. Most existing enhancing algorithms are trained to enlighten a given image in a globally homogeneous way, and (implicitly) to some predefined extent of brightness. They are neither capable of enhancing only local regions of interest ("where") while keeping the overall visual appearance plausible, nor of producing outputs at a range of different illumination levels ("how much"). Those hurdles significantly limit the prospect of flexible, customizable, or even user-interactive low-light enhancement. To address these gaps, we propose <u>Re</u>gion-<u>Co</u>ntrollable <u>Ro</u>bust Light Enhancement (ReCoRo), a novel framework that allows users to directly specify "where" and "how much" they want to enhance in an input low-light image; meanwhile, the model learns, via a discriminator, to intelligently maintain an overall consistent visual appearance and plausible composition. Moreover, since in practical mobile apps such user specifications often come in imprecise forms (e.g., finger-drawn masks), we propose to bake domain-specific data augmentations into the training of ReCoRo, so that the learned model gains resilience to various roughly-supplied user masks. To the best of our knowledge, ReCoRo is the first of its kind to let the user localize the enlightened region as well as control the light intensity. Extensive experiments clearly demonstrate that ReCoRo outperforms state-of-the-art methods in terms of qualitative results, quantitative metrics, and versatile controllability. Project repository:

Domain-Specific Fusion Of Objective Video Quality Metrics

  • Aaron Chadha
  • Ioannis Katsavounidis
  • Ayan Kumar Bhunia
  • Cosmin Stejerean
  • Mohammad Umar Karim Khan
  • Yiannis Andreopoulos

Video processing algorithms like video upscaling, denoising, and compression are now increasingly optimized for perceptual quality metrics instead of signal distortion. This means that they may score well on metrics like video multi-method assessment fusion (VMAF), but this may be due to metric overfitting. This imposes the need for costly subjective quality assessments that cannot scale to large datasets and large parameter explorations. We propose a methodology that fuses multiple quality metrics based on small-scale subjective testing in order to unlock their use at scale for specific application domains of interest. This is achieved by pseudo-random sampling of the available resolutions, quality range, and test video content, initially guided by quality metrics in order to cover the quality range useful to each application. The selected samples then undergo a subjective test, such as ITU-T P.910 absolute categorical rating; the results of the test are postprocessed and used to derive the best combination of multiple objective metrics via support vector regression. We showcase the benefits of this approach in two applications: video encoding with and without perceptual preprocessing, and deep video denoising & upscaling of compressed content. For both applications, the derived fusion of metrics aligns more robustly with mean opinion scores than a perceptually-uninformed combination of the original metrics. The dataset and code are available at
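The fusion step can be sketched as a regression from per-clip objective metric scores to subjective MOS. Ordinary least squares stands in below for the support vector regression actually used, and all scores are made up:

```python
import numpy as np

# Objective scores per clip (columns: e.g. VMAF, PSNR, SSIM, rescaled to [0, 1])
metrics = np.array([[0.9, 0.8, 0.85],
                    [0.6, 0.7, 0.65],
                    [0.3, 0.4, 0.35],
                    [0.8, 0.6, 0.75]])
mos = np.array([4.5, 3.2, 1.8, 3.9])    # mean opinion scores from a small test

# Fit weights and bias so that metrics @ w + b approximates MOS; ordinary
# least squares stands in for the support vector regression used in the paper.
X = np.hstack([metrics, np.ones((len(metrics), 1))])
coef, *_ = np.linalg.lstsq(X, mos, rcond=None)
fused = X @ coef                         # the fused quality predictor
```

Once the weights are fitted on the small subjective test, the fused predictor can score arbitrarily many clips from the same application domain at no extra subjective-testing cost.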

Learning for Motion Deblurring with Hybrid Frames and Events

  • Wen Yang
  • Jinjian Wu
  • Jupo Ma
  • Leida Li
  • Weisheng Dong
  • Guangming Shi

An event camera responds to brightness changes at each pixel independently with microsecond accuracy. Event cameras offer the attractive property of recording high-speed scenes well while ignoring static, non-moving areas, whereas conventional frame cameras acquire the full intensity information of the scene but suffer from motion blur. Therefore, it is desirable to combine the best of the two cameras to reconstruct high-quality intensity frames with no motion blur. The human visual system follows a two-pathway procedure for non-action-based representation and object-motion perception, which corresponds well to the hybrid frame and event data. In this paper, inspired by the two-pathway visual system, a novel dual-stream framework is proposed for motion deblurring (DS-Deblur), which flexibly exploits the respective advantages of frames and events. A complementary-unique information splitting based feature fusion module is first proposed to adaptively aggregate frame and event features progressively at multiple levels, well-grounded in the hierarchical processing of the two-pathway visual system. Then, a recurrent spatio-temporal feature transformation module is designed to exploit relevant information between adjacent frames, in which features of both the current and previous frames are transformed in a global-local manner. Extensive experiments on both synthetic and real motion blur datasets demonstrate that our method achieves state-of-the-art performance. Project website:

Bidirectional Self-Training with Multiple Anisotropic Prototypes for Domain Adaptive Semantic Segmentation

  • Yulei Lu
  • Yawei Luo
  • Li Zhang
  • Zheyang Li
  • Yi Yang
  • Jun Xiao

A thriving trend for domain adaptive segmentation endeavors to generate high-quality pseudo labels for the target domain and retrain the segmentor on them. Under this self-training paradigm, some competitive methods have turned to latent-space information, establishing the feature centroids (a.k.a. prototypes) of the semantic classes and determining the pseudo label candidates by their distances from these centroids. In this paper, we argue that the latent space contains more information to be exploited, and thus we take one step further to capitalize on it. Firstly, instead of merely using the source-domain prototypes to determine the target pseudo labels as most traditional methods do, we bidirectionally produce target-domain prototypes to down-weight those source features which might be too hard or disturbed for the adaptation. Secondly, existing attempts simply model each category as a single, isotropic prototype while ignoring the variance of the feature distribution, which can lead to the confusion of similar categories. To cope with this issue, we propose to represent each category with multiple, anisotropic prototypes via a Gaussian Mixture Model, in order to fit the de facto distribution of the source domain and estimate the likelihood of target samples based on the probability density. We apply our method to the GTA5->Cityscapes and Synthia->Cityscapes tasks and achieve 61.2% and 62.8% respectively in terms of mean IoU, substantially outperforming other competitive self-training methods. Notably, in some categories which severely suffer from categorical confusion, such as "truck" and "bus", our method achieves 56.4% and 68.8% respectively, which further demonstrates the effectiveness of our design. The code and model are available at
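The per-class mixture scoring described above can be illustrated with a small numpy sketch. Here the mixture parameters are assumed to be already fitted (the paper estimates them with a Gaussian Mixture Model on source features), and the likelihood threshold is a hypothetical choice:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a full-covariance (anisotropic) Gaussian at rows of x."""
    d = mean.shape[0]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # Quadratic form diff_i^T inv diff_i for every row i.
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def mixture_likelihood(x, weights, means, covs):
    """Likelihood of samples under a class-specific Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))

def pseudo_label(x, class_mixtures, threshold=1e-4):
    """Assign each sample to the class whose mixture gives the highest
    likelihood; samples below `threshold` stay unlabeled (-1)."""
    liks = np.stack([mixture_likelihood(x, *mix) for mix in class_mixtures],
                    axis=1)
    labels = liks.argmax(axis=1)
    labels[liks.max(axis=1) < threshold] = -1
    return labels

# Two toy classes, one component each; the third sample is far from both.
mix0 = ([1.0], [np.zeros(2)], [np.eye(2)])
mix1 = ([1.0], [np.full(2, 5.0)], [np.eye(2)])
feats = np.array([[0.1, 0.0], [5.0, 5.2], [20.0, 20.0]])
labels = pseudo_label(feats, [mix0, mix1])
```

The thresholding step mirrors the idea of keeping only confident pseudo-label candidates; the multi-component, full-covariance form is what lets similar categories with elongated feature distributions stay separable.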

Semi-supervised Crowd Counting via Density Agency

  • Hui Lin
  • Zhiheng Ma
  • Xiaopeng Hong
  • Yaowei Wang
  • Zhou Su

In this paper, we propose a new agency-guided semi-supervised counting approach. First, we build a learnable auxiliary structure, namely the density agency, to bring the recognized foreground regional features close to the corresponding density sub-classes (agents) and push away background ones. Second, we propose a density-guided contrastive learning loss to consolidate the backbone feature extractor. Third, we build a regression head using a transformer structure to further refine the foreground features. Finally, an efficient noise depression loss is provided to minimize the negative influence of annotation noises. Extensive experiments on four challenging crowd counting datasets demonstrate that our method outperforms state-of-the-art semi-supervised counting methods by a large margin. The code is available at

AEDNet: Asynchronous Event Denoising with Spatial-Temporal Correlation among Irregular Data

  • Huachen Fang
  • Jinjian Wu
  • Leida Li
  • Junhui Hou
  • Weisheng Dong
  • Guangming Shi

The Dynamic Vision Sensor (DVS) is a compelling neuromorphic camera compared to conventional cameras, but it suffers from much more severe noise. Due to its irregular format and asynchronous readout, DVS data is usually transformed into a regular tensor (e.g., a 3D voxel grid or an image) for deep learning methods, which corrupts its asynchronous nature. To preserve this asynchronous nature, we establish an innovative asynchronous event denoising neural network, named AEDNet, which directly consumes the correlation of the irregular signal in the spatial-temporal range without destroying its original structural property. Based on the properties of continuity in the temporal domain and discreteness in the spatial domain, we decompose the DVS signal into two parts, i.e., temporal correlation and spatial affinity, and process these two parts separately. Our spatial feature embedding unit is a unique feature extraction module that extracts features at the event level, which faithfully preserves their spatial-temporal correlation. To test effectiveness, we build a novel dataset named DVSCLEAN containing both simulated and real-world data. The experimental results show that AEDNet achieves state-of-the-art performance.

Learnability Enhancement for Low-light Raw Denoising: Where Paired Real Data Meets Noise Modeling

  • Hansen Feng
  • Lizhi Wang
  • Yuzhi Wang
  • Hua Huang

Low-light raw denoising is an important and valuable task in computational photography where learning-based methods trained with paired real data are mainstream. However, the limited data volume and complicated noise distribution have constituted a learnability bottleneck for paired real data, which limits the denoising performance of learning-based methods. To address this issue, we present a learnability enhancement strategy to reform paired real data according to noise modeling. Our strategy consists of two efficient techniques: shot noise augmentation (SNA) and dark shading correction (DSC). Through noise model decoupling, SNA improves the precision of data mapping by increasing the data volume and DSC reduces the complexity of data mapping by reducing the noise complexity. Extensive results on the public datasets and real imaging scenarios collectively demonstrate the state-of-the-art performance of our method.
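As a rough illustration of the shot-noise side of this idea: shot noise is Poisson-distributed in photo-electron units, so extra shot noise can be injected by re-sampling a Poisson draw at a boosted level and rescaling. The sketch below is not the paper's SNA formulation; the gain handling and the `extra_ratio` parameterization are assumptions made for illustration only:

```python
import numpy as np

def shot_noise_augment(raw, gain, extra_ratio, rng):
    """Inject extra Poisson shot noise into a raw frame (rough sketch).

    `raw` is converted to photo-electron units by dividing out the system
    `gain`; a Poisson draw at (1 + extra_ratio) times the signal level is
    rescaled back so the expected value is preserved while the relative
    noise changes (hypothetical parameterization, not the paper's).
    """
    electrons = raw / gain
    boosted = rng.poisson(electrons * (1.0 + extra_ratio)) / (1.0 + extra_ratio)
    return boosted * gain

# Flat 100-DN frame at gain 2.0: the mean is preserved on average.
rng = np.random.default_rng(1)
frame = np.full((64, 64), 100.0)
noisy = shot_noise_augment(frame, gain=2.0, extra_ratio=0.5, rng=rng)
```

The point of decoupling the noise model is visible even in this toy: the shot component can be re-sampled arbitrarily many times from a single capture, which is how SNA grows the effective data volume.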

Multi-Modal Experience Inspired AI Creation

  • Qian Cao
  • Xu Chen
  • Ruihua Song
  • Hao Jiang
  • Guang Yang
  • Zhao Cao

AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences. More specifically, we study how to generate texts based on sequential multi-modal information. Compared with previous works, this task is much more difficult because the designed model has to understand and adapt the semantics among different modalities well and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments comparing our model with a series of representative baselines, demonstrating significant improvements on both automatic and human-centered metrics. The code and data are available at:

Factorized and Controllable Neural Re-Rendering of Outdoor Scene for Photo Extrapolation

  • Boming Zhao
  • Bangbang Yang
  • Zhenyang Li
  • Zuoyue Li
  • Guofeng Zhang
  • Jiashu Zhao
  • Dawei Yin
  • Zhaopeng Cui
  • Hujun Bao

Expanding an existing tourist photo from a partially captured scene to a full scene is one of the desired experiences for photography applications. Although photo extrapolation has been well studied, it is much more challenging to extrapolate a photo (i.e., a selfie) from a narrow field of view to a wider one while maintaining a similar visual style. In this paper, we propose a factorized neural re-rendering model to produce photorealistic novel views from cluttered outdoor Internet photo collections, which enables applications including controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Specifically, we first develop a novel factorized re-rendering pipeline to handle the ambiguity in the decomposition of geometry, appearance and illumination. We also propose a composited training strategy to tackle the unexpected occlusion in Internet images. Moreover, to enhance photo-realism when extrapolating tourist photographs, we propose a novel realism augmentation process to complement appearance details, which automatically propagates the texture details from a narrowly captured photo to the extrapolated neural rendered image. The experiments and photo editing examples on outdoor scenes demonstrate the superior performance of our proposed method in both photo-realism and downstream applications. Code and the supplementary material are available on the project webpage:

On Generating Identifiable Virtual Faces

  • Zhuowen Yuan
  • Zhengxin You
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang
  • Alex Kot

Face anonymization with generative models has become increasingly prevalent since it sanitizes private information by generating virtual face images, ensuring both privacy and image utility. Such virtual face images are usually not identifiable after the removal or protection of the original identity. In this paper, we formalize and tackle the problem of generating identifiable virtual face images. Our virtual face images are visually different from the original ones for privacy protection. In addition, they are bound with new virtual identities, which can be directly used for face recognition. We propose an Identifiable Virtual Face Generator (IVFG) to generate the virtual face images. The IVFG projects the latent vectors of the original face images into virtual ones according to a user-specific key, based on which the virtual face images are generated. To make the virtual face images identifiable, we propose a multi-task learning objective as well as a triplet-styled training strategy to learn the IVFG. We evaluate the performance of our virtual face images using different face recognizers on different face image datasets, all of which demonstrate the effectiveness of the IVFG for generating identifiable virtual face images.

Keyword Spotting in the Homomorphic Encrypted Domain Using Deep Complex-Valued CNN

  • Peijia Zheng
  • Zhiwei Cai
  • Huicong Zeng
  • Jiwu Huang

In this paper, we propose a non-interactive scheme to achieve end-to-end keyword spotting in the homomorphic encrypted domain using deep learning techniques. We carefully designed a complex-valued convolutional neural network (CNN) structure for the encrypted domain keyword spotting to take full advantage of the limited multiplicative depth. At the same depth, the proposed complex-valued CNN can learn more speech representations than the real-valued CNN, thus achieving higher accuracy in keyword spotting. The complex activation function of the complex-valued CNN is non-arithmetic and cannot be supported by homomorphic encryption. To implement the complex activation function in the encrypted domain without interaction, we design methods to approximate complex activation functions with low-degree polynomials while preserving the keyword spotting performance. Our scheme supports single-instruction multiple-data (SIMD), which reduces the total size of ciphertexts and improves computational efficiency. We conducted extensive experiments to investigate our performance with various metrics, such as accuracy, robustness, and F1-score. The experimental results show that our approach significantly outperforms the state-of-the-art solutions on every metric.
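The polynomial-approximation step described above can be sketched generically: fit a low-degree polynomial to an activation by least squares over the expected input range, so that only additions and multiplications need to be evaluated under encryption. The activation (tanh), degree, and fitting range below are illustrative stand-ins, not the paper's complex-valued activation:

```python
import numpy as np

def poly_approx(fn, degree, lo=-4.0, hi=4.0, num=2001):
    """Least-squares fit of a low-degree polynomial to an activation over
    [lo, hi] -- a standard trick for making non-arithmetic activations
    HE-friendly, since homomorphic schemes evaluate only additions and
    multiplications. Returns coefficients, highest degree first.
    """
    x = np.linspace(lo, hi, num)
    return np.polyfit(x, fn(x), degree)

# Degree-3 approximation of tanh (illustrative choice of activation).
coeffs = poly_approx(np.tanh, 3)
x = np.linspace(-1, 1, 11)
err = np.max(np.abs(np.polyval(coeffs, x) - np.tanh(x)))
```

In an actual encrypted evaluation, `np.polyval` would be replaced by homomorphic additions and multiplications on ciphertexts, with the degree kept low because each multiplication consumes multiplicative depth.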

Cycle-Interactive Generative Adversarial Network for Robust Unsupervised Low-Light Enhancement

  • Zhangkai Ni
  • Wenhan Yang
  • Hanli Wang
  • Shiqi Wang
  • Lin Ma
  • Sam Kwong

Free from the fundamental limitation of fitting to paired training data, recent unsupervised low-light enhancement methods excel at adjusting the illumination and contrast of images. However, for unsupervised low-light enhancement, the remaining noise suppression issue, due to the lack of supervision from detailed signals, largely impedes the wide deployment of these methods in real-world applications. Herein, we propose a novel Cycle-Interactive Generative Adversarial Network (CIGAN) for unsupervised low-light image enhancement, which is capable of not only better transferring illumination distributions between low/normal-light images but also manipulating detailed signals between the two domains, e.g., suppressing/synthesizing realistic noise in the cyclic enhancement/degradation process. In particular, the proposed low-light guided transformation feed-forwards the features of low-light images from the generator of the enhancement GAN (eGAN) into the generator of the degradation GAN (dGAN). With the learned information of real low-light images, dGAN can synthesize more realistic diverse illumination and contrast in low-light images. Moreover, the feature randomized perturbation module in dGAN learns to increase the feature randomness to produce diverse feature distributions, persuading the synthesized low-light images to contain realistic noise. Extensive experiments demonstrate both the superiority of the proposed method and the effectiveness of each module in CIGAN.

Skeleton2Humanoid: Animating Simulated Characters for Physically-plausible Motion In-betweening

  • Yunhao Li
  • Zhenbo Yu
  • Yucheng Zhu
  • Bingbing Ni
  • Guangtao Zhai
  • Wei Shen

Human motion synthesis is a long-standing problem with various applications in digital twins and the Metaverse. However, modern deep learning based motion synthesis approaches barely consider the physical plausibility of synthesized motions, and consequently they usually produce unrealistic human motions. In order to solve this problem, we propose a system "Skeleton2Humanoid" which performs physics-oriented motion correction at test time by regularizing synthesized skeleton motions in a physics simulator. Concretely, our system consists of three sequential stages: (I) test-time motion synthesis network adaptation, (II) skeleton to humanoid matching and (III) motion imitation based on reinforcement learning (RL). Stage I introduces a test-time adaptation strategy, which improves the physical plausibility of synthesized human skeleton motions by optimizing skeleton joint locations. Stage II performs an analytical inverse kinematics strategy, which converts the optimized human skeleton motions to humanoid robot motions in a physics simulator; the converted humanoid robot motions can then serve as reference motions for the RL policy to imitate. Stage III introduces a curriculum residual force control policy, which drives the humanoid robot to mimic complex converted reference motions in accordance with physical laws. We verify our system on a typical human motion synthesis task, motion in-betweening. Experiments on the challenging LaFAN1 dataset show our system can outperform prior methods significantly in terms of both physical plausibility and accuracy. Code will be released for research purposes at:

Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression

  • Jiahao Li
  • Bin Li
  • Yan Lu

For a neural video codec, it is critical, yet challenging, to design an efficient entropy model which can accurately predict the probability distribution of the quantized latent representation. However, most existing video codecs directly use the ready-made entropy model from image codecs to encode the residual or motion, and do not fully leverage the spatial-temporal characteristics of video. To this end, this paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies. In particular, we introduce the latent prior, which exploits the correlation among latent representations to squeeze out temporal redundancy. Meanwhile, the dual spatial prior is proposed to reduce spatial redundancy in a parallel-friendly manner. In addition, our entropy model is versatile. Besides estimating the probability distribution, it also generates the quantization step in a spatial-channel-wise manner. This content-adaptive quantization mechanism not only helps our codec achieve smooth rate adjustment in a single model but also improves the final rate-distortion performance through dynamic bit allocation. Experimental results show that, powered by the proposed entropy model, our neural codec can achieve 18.2% bitrate saving on the UVG dataset when compared with H.266 (VTM) using the highest compression ratio configuration. This marks a new milestone in the development of neural video codecs. The codes are at

Geometric Warping Error Aware CNN for DIBR Oriented View Synthesis

  • Shuai Li
  • Kaixin Wang
  • Yanbo Gao
  • Xun Cai
  • Mao Ye

Depth Image based Rendering (DIBR) oriented view synthesis is an important virtual view generation technique. It warps the reference view images to the target viewpoint based on their depth maps, without requiring many available viewpoints. However, in the 3D warping process, pixels are warped to fractional pixel locations and then rounded (or interpolated) to integer pixels, resulting in geometric warping error and reducing the image quality. This resembles, to some extent, the image super-resolution problem, but with unfixed fractional pixel locations. To address this problem, we propose a geometric warping error aware CNN (GWEA) framework to enhance DIBR oriented view synthesis. First, a deformable convolution based geometric warping error aware alignment (GWEA-DCA) module is developed, taking advantage of the geometric warping error preserved in the DIBR module. The offset learned in the deformable convolution can account for the geometric warping error to facilitate the mapping from fractional pixels to integer pixels. Moreover, given that the pixels in the warped images are of different qualities due to the different strengths of warping errors, an attention enhanced view blending (GWEA-AttVB) module is further developed to adaptively fuse the pixels from different warped images. Finally, a partial convolution based hole filling and refinement module fills the remaining holes and improves the quality of the overall image. Experiments show that our model can synthesize higher-quality images than existing methods, and an ablation study validates the effectiveness of each proposed module.

SESSION: Poster Session VI: Experience -- Multimedia Applications

FedMed-ATL: Misaligned Unpaired Cross-Modality Neuroimage Synthesis via Affine Transform Loss

  • Jinbao Wang
  • Guoyang Xie
  • Yawen Huang
  • Yefeng Zheng
  • Yaochu Jin
  • Feng Zheng

The existence of completely aligned and paired multi-modal neuroimaging data has proved its effectiveness in the diagnosis of brain diseases. However, collecting a full set of well-aligned and paired data is impractical, due to practical difficulties including high cost, long acquisition time, image corruption, and privacy issues. Previously, misaligned unpaired neuroimaging data (termed MUD) were generally treated as noisy labels. However, such noisy label-based methods perform poorly when the misaligned data are severely distorted, for example, when the angle of rotation differs. In this paper, we propose a novel federated self-supervised learning (FedMed) method for brain image synthesis. An affine transform loss (ATL) is formulated to make use of severely distorted images without violating privacy legislation for the hospital. We then introduce a new data augmentation procedure for self-supervised training and feed it into three auxiliary heads, namely auxiliary rotation, auxiliary translation, and auxiliary scaling heads. The proposed method demonstrates advanced performance in the quality of synthesized results under a severely misaligned and unpaired data setting, as well as better stability than other GAN-based algorithms. The proposed method also reduces the demand for deformable registration while encouraging the use of misaligned and unpaired data. Experimental results verify the outstanding performance of our learning paradigm compared to other state-of-the-art approaches.

Towards Blind Watermarking: Combining Invertible and Non-invertible Mechanisms

  • Rui Ma
  • Mengxi Guo
  • Yi Hou
  • Fan Yang
  • Yuan Li
  • Huizhu Jia
  • Xiaodong Xie

Blind watermarking provides powerful evidence for copyright protection, image authentication, and tampering identification. However, it remains a challenge to design a watermarking model with high imperceptibility and robustness against strong noise attacks. To resolve this issue, we present a framework Combining the Invertible and Non-invertible (CIN) mechanisms. The CIN is composed of an invertible part to achieve high imperceptibility and a non-invertible part to strengthen robustness against strong noise attacks. For the invertible part, we develop a diffusion and extraction module (DEM) and a fusion and split module (FSM) to embed and extract watermarks symmetrically in an invertible way. For the non-invertible part, we introduce a non-invertible attention-based module (NIAM) and a noise-specific selection module (NSM) to solve the asymmetric extraction under a strong noise attack. Extensive experiments demonstrate that our framework significantly outperforms the current state-of-the-art methods in imperceptibility and robustness. Our framework achieves an average of 99.99% accuracy and 67.66 dB PSNR under noise-free conditions, and 96.64% accuracy and 39.28 dB PSNR under combined strong noise attacks. The code will be available in

Improving Transferability for Domain Adaptive Detection Transformers

  • Kaixiong Gong
  • Shuang Li
  • Shugang Li
  • Rui Zhang
  • Chi Harold Liu
  • Qiang Chen

DETR-style detectors stand out in in-domain scenarios, but their properties in domain shift settings are under-explored. This paper aims to build a simple but effective baseline with a DETR-style detector in domain shift settings based on two findings. For one, mitigating the domain shift on the backbone and the decoder output features excels at getting favorable results. For another, advanced domain alignment methods in both parts further enhance the performance. Thus, we propose the Object-Aware Alignment (OAA) module and the Optimal Transport based Alignment (OTA) module to achieve comprehensive domain alignment on the outputs of the backbone and the detector. The OAA module aligns the foreground regions identified by pseudo-labels in the backbone outputs, leading to domain-invariant base features. The OTA module utilizes sliced Wasserstein distance to maximize the retention of location information while minimizing the domain gap in the decoder outputs. We implement the findings and the alignment modules in our adaptation method, which benchmarks the DETR-style detector in domain shift settings. Experiments on various domain adaptive scenarios validate the effectiveness of our method.

Support for Teaching Mathematics of the Blind by Sighted Tutors Through Multisensual Access to Formulas with Braille Converters and Speech

  • Dariusz Mikulowski

Nowadays, teaching various subjects at school is successfully supported by information and remote technologies such as Google Classroom, Moodle and others. Nevertheless, students with special needs, such as the blind and visually impaired (BVI), face considerable barriers to using such remote technologies, especially when learning mathematics or physics. The main problem is that BVI students use different tools and techniques than their sighted peers, e.g., a different way of working with mathematical expressions, or no possibility to edit graphics. Traditional aids such as the Brailler, figure models or cubarithms are still used. Another challenge is that there are entirely different systems for presenting formulas in different countries, the so-called Braille mathematical notations. To overcome these barriers, we propose universal tools to assist sighted teachers and BVI students in remote math training using a multimodal form of editing mathematical formulas. It consists of the simultaneous combination of three forms of presentation of math formulas: a graphical form for the teacher, intelligent reading through speech synthesis, and Braille mathematical notation for BVI students. This is possible thanks to intelligent converters between formats such as MathML, intelligent text and Braille, and dedicated editors that allow students and teachers to create math documents.

Geometry Aligned Variational Transformer for Image-conditioned Layout Generation

  • Yunning Cao
  • Ye Ma
  • Min Zhou
  • Chuanbin Liu
  • Hongtao Xie
  • Tiezheng Ge
  • Yuning Jiang

Layout generation is a novel task in computer vision, which combines the challenges of both object localization and aesthetic appraisal, and is widely used in advertisement, poster and slide design. An accurate and pleasant layout should consider both the intra-domain relationship within layout elements and the inter-domain relationship between layout elements and the image. However, most previous methods simply focus on image-content-agnostic layout generation, without leveraging the complex visual information from the image. To this end, we explore a novel paradigm entitled image-conditioned layout generation, which aims to add text overlays to an image in a semantically coherent manner. Specifically, we propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image. First, a self-attention mechanism is adopted to model the contextual relationship within layout elements, while a cross-attention mechanism is used to fuse the visual information of conditional images. Subsequently, we take them as building blocks of a conditional variational autoencoder (CVAE), which demonstrates appealing diversity. Second, in order to alleviate the gap between the layout elements domain and the visual domain, we design a Geometry Alignment module, in which the geometric information of the image is aligned with the layout representation. In addition, we construct a large-scale advertisement poster layout designing dataset with delicate layout and saliency map annotations. Experimental results show that our model can adaptively generate layouts in the non-intrusive area of the image, resulting in a harmonious layout design.

PVSeRF: Joint Pixel-, Voxel- and Surface-Aligned Radiance Field for Single-Image Novel View Synthesis

  • Xianggang Yu
  • Jiapeng Tang
  • Yipeng Qin
  • Chenghong Li
  • Xiaoguang Han
  • Linchao Bao
  • Shuguang Cui

We present PVSeRF, a learning framework that reconstructs neural radiance fields from single-view RGB images, for novel view synthesis. Previous solutions, such as pixelNeRF, rely only on pixel-aligned features and suffer from feature ambiguity issues. As a result, they struggle with the disentanglement of geometry and appearance, leading to implausible geometries and blurry results. To address this challenge, we propose to incorporate explicit geometry reasoning and combine it with pixel-aligned features for radiance field prediction. Specifically, in addition to pixel-aligned features, we further constrain the radiance field learning to be conditioned on i) voxel-aligned features learned from a coarse volumetric grid and ii) fine surface-aligned features extracted from a regressed point cloud. We show that the introduction of such geometry-aware features helps to achieve a better disentanglement between appearance and geometry, i.e. recovering more accurate geometries and synthesizing higher quality images of novel views. Extensive experiments against state-of-the-art methods on ShapeNet benchmarks demonstrate the superiority of our approach for single-image novel view synthesis.

Cross-Modality High-Frequency Transformer for MR Image Super-Resolution

  • Chaowei Fang
  • Dingwen Zhang
  • Liang Wang
  • Yulun Zhang
  • Lechao Cheng
  • Junwei Han

Improving the resolution of magnetic resonance (MR) image data is critical to computer-aided diagnosis and brain function analysis. Higher resolution helps to capture more detailed content, but typically leads to a lower signal-to-noise ratio and longer scanning time. To this end, MR image super-resolution has become a topic of wide interest in recent years. Existing works establish extensive deep models with conventional architectures based on convolutional neural networks (CNN). In this work, to further advance this research field, we make an early effort to build a Transformer-based MR image super-resolution framework, with careful designs on exploiting valuable domain prior knowledge. Specifically, we consider two-fold domain priors, including the high-frequency structure prior and the inter-modality context prior, and establish a novel Transformer architecture, called Cross-modality high-frequency Transformer (Cohf-T), to introduce such priors into super-resolving the low-resolution (LR) MR images. Experiments on two datasets indicate that Cohf-T achieves new state-of-the-art performance.

Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image Generation

  • Xintian Wu
  • Hanbin Zhao
  • Liangli Zheng
  • Shouhong Ding
  • Xi Li

As a challenging task, text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions. Existing methods mainly extract the text information from only one sentence to represent an image, and this text representation strongly affects the quality of the generated image. However, directly utilizing the limited information in one sentence misses some key attribute descriptions, which are crucial factors for describing an image accurately. To alleviate the above problem, we propose an effective text representation method with the complement of attribute information. Firstly, we construct an attribute memory to jointly control the text-to-image generation with sentence input. Secondly, we explore two update mechanisms, sample-aware and sample-joint mechanisms, to dynamically optimize a generalized attribute memory. Furthermore, we design an attribute-sentence-joint conditional generator learning scheme to align the feature embeddings among multiple representations, which promotes cross-modal network training. Experimental results illustrate that the proposed method obtains substantial performance improvements on both the CUB (FID from 14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.

Efficient Multiple Kernel Clustering via Spectral Perturbation

  • Chang Tang
  • Zhenglai Li
  • Weiqing Yan
  • Guanghui Yue
  • Wei Zhang

Clustering is a fundamental task in the machine learning and data mining community. Among existing clustering methods, multiple kernel clustering (MKC) has been widely investigated due to its effectiveness in capturing non-linear relationships among samples. However, most existing MKC methods incur intensive computational complexity in learning an optimal kernel and seeking the final clustering partition. In this paper, based on spectral perturbation theory, we propose an efficient MKC method that reduces the computational complexity from O(n^3) to O(nk^2 + k^3), with n and k denoting the number of data samples and the number of clusters, respectively. The proposed method recovers the optimal clustering partition from base partitions by maximizing the eigen gaps to approximate the perturbation errors. An equivalent optimization objective function is introduced to obtain the base partitions. Furthermore, a kernel weighting scheme is embedded to capture the diversity among multiple kernels. Finally, the optimal partition, base partitions, and kernel weights are jointly learned in a unified framework. An efficient alternate iterative optimization algorithm is designed to solve the resultant optimization problem. Experimental results on various benchmark datasets demonstrate the superiority of the proposed method over other state-of-the-art ones in terms of both clustering efficacy and efficiency.
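As a rough illustration of the eigen-gap criterion mentioned in the abstract, the following NumPy sketch computes the k-th eigen gap of a symmetric affinity matrix; a large gap indicates a k-cluster partition that is stable under perturbation. This is an illustrative fragment, not the authors' implementation, and the `eigen_gap` interface is hypothetical.

```python
import numpy as np

def eigen_gap(affinity, k):
    """k-th eigen gap of a symmetric affinity matrix: the difference between
    the k-th and (k+1)-th largest eigenvalues. A larger gap suggests the
    k-cluster structure is more robust to perturbation (illustrative sketch)."""
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    w = np.linalg.eigh(affinity)[0][::-1]  # sort descending
    return w[k - 1] - w[k]
```

For example, a block-diagonal affinity with two dense blocks yields a large gap at k = 2, while an identity matrix (no cluster structure) yields a zero gap.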

DOMFN: A Divergence-Orientated Multi-Modal Fusion Network for Resume Assessment

  • Yang Yang
  • Jingshuai Zhang
  • Fan Gao
  • Xiaoru Gao
  • Hengshu Zhu

In talent management, resume assessment aims to analyze the quality of a job seeker's resume, which can assist recruiters in discovering suitable candidates and help job seekers improve their resume quality in return. Recent machine learning based methods on large-scale public resume datasets have provided the opportunity for automatic assessment to reduce manual costs. However, most existing approaches are still content-dominated and ignore other valuable information. Inspired by practical resume evaluations that consider both the content and the layout, we construct multiple modalities from resumes but face a new challenge: sometimes the performance of multi-modal fusion is even worse than that of the best uni-modality. In this paper, we experimentally find that this phenomenon is due to cross-modal divergence. This raises the question: when is it appropriate to perform multi-modal fusion? To address this problem, we design an instance-aware fusion method, i.e., the Divergence-Orientated Multi-Modal Fusion Network (DOMFN), which can adaptively fuse the uni-modal predictions and the multi-modal prediction based on cross-modal divergence. Specifically, DOMFN computes a functional penalty score to measure the divergence of cross-modal predictions. The learned divergence is then used to decide whether to conduct multi-modal fusion and is adopted into an amended loss for reliable training. Consequently, DOMFN rejects the multi-modal prediction when the cross-modal divergence is too large, avoiding overall performance degradation and thus achieving better performance than the uni-modalities. In experiments, qualitative comparison with baselines on a real-world dataset demonstrates the superiority and explainability of the proposed DOMFN; e.g., we find a meaningful phenomenon that multi-modal fusion has positive effects for assessing resumes from UI Designer and Enterprise Service positions, whereas it hinders the assessment of Technology and Product Operation positions.
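The divergence-gated fusion idea can be sketched as follows, using the Jensen-Shannon divergence as a stand-in for the paper's functional penalty score. The function names, the threshold, and the averaging fusion are illustrative assumptions, not the authors' code.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two prediction distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gated_fusion(p_content, p_layout, threshold=0.3):
    """Fuse two uni-modal predictions only when their divergence is small;
    otherwise fall back to the more confident uni-modal prediction
    (illustrative sketch of instance-aware, divergence-gated fusion)."""
    if js_divergence(p_content, p_layout) > threshold:
        # Reject fusion: keep the uni-modal prediction with higher confidence.
        return p_content if p_content.max() >= p_layout.max() else p_layout
    return 0.5 * (p_content + p_layout)
```

When the two modalities agree, the gate passes the averaged prediction; when they strongly disagree, the fused result is discarded in favor of the stronger uni-modal one, mirroring the rejection behavior described above.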

Generative Steganography Network

  • Ping Wei
  • Sheng Li
  • Xinpeng Zhang
  • Ge Luo
  • Zhenxing Qian
  • Qing Zhou

Steganography usually modifies cover media to embed secret data. A new steganographic approach called generative steganography (GS) has emerged recently, in which stego images (images containing secret data) are generated directly from secret data without cover media. However, existing GS schemes are often criticized for their poor performance. In this paper, we propose an advanced generative steganography network (GSN) that can generate realistic stego images without using cover images. We first introduce a mutual information mechanism into GS, which helps achieve high secret extraction accuracy. Our model contains four sub-networks, i.e., an image generator (G), a discriminator (D), a steganalyzer (S), and a data extractor (E). D and S act as two adversarial discriminators to ensure the visual quality and security of generated stego images. E extracts the hidden secret from generated stego images. The generator G is flexibly constructed to synthesize either cover or stego images with different inputs, which facilitates covert communication by concealing the function of generating stego images in a normal generator. A module named the secret block is designed to hide secret data in the feature maps during image generation, with which high hiding capacity and image fidelity are achieved. In addition, a novel hierarchical gradient decay (HGD) technique is developed to resist steganalysis detection. Experiments demonstrate the superiority of our work over existing methods.

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

  • Haiping Wang
  • Yuan Liu
  • Zhen Dong
  • Wenping Wang

In this paper, we propose a novel local descriptor-based framework, called You Only Hypothesize Once (YOHO), for the registration of two unaligned point clouds. In contrast to most existing local descriptors, which rely on a fragile local reference frame to gain rotation invariance, the proposed descriptor achieves rotation invariance via recent techniques for group-equivariant feature learning, which brings more robustness to point density and noise. Meanwhile, the descriptor in YOHO also has a rotation-equivariant part, which enables us to estimate the registration from just one correspondence hypothesis. This property reduces the search space of feasible transformations, thus greatly improving both the accuracy and the efficiency of YOHO. Extensive experiments show that YOHO achieves superior performance with far fewer RANSAC iterations on four widely-used datasets: the 3DMatch/3DLoMatch datasets, the ETH dataset, and the WHU-TLS dataset. More details are available on our project page:

Disentangled Representation Learning for Multimodal Emotion Recognition

  • Dingkang Yang
  • Shuai Huang
  • Haopeng Kuang
  • Yangtao Du
  • Lihua Zhang

Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between different modalities or design sophisticated fusion strategies. However, distribution gaps and information redundancy often exist across heterogeneous modalities, so the learned multimodal representations may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns common and private feature representations for each modality. Specifically, we design common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace aims to explore the commonality among different modalities and sufficiently reduce the distribution gap. The modality-specific subspaces attempt to enhance diversity and capture the unique characteristics of each modality. After that, a modality discriminator is introduced to guide the parameter learning of the common and private encoders in an adversarial manner. We achieve modality consistency and disparity constraints by designing tailored losses for the above subspaces. Furthermore, we present a cross-modal attention fusion module to learn adaptive weights for obtaining effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that FDMER outperforms the state-of-the-art methods on two multimodal emotion recognition benchmarks. Moreover, we further verify the effectiveness of our model via experiments on the multimodal humor detection task.

Relative Alignment Network for Source-Free Multimodal Video Domain Adaptation

  • Yi Huang
  • Xiaoshan Yang
  • Ji Zhang
  • Changsheng Xu

Video domain adaptation aims to transfer knowledge from labeled source videos to unlabeled target videos. Existing video domain adaptation methods require full access to the source videos to reduce the domain gap between the source and target videos, which is impractical in real scenarios where the source videos are unavailable due to transmission-efficiency or privacy concerns. To address this problem, in this paper, we propose to solve a source-free domain adaptation task for videos, where only a pre-trained source model and unlabeled target videos are available for learning a multimodal video classification model. Existing source-free domain adaptation methods cannot be directly applied to this task, since videos suffer from domain discrepancy along both the multimodal and the temporal aspects, which makes domain adaptation especially difficult when the source data are unavailable. In this paper, we propose a Multimodal and Temporal Relative Alignment Network (MTRAN) to deal with the above challenges. To explicitly imitate the domain shifts contained in the multimodal information and the temporal dynamics of the source and target videos, we divide the target videos into two splits according to the self-entropy values of their classification results. The low-entropy videos are deemed source-like, while the high-entropy videos are deemed target-like. Then, we adopt a self-entropy-guided MixUp strategy to generate synthetic samples and hypothetical samples at the instance level based on the source-like and target-like videos, and push each synthetic sample to be similar to the corresponding hypothetical sample, which is slightly closer to the source-like videos than the synthetic sample, via multimodal and temporal relative alignment schemes. We evaluate the proposed model on four public video datasets. The results show that our model outperforms existing state-of-the-art methods.

PRO-Face: A Generic Framework for Privacy-preserving Recognizable Obfuscation of Face Images

  • Lin Yuan
  • Linguo Liu
  • Xiao Pu
  • Zhao Li
  • Hongbo Li
  • Xinbo Gao

A number of applications (e.g., video surveillance and authentication) rely on automated face recognition to guarantee the functioning of secure services and, meanwhile, have to take into account the privacy of individuals exposed under camera systems. This is the so-called privacy-utility trade-off. However, most existing approaches to facial privacy protection focus on removing identifiable visual information from images, leaving the protected face unrecognizable to machines and thus sacrificing utility for privacy. To tackle the privacy-utility challenge, we propose a novel, generic, effective, yet lightweight framework for Privacy-preserving Recognizable Obfuscation of Face images (named PRO-Face). The framework allows one to first process a face image using any preferred obfuscation, such as image blur, pixelation, or face morphing. It then leverages a Siamese network to fuse the original image with its obfuscated form, generating a final protected image that is visually similar to the obfuscated one from human perception (for privacy) but still recognized as the original identity by machines (for utility). The framework supports various obfuscations for facial anonymization. Face recognition can be performed accurately not only across anonymized images but also between plain and anonymized ones, based only on pre-trained recognizers. These properties constitute the "generic" merit of the proposed framework. In-depth objective and subjective evaluations demonstrate the effectiveness of the proposed framework in both privacy protection and utility preservation under distinct scenarios. Our source code, models, and supplementary materials are made publicly available.

Skeleton-based Action Recognition via Adaptive Cross-Form Learning

  • Xuanhan Wang
  • Yan Dai
  • Lianli Gao
  • Jingkuan Song

Skeleton-based action recognition aims to project skeleton sequences to action categories, where the skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods that explore single-form skeletons via Graph Convolutional Networks (GCNs), existing methods tend to improve GCNs by leveraging multi-form skeletons for their complementary cues. However, these methods (whether adapting the structure of GCNs or using model ensembles) require the co-existence of all skeleton forms during both the training and inference stages, whereas in real life typically only some forms are available at inference time. To tackle this, we present Adaptive Cross-Form Learning (ACFL), which empowers well-designed GCNs to generate complementary representations from single-form skeletons without changing model capacity. Specifically, each GCN model in ACFL not only learns action representations from single-form skeletons, but also adaptively mimics useful representations derived from other skeleton forms. In this way, each GCN learns how to strengthen what it has learned, thereby exploiting model potential and facilitating action recognition. Extensive experiments conducted on three challenging benchmarks, i.e., NTU-RGB+D 120, NTU-RGB+D 60 and UAV-Human, demonstrate the effectiveness and generalizability of our method. Specifically, ACFL significantly improves various GCN models (i.e., CTR-GCN, MS-G3D, and Shift-GCN), achieving a new record for skeleton-based action recognition.
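The cross-form mimicry can be viewed as a distillation-style objective; a minimal sketch follows, assuming a KL divergence between temperature-softened predictions of two skeleton forms. The temperature and function names are illustrative, not the ACFL code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_form_mimic_loss(logits_a, logits_b, tau=2.0):
    """KL(soft targets of form B || predictions of form A): the model for
    one skeleton form mimics representations learned from another form
    (a distillation-style sketch; tau is an illustrative temperature)."""
    p = softmax(logits_b / tau)  # "teacher": another skeleton form
    q = softmax(logits_a / tau)  # "student": current form
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

The loss vanishes when the two forms already agree and grows with their disagreement, so each GCN is pulled toward the complementary cues of the other forms.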

Sample Weighted Multiple Kernel K-means via Min-Max optimization

  • Yi Zhang
  • Weixuan Liang
  • Xinwang Liu
  • Sisi Dai
  • Siwei Wang
  • Liyang Xu
  • En Zhu

A representative multiple kernel clustering (MKC) algorithm, termed simple multiple kernel k-means (SMKKM), was recently proposed to optimally mine useful information from a set of pre-specified kernels to improve clustering performance. Different from the existing min-min learning framework, it adopts a novel min-max optimization paradigm, which has attracted considerable attention in the community. Despite its encouraging success, we observe that SMKKM focuses only on combination coefficients among kernels and ignores the relative importance of different samples. As a result, it does not sufficiently consider the different contributions of each sample to clustering, and thus cannot effectively obtain the "ideal" similarity structure, leading to unsatisfying performance. To address this issue, this paper proposes a novel sample weighted multiple kernel k-means via min-max optimization (SWMKKM), which represents each sample's weight by the aggregate relationships between that sample and the others. Such a weighting criterion helps the clustering algorithm pay more attention to samples with more positive effects on clustering and avoids unreliable overestimation of samples with poor quality. Based on SMKKM, we adopt a reduced gradient algorithm with proved convergence to solve the resultant optimization problem. Comprehensive experiments on multiple benchmark datasets demonstrate that our proposed SWMKKM dramatically improves on state-of-the-art MKC algorithms, verifying the effectiveness of our proposed sample weighting criterion.

MIntRec: A New Dataset for Multimodal Intent Recognition

  • Hanlei Zhang
  • Hua Xu
  • Xin Wang
  • Qianrui Zhou
  • Shaojie Zhao
  • Jiayan Teng

Multimodal intent recognition is a significant task for understanding human language in real-world multimodal scenes. Most existing intent recognition methods have limitations in leveraging the multimodal information due to the restrictions of the benchmark datasets with only text information. This paper introduces a novel dataset for multimodal intent recognition (MIntRec) to address this issue. It formulates coarse-grained and fine-grained intent taxonomies based on the data collected from the TV series Superstore. The dataset consists of 2,224 high-quality samples with text, video, and audio modalities and has multimodal annotations among twenty intent categories. Furthermore, we provide annotated bounding boxes of speakers in each video segment and achieve an automatic process for speaker annotation. MIntRec is helpful for researchers to mine relationships between different modalities to enhance the capability of intent recognition. We extract features from each modality and model cross-modal interactions by adapting three powerful multimodal fusion methods to build baselines. Extensive experiments show that employing the non-verbal modalities achieves substantial improvements compared with the text-only modality, demonstrating the effectiveness of using multimodal information for intent recognition. The gap between the best-performing methods and humans indicates the challenge and importance of this task for the community. The full dataset and codes are available for use at

Adaptive Transformer-Based Conditioned Variational Autoencoder for Incomplete Social Event Classification

  • Zhangming Li
  • Shengsheng Qian
  • Jie Cao
  • Quan Fang
  • Changsheng Xu

With the rapid development of the Internet and the expanding scale of social media, incomplete social event classification has increasingly become a challenging task. The key to incomplete social event classification is to accurately leverage the image-level and text-level information. However, most existing approaches suffer from the following limitations: (1) Most generative models use the available features to generate the incomplete modality features for social event classification while ignoring the rich semantic label information. (2) The majority of existing multi-modal methods simply concatenate the coarse-grained image features and text features of the event to obtain the multi-modal features for classifying social events, which ignores irrelevant multi-modal features and limits their modeling capabilities. To tackle these challenges, in this paper, we propose an Adaptive Transformer-Based Conditioned Variational Autoencoder Network (AT-CVAE) for incomplete social event classification. In the AT-CVAE, we propose a novel Transformer-based conditioned variational autoencoder to jointly model the textual information, visual information, and label information in a unified deep model, which can generate more discriminative latent features and enhance the performance of incomplete social event classification. Furthermore, a mixture-of-experts mechanism is utilized to dynamically acquire the weight of each piece of multi-modal information, which can better filter out irrelevant multi-modal information and capture the vitally important information. Extensive experiments are conducted on two public event datasets, demonstrating the superior performance of our AT-CVAE method.

Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

  • Dingkang Yang
  • Haopeng Kuang
  • Shuai Huang
  • Lihua Zhang

Understanding human behaviors and intents from videos is a challenging task. Video flows usually involve time-series data from different modalities, such as natural language, facial gestures, and acoustic information. Due to the variable receiving frequency for sequences from each modality, the collected multimodal streams are usually unaligned. For multimodal fusion of asynchronous sequences, the existing methods focus on projecting multiple modalities into a common latent space and learning the hybrid representations, which neglects the diversity of each modality and the commonality across different modalities. Motivated by this observation, we propose a Multimodal Fusion approach for learning modality-Specific and modality-Agnostic representations (MFSA) to refine multimodal representations and leverage the complementarity across different modalities. Specifically, a predictive self-attention module is used to capture reliable contextual dependencies and enhance the unique features over the modality-specific spaces. Meanwhile, we propose a hierarchical cross-modal attention module to explore the correlations between cross-modal elements over the modality-agnostic space. In this case, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, the modality-specific and -agnostic multimodal representations are used together for downstream tasks. Comprehensive experiments on three multimodal datasets clearly demonstrate the superiority of our approach.

DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields

  • Zijin Wu
  • Xingyi Li
  • Juewen Peng
  • Hao Lu
  • Zhiguo Cao
  • Weicai Zhong

Neural Radiance Field (NeRF) and its variants have exhibited great success in representing 3D scenes and synthesizing photo-realistic novel views. However, they are generally based on the pinhole camera model and assume all-in-focus inputs. This limits their applicability, as images captured in the real world often have finite depth-of-field (DoF). To mitigate this issue, we introduce DoF-NeRF, a novel neural rendering approach that can deal with shallow DoF inputs and simulate the DoF effect. In particular, it extends NeRF to simulate the aperture of a lens following the principles of geometric optics. Such a physical guarantee allows DoF-NeRF to operate on views with different focus configurations. Benefiting from explicit aperture modeling, DoF-NeRF also enables direct manipulation of the DoF effect by adjusting virtual aperture and focus parameters. It is plug-and-play and can be inserted into NeRF-based frameworks. Experiments on synthetic and real-world datasets show that DoF-NeRF not only performs comparably with NeRF in the all-in-focus setting, but can also synthesize all-in-focus novel views conditioned on shallow DoF inputs. An interesting application of DoF-NeRF to DoF rendering is also demonstrated. The source code will be made available at:
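The geometric-optics intuition can be sketched with a heavily simplified blur model: a point off the focal plane projects to a disk whose radius grows with the aperture and the amount of defocus. This formula is a simplified illustration only, not DoF-NeRF's actual lens model.

```python
def circle_of_confusion(depth, focus_dist, aperture):
    """Blur-disk radius of a scene point under a heavily simplified
    thin-lens-style model: zero on the focal plane, growing with aperture
    and defocus. Constants and units are illustrative assumptions."""
    return aperture * abs(depth - focus_dist) / depth
```

Note that a zero aperture recovers the pinhole model NeRF assumes: every depth is in focus.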

RKformer: Runge-Kutta Transformer with Random-Connection Attention for Infrared Small Target Detection

  • Mingjin Zhang
  • Haichen Bai
  • Jing Zhang
  • Rui Zhang
  • Chaoyue Wang
  • Jie Guo
  • Xinbo Gao

Infrared small target detection (IRSTD) refers to segmenting small targets from infrared images, which is of great significance in practical applications. However, due to the small scale of the targets as well as noise and clutter in the background, current deep neural network-based methods struggle to extract features with discriminative semantics while preserving fine details. In this paper, we address this problem by proposing a novel RKformer model with an encoder-decoder structure, where four specifically designed Runge-Kutta transformer (RKT) blocks are stacked sequentially in the encoder. Technically, it has three key designs. First, we adopt a parallel encoder block (PEB) of transformer and convolution to exploit their respective advantages in long-range dependency modeling and locality modeling, extracting semantics while preserving details. Second, we propose a novel random-connection attention (RCA) block, which has a reservoir structure to learn sparse attention via random connections during training. RCA encourages the target to attend to sparse relevant positions instead of all the large-area background pixels, resulting in more informative attention scores. It has fewer parameters and computations than the original self-attention in the transformer while performing better. Third, inspired by neural ordinary differential equations (ODEs), we stack two PEBs with several residual connections as the basic encoder block to implement the Runge-Kutta method for solving ODEs, which can effectively enhance features and suppress noise. Experiments on the public NUAA-SIRST and IRSTD-1k datasets demonstrate the superiority of the RKformer over state-of-the-art methods.
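The correspondence between stacked residual blocks and the Runge-Kutta method can be illustrated with one explicit-midpoint (second-order Runge-Kutta) step, where `f` stands in for a learned encoder block. This is an illustrative sketch of the numerical scheme, not the RKformer code.

```python
import numpy as np

def rk2_block(x, f, h=1.0):
    """One explicit-midpoint (second-order Runge-Kutta) step for dx/dt = f(x).
    Two stacked blocks with the right residual connections can realize this
    update; here f is a placeholder for a learned block (illustrative)."""
    k1 = f(x)                   # first block evaluation
    k2 = f(x + 0.5 * h * k1)    # second block at the midpoint estimate
    return x + h * k2           # residual update
```

For the linear ODE dx/dt = -x starting at x = 1 with step h = 1, the midpoint step yields 0.5, a second-order approximation of the exact value e^{-1} ≈ 0.368.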

Self-Supervised Human Pose based Multi-Camera Video Synchronization

  • Liqiang Yin
  • Ruize Han
  • Wei Feng
  • Song Wang

Multi-view video collaborative analysis is an important task with many applications in the multimedia community. However, it requires the given videos to be temporally synchronized. Existing methods commonly synchronize the videos through wired communication, which may hinder practical application in the real world, especially for moving cameras. In this paper, we focus on human-centric video analysis and propose a self-supervised framework for automatic multi-camera video synchronization. Specifically, we develop SeSyn-Net, which takes the 2D human pose as input for feature embedding, and design a series of self-supervised losses to effectively extract a view-invariant but time-discriminative representation for video synchronization. We also build two new datasets for performance evaluation. Extensive experimental results verify the effectiveness of our method, which achieves superior performance compared to both classical and state-of-the-art methods.

Energy-Based Domain Generalization for Face Anti-Spoofing

  • Zhekai Du
  • Jingjing Li
  • Lin Zuo
  • Lei Zhu
  • Ke Lu

With various unforeseeable face presentation attacks (PAs) springing up, face anti-spoofing (FAS) urgently needs to generalize to unseen scenarios. Research on generalizable FAS has lately attracted growing attention. Existing methods cast FAS as a vanilla binary classification problem and address it with a standard discriminative classifier p(y|x) under a domain generalization framework. However, discriminative models are unreliable for samples far away from the training distribution. In this paper, we resort to an energy-based model (EBM) to tackle FAS from a generative perspective. Our motivation is to model the joint density p(x,y), which allows us to compute not only p(y|x) but also p(x). Due to the intractability of direct modeling, we use EBMs as an alternative for probabilistic estimation. With energy-based training, real faces are encouraged to have low free energy, associated with the marginal probability p(x) of real faces, and all samples with high free energy are regarded as fake faces, thus rejecting any kind of PA outside the distribution of real faces. To learn to generalize to unseen domains, we generate diverse and novel populations in feature space under the guidance of the energy model. Our model is updated in a meta-learning schema, where the original source samples are utilized for meta-training and the generated ones for meta-testing. We validate our method on four widely used FAS datasets. Comprehensive experimental results demonstrate the effectiveness of our method compared with state-of-the-art methods.
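The free-energy view can be sketched as follows: under the EBM interpretation of a classifier, the free energy of a sample is the negative log-sum-exp of its logits, and high-free-energy samples are rejected as presentation attacks. This is a minimal illustration; the threshold and function names are assumptions, not the authors' code.

```python
import numpy as np

def free_energy(logits):
    """Free energy under the EBM view of a classifier:
    F(x) = -log sum_y exp(f_y(x)). Low values indicate in-distribution
    (real-face-like) samples. Computed with a numerically stable shift."""
    m = logits.max(axis=-1, keepdims=True)
    return -(m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1)))

def reject_spoof(logits, tau):
    # Flag samples whose free energy exceeds a threshold as presentation attacks.
    return free_energy(logits) > tau
```

A confident in-distribution sample (one large logit) gets very low free energy, while an ambiguous sample gets higher free energy and is rejected.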

Revisiting Stochastic Learning for Generalizable Person Re-identification

  • Jiajian Zhao
  • Yifan Zhao
  • Xiaowu Chen
  • Jia Li

Generalizable person re-identification aims to achieve good generalization capability on target domains without accessing target data. Existing methods focus on suppressing domain-specific information or simulating unseen environments with meta-learning strategies, which could impair the ability to capture fine-grained visual patterns or lead to overfitting through the repetitive training of episodes. In this paper, we revisit stochastic behaviors from two different perspectives: 1) Stochastic splitting-sliding sampler. It splits domain sources into subsets of approximately equal sample size and selects several subsets from various sources with a sliding window, forcing the model to step out of local minima under stochastic sources. 2) Variance-varying gradient dropout. Gradients in parts of the network are also selected by a sliding window and multiplied by binary masks generated from a Bernoulli distribution, giving the gradients varying variance and preventing the model from settling into local minima. By applying these two proposed stochastic behaviors, the model achieves better generalization performance on unseen target domains without any additional computation costs or auxiliary modules. Extensive experiments demonstrate that our proposed model is effective and outperforms state-of-the-art methods on public domain-generalizable person Re-ID benchmarks.
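The gradient-dropout behavior can be sketched as follows. This is a minimal NumPy illustration of Bernoulli gradient masking with rescaling; the interface is hypothetical and the sliding-window selection of network parts is omitted.

```python
import numpy as np

def bernoulli_gradient_dropout(grads, keep_prob=0.9, rng=None):
    """Multiply gradients by Bernoulli masks, injecting variance that can
    help escape local minima (a sketch of the variance-varying gradient
    dropout idea; names and rescaling choice are illustrative)."""
    rng = np.random.default_rng(rng)
    mask = rng.binomial(1, keep_prob, size=grads.shape)
    # Rescale so the expected gradient is unchanged (inverted-dropout style).
    return grads * mask / keep_prob
```

Lowering `keep_prob` increases the variance of the surviving gradients while keeping their expectation fixed, which is the mechanism the paper exploits to avoid local minima.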

D2Animator: Dual Distillation of StyleGAN For High-Resolution Face Animation

  • Zhuo Chen
  • Chaoyue Wang
  • Haimei Zhao
  • Bo Yuan
  • Xiu Li

The style-based generator architectures (e.g. StyleGAN v1, v2) largely promote the controllability and explainability of Generative Adversarial Networks (GANs). Many researchers have applied the pretrained style-based generators to image manipulation and video editing by exploring the correlation between linear interpolation in the latent space and semantic transformation in the synthesized image manifold. However, most previous studies focused on manipulating separate discrete attributes, which is insufficient to animate a still image to generate videos with complex and diverse poses and expressions. In this work, we devise a dual distillation strategy (D2Animator) for generating animated high-resolution face videos conditioned on identities and poses from different images. Specifically, we first introduce a Clustering-based Distiller (CluDistiller) to distill diverse interpolation directions in the latent space, and synthesize identity-consistent faces with various poses and expressions, such as blinking, frowning, looking up/down, etc. Then we propose an Augmentation-based Distiller (AugDistiller) that learns to encode arbitrary face deformation into a combination of interpolation directions via training on augmentation samples synthesized by CluDistiller. Through assembling the two distillation methods, D2Animator can generate high-resolution face animation videos without training on video sequences. Extensive experiments on self-driving, cross-identity and sequence-driving tasks demonstrate the superiority of the proposed D2Animator over existing StyleGAN manipulation and face animation methods in both generation quality and animation fidelity.

Adaptive Hierarchical Pooling for Weakly-supervised Sound Event Detection

  • Lijian Gao
  • Ling Zhou
  • Qirong Mao
  • Ming Dong

In Weakly-supervised Sound Event Detection (WSED), the ground truth of training data contains the presence or absence of each sound event only at the clip-level (i.e., no frame-level annotations). Recently, WSED has been formulated under the multi-instance learning framework, and a critical component within this formulation is the design of the temporal pooling function. In this paper, we propose an adaptive hierarchical pooling (HiPool) for WSED, which combines the advantages of max pooling in audio tagging and weighted average pooling in audio localization through a novel hierarchical structure and learns event-wise optimal pooling functions through continuous relaxation-based joint optimization. Extensive experiments on benchmark datasets show that HiPool outperforms the current pooling methods and greatly improves the performance of WSED. HiPool also has great generality - ready to be plugged into any WSED models.
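The interpolation between max pooling and weighted average pooling can be sketched as follows. This is an illustrative fragment assuming a per-event mixing weight `alpha`; it is not the HiPool implementation, which learns the event-wise pooling via continuous relaxation.

```python
import numpy as np

def hierarchical_pool(frame_probs, alpha):
    """Interpolate between max pooling (strong for clip-level tagging) and
    probability-weighted average pooling (strong for localization) over the
    time axis. alpha in [0, 1] plays the role of a learnable, per-event
    mixing weight (illustrative sketch)."""
    max_pool = frame_probs.max(axis=0)
    weights = frame_probs / (frame_probs.sum(axis=0, keepdims=True) + 1e-12)
    wavg_pool = (frame_probs * weights).sum(axis=0)
    return alpha * max_pool + (1.0 - alpha) * wavg_pool
```

With `alpha = 1` the clip-level score equals the peak frame probability; with `alpha = 0` it is the self-weighted average, which emphasizes frames in proportion to their own activation.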

Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation

  • Juze Zhang
  • Jingya Wang
  • Ye Shi
  • Fei Gao
  • Lan Xu
  • Jingyi Yu

Inter-person occlusion and depth ambiguity make estimating the camera-centric 3D poses of multiple persons from monocular images a challenging problem. Typical top-down frameworks suffer from high computational redundancy due to an additional detection stage. By contrast, bottom-up methods enjoy low computational costs, as they are less affected by the number of humans. However, most existing bottom-up methods treat camera-centric 3D human pose estimation as two unrelated subtasks: 2.5D pose estimation and camera-centric depth estimation. In this paper, we propose a unified model that leverages the mutual benefits of both subtasks. Within the framework, a robust structured 2.5D pose estimation is designed to recognize inter-person occlusion based on depth relationships. Additionally, we develop an end-to-end geometry-aware depth reasoning method that exploits the mutual benefits of both 2.5D poses and camera-centric root depths. This method first uses the 2.5D pose and geometry information to infer camera-centric root depths in a forward pass, and then exploits the root depths to further improve the representation learning of 2.5D pose estimation in a backward pass. Further, we design an adaptive fusion scheme that leverages both visual perception and body geometry to alleviate inherent depth ambiguity issues. Extensive experiments demonstrate the superiority of our proposed model over a wide range of bottom-up methods. Our accuracy is even competitive with top-down counterparts. Notably, our model runs much faster than existing bottom-up and top-down methods.

Learning Generalizable Latent Representations for Novel Degradations in Super-Resolution

  • Fengjun Li
  • Xin Feng
  • Fanglin Chen
  • Guangming Lu
  • Wenjie Pei

Typical methods for blind image super-resolution (SR) focus on dealing with unknown degradations by directly estimating them or by learning degradation representations in a latent space. A potential limitation of these methods is that they assume the unknown degradations can be simulated by the combination of various handcrafted degradations (e.g., bicubic downsampling), which is not necessarily true. Real-world degradations can lie beyond the simulation scope of handcrafted degradations; we refer to such degradations as novel degradations. In this work, we propose to learn a latent representation space for degradations that generalizes from handcrafted (base) degradations to novel degradations. Furthermore, we perform variational inference to match the posterior of degradations in the latent representation space with a prior distribution (e.g., a Gaussian distribution). Consequently, we are able to sample more high-quality representations for a novel degradation to augment the training data for the SR model. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness and advantages of our method for blind super-resolution with novel degradations.
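The variational step alluded to above is the standard evidence-lower-bound machinery; a minimal sketch with a Gaussian prior (the shapes and function names are assumptions, not the authors' implementation):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ): the term that pulls the
    # posterior over degradation representations toward the prior.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def sample_representation(mu, log_var, rng):
    # Reparameterized sample; drawing fresh samples near the prior is
    # what allows augmenting training data for unseen degradations.
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps
```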

Rethinking the Vulnerability of DNN Watermarking: Are Watermarks Robust against Naturalness-aware Perturbations?

  • Run Wang
  • Haoxuan Li
  • Lingzhou Mu
  • Jixing Ren
  • Shangwei Guo
  • Li Liu
  • Liming Fang
  • Jing Chen
  • Lina Wang

Training Deep Neural Networks (DNNs) is a time-consuming process and requires a large amount of training data, which motivates studies on protecting the intellectual property (IP) of DNN models with various watermarking techniques. Unfortunately, in recent years, adversaries have been exploiting the vulnerabilities of these watermarking techniques to remove the embedded watermarks. In this paper, we investigate and introduce a novel watermark removal attack, called AdvNP, against all four existing types of DNN watermarking schemes via input preprocessing by injecting <u>Adv</u>ersarial <u>N</u>aturalness-aware <u>P</u>erturbations. In contrast to prior studies, our proposed method is the first that generalizes well to all four existing watermarking schemes without involving any model modification, which preserves the fidelity of the target model. We conduct experiments against four state-of-the-art (SOTA) watermarking schemes on two real tasks (i.e., image classification on ImageNet and face recognition on CelebA) across multiple DNN models. Overall, our proposed AdvNP significantly invalidates the watermarks of the four watermarking schemes on two real-world datasets, with a 60.9% average attack success rate and up to 97% in the worst case. Moreover, AdvNP survives image denoising techniques well and outperforms the baseline in both fidelity preservation and watermark removal. Furthermore, we introduce two defense methods to enhance the robustness of DNN watermarking against AdvNP. Our experimental results pose real threats to existing watermarking schemes and call for more practical and robust watermarking techniques to protect the copyright of pre-trained DNN models. The source code and models are available online.

In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

  • Xiao Pan
  • Peike Li
  • Zongxin Yang
  • Huiling Zhou
  • Chang Zhou
  • Hongxia Yang
  • Jingren Zhou
  • Yi Yang

In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal since the pixel-level features are optimized only implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of training data and is not robust to object deformation. To perform these two levels of optimization complementarily in a unified framework, we propose In-aNd-Out (INO) generative learning from a purely generative perspective with the help of the naturally designed class tokens and patch tokens in the Vision Transformer (ViT). Specifically, for image-level optimization, we force the out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we name this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous state-of-the-art methods by significant margins.

Everything is There in Latent Space: Attribute Editing and Attribute Style Manipulation by StyleGAN Latent Space Exploration

  • Rishubh Parihar
  • Ankit Dhiman
  • Tejan Karmali
  • Venkatesh R

Unconstrained image generation with high realism is now possible using recent Generative Adversarial Networks (GANs). However, it is quite challenging to generate images with a given set of attributes. Recent methods use style-based GAN models to perform image editing by leveraging the semantic hierarchy present in the layers of the generator. We present Few-shot Latent-based Attribute Manipulation and Editing (FLAME), a simple yet effective framework for highly controlled image editing by latent space manipulation. Specifically, we estimate linear directions in the latent space (of a pre-trained StyleGAN) that control semantic attributes in the generated image. In contrast to previous methods that rely on large-scale attribute-labeled datasets or attribute classifiers, FLAME uses the minimal supervision of a few curated image pairs to estimate disentangled edit directions. FLAME can perform both individual and sequential edits with high precision on a diverse set of images while preserving identity. Further, we propose the novel task of Attribute Style Manipulation to generate diverse styles for attributes such as eyeglasses and hair. We first encode a set of synthetic images of the same identity but different attribute styles in the latent space to estimate an attribute style manifold. Sampling a new latent from this manifold results in a new attribute style in the generated image. We propose a novel sampling method to sample latents from the manifold, enabling us to generate a diverse set of attribute styles beyond those present in the training set. FLAME can generate diverse attribute styles in a disentangled manner. We illustrate the superior performance of FLAME against previous image editing methods through extensive qualitative and quantitative comparisons. FLAME generalizes well on out-of-distribution images from the art domain as well as on other datasets such as cars and churches.

An Image-to-video Model for Real-Time Video Enhancement

  • Dongyu She
  • Kun Xu

Recent years have witnessed the increasing popularity of learning-based methods for enhancing the color and tone of images. Although these methods achieve satisfying performance on static images, it is non-trivial to extend such image-to-image methods to handle videos: a straightforward extension easily leads to computational inefficiency or distracting flickering effects. In this paper, we propose a novel image-to-video model that enforces temporal stability for real-time video enhancement and is trained using only static images. Specifically, we first propose a lightweight image enhancer via learnable flexible 2-dimensional lookup tables (F2D LUTs), which can adaptively account for scenario information. To impose temporal consistency, we further propose to infer motion fields via a virtual camera motion engine, which can be utilized to stabilize the image-to-video model with a temporal consistency loss. Experimental results show that our image-to-video model not only achieves state-of-the-art performance on the image enhancement task, but also performs favorably against baselines on the video enhancement task. Our source code is available at
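A full 2D LUT is beyond a short sketch, but its core operation, a differentiable table lookup with linear interpolation, can be illustrated in one dimension (all names below are hypothetical, not the paper's F2D LUT implementation):

```python
import numpy as np

def apply_lut(image, lut):
    # image: float values in [0, 1]; lut: (N,) output levels at N
    # uniformly spaced input levels. Linear interpolation keeps the
    # mapping differentiable with respect to the LUT entries.
    n = len(lut)
    x = np.clip(image, 0.0, 1.0) * (n - 1)
    lo = np.floor(x).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = x - lo
    return lut[lo] * (1.0 - frac) + lut[hi] * frac
```

With an identity table the mapping leaves pixel values unchanged; learning the table entries then amounts to learning the tone curve.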

Learning an Inference-accelerated Network from a Pre-trained Model with Frequency-enhanced Feature Distillation

  • Xuesong Niu
  • Jili Gu
  • Guoxin Zhang
  • Pengfei Wan
  • Zhongyuan Wang

Convolutional neural networks (CNNs) have achieved great success in various computer vision tasks, but they still suffer from heavy computation costs, which mainly result from the substantial redundancy of the feature maps. To reduce this redundancy, we propose a simple but effective frequency-enhanced feature distillation strategy for training an inference-accelerated network from a pre-trained model. Traditionally, a CNN can be regarded as a hierarchical structure that generates low-level, middle-level, and high-level feature maps from different convolution layers. To accelerate the inference time of CNNs, in this paper we propose to resize the low-level and middle-level feature maps to smaller scales to reduce the spatial computation costs. A frequency-enhanced feature distillation training strategy with a pre-trained model is then used to help the inference-accelerated network maintain the core information after resizing the feature maps. To be specific, the original pre-trained network and the inference-accelerated network with resized feature maps are regarded as the teacher and student networks, respectively. Considering that the low-frequency domain of the feature maps contributes most to the final classification, we transform the feature maps of different levels into a frequency-enhanced feature space, which highlights the low-frequency features for both the teacher and student networks. The frequency-enhanced features are used to transfer knowledge from the teacher network to the student network. At the same time, knowledge for the final classification, i.e., the classification feature and the predicted probabilities, is also used for distillation. Experiments on multiple databases based on various network structure types, e.g., ResNet, Res2Net, MobileNetV2, and ConvNeXt, show that with the proposed frequency-enhanced feature distillation training strategy, our method obtains an inference-accelerated network with comparable performance and much less computation cost.
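One common way to isolate the low-frequency part of a feature map, of the kind the distillation strategy above emphasizes, is a DCT coefficient mask; this NumPy-only sketch (the `keep` parameter and function names are my assumptions) builds the orthonormal DCT-II basis explicitly:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix: row k is the k-th frequency.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def low_frequency_part(feat, keep=4):
    # feat: (H, W) feature map. Keep only the top-left keep x keep
    # DCT coefficients (the low frequencies) and reconstruct.
    Dh, Dw = dct_matrix(feat.shape[0]), dct_matrix(feat.shape[1])
    coeffs = Dh @ feat @ Dw.T
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return Dh.T @ (coeffs * mask) @ Dw
```

A constant map is pure DC, so it survives the mask unchanged, while keeping all coefficients reconstructs the input exactly.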

Exploring Feature Compensation and Cross-level Correlation for Infrared Small Target Detection

  • Mingjin Zhang
  • Ke Yue
  • Jing Zhang
  • Yunsong Li
  • Xinbo Gao

Single-frame infrared small target (SIRST) detection is useful for many practical applications, such as maritime rescue. However, SIRST detection is challenging due to the low contrast between small targets and the noisy background in infrared images. To address this challenge, we propose a novel FC3-Net that explores feature compensation and cross-level correlation for SIRST detection. Specifically, FC3-Net consists of a Fine-detail-guided Multi-level Feature Compensation (F-MFC) module and a Cross-level Feature Correlation (CFC) module. The F-MFC module aims to compensate for the information loss of details caused by the downsampling layers in convolutional neural networks (CNNs) by aggregating features from multiple adjacent levels, so that the detail features of small targets can be propagated to the deeper layers of the network. Besides, to suppress the side impact of background noise, the CFC module constructs an energy filtering kernel based on the higher-level features with less background noise to filter out the noise in the middle-level features, and fuses them with the low-level ones to learn a strong target representation. Putting them together in an encoder-decoder structure, our FC3-Net can produce an accurate target mask with fine shape and details. Experimental results on the public NUAA-SIRST and IRSTD-1k datasets demonstrate that the proposed FC3-Net outperforms state-of-the-art methods in terms of both pixel-level and object-level metrics. The code will be released at

Pixel Exclusion: Uncertainty-aware Boundary Discovery for Active Cross-Domain Semantic Segmentation

  • Fuming You
  • Jingjing Li
  • Zhi Chen
  • Lei Zhu

Unsupervised Domain Adaptation (UDA) has been shown to alleviate the heavy annotation burden of semantic segmentation. Recently, numerous self-training approaches have been proposed to address the challenging cross-domain semantic segmentation problem. However, there still exist two open issues: (1) the generated pseudo-labels are inevitably noisy without external supervision, and (2) there is a performance gap between UDA models and the fully-supervised model. In this paper, we propose to investigate Active Learning (AL), which selects a small portion of unlabeled pixels (or images) to be annotated and leads to an impressive performance gain. Specifically, we propose a novel Uncertainty-aware Boundary Discovery (UBD) strategy that selects uncertain pixels in boundary areas, which contain rich contextual information. Technically, we first select the pixels with the top entropy values, and then re-select the pixels that are exclusive to their neighbors. We leverage the Kullback-Leibler divergence between one pixel's softmax prediction and its neighbors' to measure its "exclusivity". Extensive experiments show that our approach outperforms previous methods under both pixel-level and image-level label acquisition protocols.
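The KL-based "exclusivity" measure described above can be sketched directly; this is a generic proxy over 4-connected neighbors (the array layout, neighborhood choice, and function names are my assumptions, not the paper's exact formulation):

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    # KL divergence along the class axis of two probability maps.
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def pixel_exclusivity(probs):
    # probs: (H, W, C) softmax predictions. For each pixel, average
    # the KL divergence to its 4-connected neighbors; high values
    # flag pixels whose prediction disagrees with local context.
    H, W, _ = probs.shape
    score = np.zeros((H, W))
    count = np.zeros((H, W))
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        py = slice(max(-dy, 0), H + min(-dy, 0))  # valid pixels
        px = slice(max(-dx, 0), W + min(-dx, 0))
        ny = slice(max(dy, 0), H + min(dy, 0))    # their neighbors
        nx = slice(max(dx, 0), W + min(dx, 0))
        score[py, px] += kl_div(probs[py, px], probs[ny, nx])
        count[py, px] += 1
    return score / count
```

On a uniformly predicted map the score is zero everywhere; a single dissenting pixel gets a strictly positive score.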

Deep Flexible Structure Preserving Image Smoothing

  • Mingjia Li
  • Yuanbin Fu
  • Xinhui Li
  • Xiaojie Guo

Structure-preserving image smoothing is fundamental to numerous multimedia, computer vision, and graphics tasks. This paper develops a deep network designed for flexible control, structure preservation in smoothing, and efficiency. Following the principle of divide-and-rule, we decouple the original problem into two specific functionalities, i.e., controllable guidance prediction and image smoothing conditioned on the predicted guidance. Concretely, for flexibly adjusting the strength of smoothness, we customize a two-branch module equipped with a sluice mechanism, which enables altering the strength during inference within a fixed range from 0 (fully smoothed) to 1 (non-smoothed). Moreover, we build a UNet-in-UNet structure with carefully designed loss terms to seek visually pleasant smoothing results without paired training data. As a consequence, our method can produce promising smoothing results with structures well preserved at arbitrary levels through a compact model with 0.6M parameters, making it attractive for practical use. Quantitative and qualitative experiments are provided to reveal the efficacy of our design and demonstrate its superiority over other competitors. The code can be found at
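In its simplest conceivable form, a continuously adjustable strength amounts to interpolating between the input and a fully smoothed result; this linear blend is only my illustration of the 0-to-1 range above, not the paper's sluice mechanism:

```python
import numpy as np

def blend(image, smoothed, strength):
    # strength in [0, 1]: 0 returns the fully smoothed output,
    # 1 returns the input untouched, matching the abstract's range.
    s = float(np.clip(strength, 0.0, 1.0))
    return s * image + (1.0 - s) * smoothed
```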

Defending Physical Adversarial Attack on Object Detection via Adversarial Patch-Feature Energy

  • Taeheon Kim
  • Youngjoon Yu
  • Yong Man Ro

Object detection plays an important role in security-critical systems such as autonomous vehicles but has been shown to be vulnerable to adversarial patch attacks. Existing defense methods are restricted to localized noise patches, which they counter by removing noisy regions from the input image. However, adversarial patches have evolved into natural-looking patterns that evade existing defenses. To address this issue, we propose a defense method based on a novel concept, "Adversarial Patch-Feature Energy" (APE), which exploits common deep feature characteristics of an adversarial patch. Our proposed defense consists of APE-masking and APE-refinement, which can be employed to defend against any adversarial patch in the literature. Extensive experiments demonstrate that the APE-based defense achieves impressive robustness against adversarial patches both in the digital space and in the physical world.

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

  • Shankhanil Mitra
  • Rajiv Soundararajan

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores. Code will be made available at
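Capturing the common information between two views, as described above, is typically done with an InfoNCE-style contrastive objective; a minimal NumPy sketch follows (the batch layout, temperature, and function name are my assumptions, not the authors' exact loss):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    # z1, z2: (N, D) embeddings of two views of the same N clips
    # (e.g., frames vs. frame differences). Matching rows are
    # positives; the other rows in the batch serve as negatives.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Aligned view pairs yield a low loss; shuffling one view's rows breaks the pairing and drives the loss up.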

Compound Batch Normalization for Long-tailed Image Classification

  • Lechao Cheng
  • Chaowei Fang
  • Dingwen Zhang
  • Guanbin Li
  • Gang Huang

Significant progress has been made in learning image classification neural networks under long-tail data distributions using robust training algorithms such as data re-sampling, re-weighting, and margin adjustment. These methods, however, ignore the impact of data imbalance on feature normalization. The dominance of majority classes (head classes) in estimating statistics and affine parameters causes the internal covariate shifts within less-frequent categories to be overlooked. To address this challenge, we propose a compound batch normalization method based on a Gaussian mixture, which can model the feature space more comprehensively and reduce the dominance of head classes. In addition, a moving-average-based expectation maximization (EM) algorithm is employed to estimate the statistical parameters of multiple Gaussian distributions. However, the EM algorithm is sensitive to initialization and can easily become stuck in local minima where the multiple Gaussian components continue to focus on majority classes. To tackle this issue, we develop a dual-path learning framework that employs class-aware split feature normalization to diversify the estimated Gaussian distributions, allowing the Gaussian components to fit training samples of less-frequent classes more comprehensively. Extensive experiments on commonly used datasets demonstrate that the proposed method outperforms existing methods on long-tailed image classification.

Alleviating Style Sensitivity then Adapting: Source-free Domain Adaptation for Medical Image Segmentation

  • Yalan Ye
  • Ziqi Liu
  • Yangwuyong Zhang
  • Jingjing Li
  • Hengtao Shen

Recently, source-free domain adaptation (SFDA) has attracted extensive attention in medical image segmentation due to its ability to transfer knowledge without accessing source data. However, existing SFDA methods suffer from severe performance degradation since the style of the target data shifts from that of the source. Although traditional unsupervised domain adaptation (UDA) methods are capable of addressing the style-shift issue using data from both domains, they fail to extract the source style due to the lack of source data in source-free scenarios. In this paper, we propose a novel style-insensitive source-free domain adaptation framework (SI-SFDA) for medical image segmentation to reduce the impact of style shifts. The proposed framework first pretrains a generalized source model and then adapts the source model in a source-data-free manner. For the former, a cross-patch style generalization (CPSG) mechanism is introduced to reduce the style sensitivity of the source model via a self-training paradigm with a Transformer structure. For the latter, an adaptive confidence regularization (ACR) loss with a dynamic scaling strategy is developed to further reduce the classification confusion caused by style shifts. The proposed ACR loss is model-independent, so it can be used with other methods to improve segmentation performance. Extensive experiments are conducted on five public medical image benchmarks; the promising performance on organ and fundus segmentation tasks demonstrates the effectiveness of our framework.

Multimedia Event Extraction From News With a Unified Contrastive Learning Framework

  • Jian Liu
  • Yufeng Chen
  • Jinan Xu

Extracting events from news has many benefits in downstream applications. Today's event extraction (EE) systems, however, usually focus on a single modality --- either text or images --- and such methods suffer from incomplete information because a news document is typically presented in a multimedia format. In this paper, we propose a new method for multimedia EE that bridges the textual and visual modalities with a unified contrastive learning framework. Our central idea is to create a shared space for texts and images in order to align their representations. This is accomplished by training on text-image pairs, and we demonstrate that this framework can boost learning for one modality by exploiting the complementary information of the other modality. On the benchmark dataset, our approach establishes a new state-of-the-art performance with a 3 percent improvement in F1. Furthermore, we demonstrate that it achieves cutting-edge performance for visual EE even in a zero-shot scenario with no annotated data in the visual modality.

DomainPlus: Cross Transform Domain Learning towards High Dynamic Range Imaging

  • Bolun Zheng
  • Xiaokai Pan
  • Hua Zhang
  • Xiaofei Zhou
  • Gregory Slabaugh
  • Chenggang Yan
  • Shanxin Yuan

High dynamic range (HDR) imaging by combining multiple low dynamic range (LDR) images of different exposures provides a promising way to produce high quality photographs. However, the misalignment between the input images leads to ghosting artifacts in the reconstructed HDR image. In this paper, we propose a cross-transform domain neural network for efficient HDR imaging. Our approach consists of two modules: a merging module and a restoration module. For the merging module, we propose a Multiscale Attention with Fronted Fusion (MAFF) mechanism to achieve coarse-to-fine spatial fusion. For the restoration module, we propose fronted Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT)-based learnable bandpass filters to formulate a cross-transform domain learning block, dubbed DomainPlus Block (DPB) for effective ghosting removal. Our ablation study and comprehensive experiments show that DomainPlus outperforms the existing state-of-the-art on several datasets.

Tracking Game: Self-adaptative Agent based Multi-object Tracking

  • Shuai Wang
  • Da Yang
  • Yubin Wu
  • Yang Liu
  • Hao Sheng

Multi-object tracking (MOT) has become a hot task in multimedia analysis. It not only locates objects but also maintains their unique identities. However, previous methods encounter tracking failures in complex scenes, since they lose most of the unique attributes of each target. In this paper, we formulate the MOT problem as a Tracking Game and propose a Self-adaptative Agent Tracker (SAT) framework to solve it. The roles in the Tracking Game are divided into two classes: the agent players and the game organizer. The organizer controls the game and optimizes the agents' actions from a global perspective. Each agent encodes the attributes of a target and selects actions dynamically. For these purposes, we design a State Transition Net to update the agent state and an Action Decision Net to implement a flexible tracking strategy for each agent. Finally, we present an organizer-agent coordination tracking algorithm to leverage both global and individual information. Experiments show that the proposed SAT achieves state-of-the-art performance on both the MOT17 and MOT20 benchmarks.

Self-Supervised Text Erasing with Controllable Image Synthesis

  • Gangwei Jiang
  • Shiyao Wang
  • Tiezheng Ge
  • Yuning Jiang
  • Ying Wei
  • Defu Lian

Recent efforts on text erasing have shown promising results. However, existing methods require rich yet costly label annotations to obtain robust models, which limits their use in practical applications. To this end, we study an unsupervised scenario by proposing a novel Self-supervised Text Erasing (STE) framework that jointly learns to synthesize training images with erasure ground truth and to accurately erase texts in the real world. We first design a style-aware image synthesis function to generate synthetic images with diversely styled texts based on two synthesis mechanisms. To bridge the text style gap between the synthetic and real-world data, a policy network is constructed to control the synthesis mechanisms by picking style parameters with the guidance of two specifically designed rewards. The synthetic training images with ground truth are then fed to train a coarse-to-fine erasing network. To produce better erasing outputs, a triplet erasure loss is designed to enforce the refinement stage to recover background textures. Moreover, we provide a new dataset (called PosterErase), which contains 60K high-resolution posters and is more challenging for the erasing task. The proposed method has been extensively evaluated on both PosterErase and the widely used SCUT-EnsText dataset. Notably, on PosterErase, our method achieves an FID of 5.07, a relative improvement of 20.9% over existing supervised baselines.

Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold

  • Zijie Wang
  • Aichun Zhu
  • Jingyi Xue
  • Xili Wan
  • Chao Liu
  • Tian Wang
  • Yifeng Li

The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches contrive to learn a latent common manifold mapping paradigm following a cross-modal distribution consensus prediction (CDCP) manner. When mapping features from the distribution of one modality into the common manifold, the feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus so as to embed and align the multi-modal features in a constructed cross-modal common manifold depends entirely on the experience of the model itself, instead of the actual situation. With such methods, it is inevitable that the multi-modal data cannot be well aligned in the common manifold, which finally leads to sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C3M) for text-based person retrieval. The core idea of our method, as a Chinese saying goes, is 'san si er hou xing', namely, to Look Before yoU Leap (LBUL). The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers the distribution characteristics of both the visual and textual modalities before embedding data from one modality into C3M, achieving a more solid cross-modal distribution consensus and hence a superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves state-of-the-art performance.

The More, The Better? Active Silencing of Non-Positive Transfer for Efficient Multi-Domain Few-Shot Classification

  • Xingxing Zhang
  • Zhizhe Liu
  • Weikai Yang
  • Liyuan Wang
  • Jun Zhu

Few-shot classification refers to recognizing several novel classes given only a few labeled samples. Many recent methods try to gain an adaptation benefit by learning prior knowledge from more base training domains, a.k.a. multi-domain few-shot classification. However, with extensive empirical evidence, we find that more is not always better: current models do not necessarily benefit from pre-training on more base classes and domains, since the pre-trained knowledge might be non-positive for a downstream task. In this work, we hypothesize that such redundant pre-training can be avoided without compromising downstream performance. Inspired by the selective activating/silencing mechanism in the biological memory system, which enables the brain to learn a new concept from a few experiences both quickly and accurately, we propose to actively silence redundant base classes and domains for efficient multi-domain few-shot classification. We then develop a novel data-driven approach named Active Silencing with hierarchical Subset Selection (AS3) to address two problems: 1) finding a subset of base classes that adequately represents novel classes for efficient positive transfer; and 2) finding a subset of base learners (i.e., domains) with confidently accurate predictions in a new domain. Both problems are formulated as distance-based sparse subset selection. We extensively evaluate AS3 on the recent META-DATASET benchmark as well as MNIST, CIFAR10, and CIFAR100, where AS3 achieves over 100% acceleration while maintaining or even improving accuracy. Our code and Appendix are available at
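Distance-based subset selection of the kind AS3 formulates can be sketched with a greedy facility-location heuristic; this is a generic stand-in, not the authors' optimizer (the prototype layout, coverage objective, and names are my assumptions):

```python
import numpy as np

def select_base_subset(base, novel, k):
    # base: (B, D) base-class prototypes; novel: (M, D) novel-class
    # prototypes. Greedily add the base prototype that most reduces
    # each novel class's distance to its nearest selected prototype.
    dists = np.linalg.norm(novel[:, None] - base[None], axis=-1)  # (M, B)
    best = np.full(len(novel), 1e9)  # current nearest-pick distance
    chosen = []
    for _ in range(k):
        gains = np.maximum(best[:, None] - dists, 0.0).sum(axis=0)
        i = int(np.argmax(gains))
        chosen.append(i)
        best = np.minimum(best, dists[:, i])
    return chosen
```

When the base set contains exact matches for the novel prototypes, the greedy picks recover exactly those matches.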

Hierarchical Few-Shot Object Detection: Problem, Benchmark and Method

  • Lu Zhang
  • Yang Wang
  • Jiaogen Zhou
  • Chenbo Zhang
  • Yinglu Zhang
  • Jihong Guan
  • Yatao Bian
  • Shuigeng Zhou

Few-shot object detection (FSOD) aims to detect objects given only a few examples. However, existing FSOD methods do not consider the hierarchical fine-grained category structures of objects that exist widely in real life. For example, animals are taxonomically classified into orders, families, genera, species, etc. In this paper, we propose and solve a new problem called hierarchical few-shot object detection (Hi-FSOD), which aims to detect objects with hierarchical categories in the FSOD paradigm. To this end, on the one hand, we build the first large-scale and high-quality Hi-FSOD benchmark dataset HiFSOD-Bird, which contains 176,350 wild-bird images falling into 1,432 categories. All the categories are organized into a 4-level taxonomy, consisting of 32 orders, 132 families, 572 genera and 1,432 species. On the other hand, we propose the first Hi-FSOD method HiCLPL, where a hierarchical contrastive learning approach is developed to constrain the feature space so that the feature distribution of objects is consistent with the hierarchical taxonomy and the model's generalization power is strengthened. Meanwhile, a probabilistic loss is designed to enable the child nodes to correct the classification errors of their parent nodes in the taxonomy. Extensive experiments on the benchmark dataset HiFSOD-Bird show that our method HiCLPL outperforms existing FSOD methods.

Few-shot X-ray Prohibited Item Detection: A Benchmark and Weak-feature Enhancement Network

  • Renshuai Tao
  • Tianbo Wang
  • Ziyang Wu
  • Cong Liu
  • Aishan Liu
  • Xianglong Liu

X-ray prohibited item detection in security inspection plays an important role in protecting public safety. It is a typical few-shot object detection (FSOD) task because some categories of prohibited items, e.g., pistols, are highly scarce due to their low-frequency appearance, a fact that has been ignored by recent X-ray detection works. In contrast to most FSOD studies that rely on rich feature correlations from natural scenarios, the more practical X-ray security inspection setting usually offers only weak learnable features due to heavy occlusion, color fading, etc., which causes a severe performance drop when traditional FSOD methods are adopted. However, professional X-ray FSOD evaluation benchmarks and effective models for this scenario have rarely been studied in recent years. Therefore, in this paper, we propose the first X-ray FSOD dataset for the typical industrial X-ray security inspection scenario, consisting of 12,333 images and 41,704 instances from 20 categories, which could benchmark and promote FSOD studies in such more challenging scenarios. Further, we propose the Weak-feature Enhancement Network (WEN) containing two core modules, i.e., Prototype Perception (PR) and Feature Reconciliation (FR): PR first generates a prototype library by aggregating and extracting the basis feature from critical regions around instances, yielding the basis information for each category; FR then adaptively adjusts the impact intensity of the corresponding prototype and forces the model to precisely enhance the weak features of specific objects through the basis information. This mechanism is also effective in traditional FSOD tasks. Extensive experiments on the X-ray FSOD and Pascal VOC datasets demonstrate that WEN outperforms other baselines in both X-ray and common scenarios.
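The prototype-library idea above can be sketched very roughly: aggregate per-category features into basis vectors, then mix a weak feature with its most similar prototype. This is our own toy rendering (the averaging rule, the fixed mixing weight `alpha`, and all names are assumptions, not WEN's actual modules):

```python
import numpy as np

def build_prototypes(feats, labels, num_classes):
    """Prototype library (sketch): aggregate instance features of each
    category into one basis vector by averaging."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def reconcile(feat, prototypes, alpha=0.5):
    """Feature reconciliation (sketch): mix a weak feature with its most
    similar prototype; the fixed alpha stands in for the adaptively
    learned impact intensity."""
    proto = prototypes[np.argmax(prototypes @ feat)]
    return (1 - alpha) * feat + alpha * proto

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))   # instance features
labels = np.arange(20) % 4         # 4 categories
library = build_prototypes(feats, labels, num_classes=4)
enhanced = reconcile(feats[0], library)
```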

High-Fidelity Variable-Rate Image Compression via Invertible Activation Transformation

  • Shilv Cai
  • Zhijun Zhang
  • Liqun Chen
  • Luxin Yan
  • Sheng Zhong
  • Xu Zou

Learning-based methods have effectively advanced the community of image compression. Meanwhile, variational autoencoder (VAE) based variable-rate approaches have recently gained much attention, as they avoid using a set of different networks for various compression rates. Despite the remarkable performance that has been achieved, these approaches are readily corrupted once multiple compression/decompression operations are executed, so that image quality drops tremendously and strong artifacts appear. Thus, we tackle the issue of high-fidelity fine variable-rate image compression and propose the Invertible Activation Transformation (IAT) module. We implement IAT in a mathematically invertible manner on a single-rate Invertible Neural Network (INN) based model, and the quality level (QLevel) is fed into IAT to generate scaling and bias tensors. IAT and QLevel together give the image compression model the ability of fine variable-rate control while better maintaining image fidelity. Extensive experiments demonstrate that a single-rate image compression model equipped with our IAT module achieves variable-rate control without any compromise, and our IAT-embedded model obtains rate-distortion performance comparable to recent learning-based image compression methods. Furthermore, our method outperforms the state-of-the-art variable-rate image compression method by a large margin, especially after multiple re-encodings.
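The core mechanism described, a quality level mapped to scaling and bias tensors through an invertible transformation, can be illustrated with a toy affine modulation. The class name and parameterization below are our assumptions, not the paper's exact design; it only shows why such an affine map is exactly invertible when the scale stays non-zero:

```python
import numpy as np

rng = np.random.default_rng(1)

class QLevelModulation:
    """Sketch of quality-level-conditioned affine modulation: a quality
    level q maps to per-channel scale and bias, and the transformation
    is invertible because the scale stays non-zero."""

    def __init__(self, channels):
        self.w_scale = rng.normal(scale=0.1, size=channels)
        self.w_bias = rng.normal(scale=0.1, size=channels)

    def forward(self, feat, q):
        scale = 1.0 + q * self.w_scale      # identity when q = 0
        bias = q * self.w_bias
        return feat * scale[None, :] + bias[None, :]

    def inverse(self, out, q):
        scale = 1.0 + q * self.w_scale
        bias = q * self.w_bias
        return (out - bias[None, :]) / scale[None, :]

mod = QLevelModulation(channels=8)
feat = rng.normal(size=(4, 8))             # 4 positions x 8 channels
restored = mod.inverse(mod.forward(feat, q=0.7), q=0.7)
```

The round trip recovers the features exactly (up to floating point), which is the property that protects repeated re-encodings from accumulating errors.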

Cycle Encoding of a StyleGAN Encoder for Improved Reconstruction and Editability

  • Xudong Mao
  • Liujuan Cao
  • Aurele Tohokantche Gnanha
  • Zhenguo Yang
  • Qing Li
  • Rongrong Ji

GAN inversion aims to invert an input image into the latent space of a pre-trained GAN. Despite the recent advances in GAN inversion, there remain challenges in mitigating the tradeoff between distortion and editability, i.e., reconstructing the input image accurately and editing the inverted image with a small visual quality drop. The recently proposed pivotal tuning model makes significant progress towards reconstruction and editability by using a two-step approach that first inverts the input image into a latent code, called the pivot code, and then alters the generator so that the input image can be accurately mapped to the pivot code. Here, we show that both reconstruction and editability can be improved by a proper design of the pivot code. We present a simple yet effective method, named cycle encoding, for obtaining a high-quality pivot code. The key idea of our method is to progressively train an encoder in varying spaces according to a cycle scheme: W->W+->W. This training methodology preserves the properties of both the W and W+ spaces, i.e., the high editability of W and the low distortion of W+. To further decrease the distortion, we also propose to refine the pivot code with an optimization-based method, where a regularization term is introduced to reduce the degradation in editability. Qualitative and quantitative comparisons to several state-of-the-art methods demonstrate the superiority of our approach.
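The optimization-based refinement with an editability-preserving regularizer has the familiar shape "reconstruction loss plus proximity to the pivot". A toy sketch with a linear stand-in generator (the linear map, step sizes, and weight `lam` are our assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the generator: a fixed linear map G(w) = A @ w.
A = rng.normal(size=(16, 4))
target = rng.normal(size=16)     # "input image" to invert
w_pivot = rng.normal(size=4)     # pivot code produced by the encoder

def refine_pivot(w0, steps=500, lr=0.01, lam=0.1):
    """Regularized pivot refinement (sketch): minimize
    ||G(w) - x||^2 + lam * ||w - w0||^2 by gradient descent, so the
    code lowers distortion while staying near the editable pivot."""
    w = w0.copy()
    for _ in range(steps):
        grad = 2 * A.T @ (A @ w - target) + 2 * lam * (w - w0)
        w -= lr * grad
    return w

w_refined = refine_pivot(w_pivot)
```

Without the `lam` term the code would drift freely toward the minimum-distortion solution; the regularizer trades a little distortion for staying in the well-behaved neighborhood of the pivot.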

Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging

  • Yeqi BAI
  • Tao Ma
  • Lipo Wang
  • Zhenjie Zhang

While deep learning technologies are now capable of generating realistic images that confuse humans, research efforts are turning to the synthesis of images for more concrete and application-specific purposes. Facial image generation based on vocal characteristics from speech is one such important yet challenging task. It is a key enabler for influential use cases of image generation, especially for business in public security and entertainment. Existing solutions to the speech2face problem render limited image quality and fail to preserve facial similarity, due to the lack of quality datasets for training and of appropriate integration of vocal features. In this paper, we investigate these key technical challenges and propose Speech Fusion to Face, or SF2F in short, to address the issues of facial image quality and the poor connection between the vocal feature domain and modern image generation models. By adopting new strategies for the data model and training, we demonstrate a dramatic performance boost over the state-of-the-art solution, doubling the recall of individual identity and lifting the quality score from 15 to 19 based on the mutual information score with the VGGFace classifier.

Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

  • Wei Li
  • Tianzhao Yang
  • Xiao Wu
  • Xian-Jun Du
  • Jian-Jun Qiao

Learning spatial and temporal relations among people plays an important role in recognizing group activity. Recently, transformer-based methods have become popular solutions thanks to the self-attention mechanism. However, the person-level features are fed directly into the self-attention module without any refinement. Moreover, group activity in a clip often involves unbalanced spatio-temporal interactions, where only a few persons with special actions are critical to identifying different activities. It is difficult to learn these spatio-temporal interactions without elaborately modeling the action dependencies among all people. In this paper, a novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture the interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies. Specifically, ASTFormer starts by assigning all persons in each frame to latent actions, while an action-centric aggregation strategy is performed by weighting the sum of residuals for each latent action under the supervision of global action information. Then, a dual-branch transformer is proposed to refine the inter- and intra-frame action-level features, where two encoders with the self-attention mechanism are employed to select important tokens. Next, a semantic action graph is explicitly devised to model the dynamic action-wise dependencies. Finally, our model is capable of boosting group activity recognition by fusing these important cues, while only requiring video-level action labels. Extensive experiments on two popular benchmarks (Volleyball and Collective Activity) demonstrate the superior performance of our method in comparison with state-of-the-art methods using only raw RGB frames as input.
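"Weighting the sum of residuals for each latent action" is reminiscent of NetVLAD-style aggregation. A minimal sketch of that pattern (our simplification with dot-product soft assignments, not ASTFormer's exact module):

```python
import numpy as np

def action_centric_aggregate(person_feats, action_centers):
    """Sketch of action-centric aggregation: softly assign each person
    feature to the latent actions, then sum the assignment-weighted
    residuals per action (NetVLAD-style)."""
    logits = person_feats @ action_centers.T               # (P, K)
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)            # soft assignment
    # residual of each person to each action center: (P, K, D)
    resid = person_feats[:, None, :] - action_centers[None, :, :]
    return (assign[:, :, None] * resid).sum(axis=0)        # (K, D)

rng = np.random.default_rng(0)
persons = rng.normal(size=(6, 8))   # 6 person-level features
centers = rng.normal(size=(3, 8))   # 3 latent action centers
descriptors = action_centric_aggregate(persons, centers)
```

The output is one descriptor per latent action, which downstream modules (here, the dual-branch transformer) can then refine.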

A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

  • Yangyang Guo
  • Liqiang Nie
  • Yongkang Wong
  • Yibing Liu
  • Zhiyong Cheng
  • Mohan Kankanhalli

Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Despite its significance, this paper identifies several leading factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat the knowledge as a complement to a coarsely trained VQA model. Despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, multi-modal implicit knowledge for knowledge-based VQA remains largely unexplored. This work presents a unified end-to-end retriever-reader framework for knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential for knowledge reasoning. As for the noise problem encountered by the retrieval operation on explicit knowledge, we design a novel scheme to create pseudo labels for effective knowledge supervision. This scheme is able not only to provide guidance for knowledge retrieval, but also to drop instances that are potentially error-prone for question answering. To validate the effectiveness of the proposed method, we conduct extensive experiments on the benchmark dataset. The experimental results reveal that our method outperforms existing baselines by a noticeable margin. Beyond the reported numbers, this paper further offers several insights on knowledge utilization for future research, backed by empirical findings.
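One common way such pseudo labels for knowledge supervision are created is to mark a retrieved fact positive when it supports the gold answer. The string-matching rule below is only a cheap illustrative proxy, not the paper's actual scheme:

```python
def pseudo_label_knowledge(answer, retrieved_facts):
    """Hypothetical pseudo-labeling sketch: a retrieved fact is labeled
    positive when it contains the gold answer string, a crude proxy
    for 'this fact supports answering the question'."""
    answer = answer.lower()
    return [1 if answer in fact.lower() else 0 for fact in retrieved_facts]

facts = [
    "Bananas are rich in potassium.",
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
]
labels = pseudo_label_knowledge("Paris", facts)
# instances for which no fact is labeled positive could then be
# dropped as potentially error-prone for question answering
```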

PIA: Parallel Architecture with Illumination Allocator for Joint Enhancement and Detection in Low-Light

  • Tengyu Ma
  • Long Ma
  • Xin Fan
  • Zhongxuan Luo
  • Risheng Liu

Visual perception in low-light conditions (e.g., nighttime) plays an important role in various multimedia-related applications (e.g., autonomous driving). Enhancement (providing a visually friendly appearance) and detection (detecting instances of objects) in low light are two fundamental and crucial visual perception tasks. In this paper, we address how to simultaneously realize low-light enhancement and detection from two aspects. First, we define a parallel architecture to satisfy the demands of both tasks, in which a decomposition-type warm-start acting on the entrance of the parallel architecture is developed to narrow down the adverse effects brought by low-light scenes to some extent. Second, a novel illumination allocator is designed by encoding the key illumination component (the inherent difference between normal light and low light) to extract hierarchical features for assisting enhancement and detection. Further, we provide a substantive discussion of our proposed method: we solve enhancement in a coarse-to-fine manner and handle detection in a decomposed-to-integrated fashion. Finally, multidimensional analytical and evaluation experiments are performed to demonstrate our effectiveness and superiority. The code is available at

Robust Actor Recognition in Entertainment Multimedia at Scale

  • Abhinav Aggarwal
  • Yash Pandya
  • Lokesh A. Ravindranathan
  • Laxmi S. Ahire
  • Manivel Sethu
  • Kaustav Nandy

Actor identification and localization in movies and TV series seasons can enable deeper engagement with the content. Manual actor identification and tagging at every time instance in a video is error-prone, as it is a highly repetitive, decision-intensive and time-consuming task. The goal of this paper is to accurately label as many faces as possible in the video with actor names. We solve this problem using a multi-step clustering process followed by a selection of face instances that are: (a) representative of their member clusters and (b) aesthetically pleasing for visual identification. These face instances can be matched with actor names by automated or manual techniques to complete actor tagging. The solution is further optimized for seasons with repeating cast members, which constitute the majority of entertainment multimedia content. In such titles, the face labels from previous episodes are efficiently used to pre-label faces in subsequent episodes. We guarantee the same level of accuracy even after scaling the solution to TV series seasons. This novel solution works in a completely realistic setup where the input to the solution is just the raw video. This is the first known work that has proved its robustness on more than 5,000 TV episodes and movies across different genres, languages and runtimes, with actors of diverse ethnicity, race, gender identity, age, etc. The proposed solution establishes a new state of the art for cluster purity in both movies and TV series seasons by achieving near-perfect cluster homogeneity.

MF-Net: A Novel Few-shot Stylized Multilingual Font Generation Method

  • Yufan Zhang
  • Junkai Man
  • Peng Sun

Creating a complete stylized font library that helps the audience perceive information from text often requires years of study and proficiency in the use of many professional tools. Accordingly, automatic stylized font generation in a deep learning-based fashion is a desirable but challenging task that has attracted much attention in recent years. This paper revisits the state-of-the-art methods for stylized font generation and presents a taxonomy of deep learning-based stylized font generation. Despite the notable performance of existing models, stylized multilingual font generation, the task of applying a specific font style to diverse characters in multiple languages, has not yet been addressed. An efficient and economical method for stylized multilingual font generation is essential in numerous application scenarios that require communication with international audiences. We propose a solution for few-shot multilingual stylized font generation via a fast feed-forward network, the Multilingual Font Generation Network (MF-Net), which can transfer previously unseen font styles from a few samples to characters of previously unseen languages. Following the Generative Adversarial Network (GAN) framework, MF-Net adopts two separate encoders in the generator to decouple a font image's content and style information. We adopt an attention module in the style encoder to extract both shallow and deep style features. Moreover, we design a novel language complexity-aware skip connection to adaptively adjust the structural information to be preserved. With an effective loss function to improve the visual quality of the generated font images, we show the effectiveness of the proposed MF-Net through quantitative and subjective visual evaluation, and compare it with existing models in the scenario of stylized multilingual font generation. The source code is available on

Feature and Semantic Views Consensus Hashing for Image Set Classification

  • Yuan Sun
  • Dezhong Peng
  • Haixiao Huang
  • Zhenwen Ren

Image set classification (ISC) has long been an active topic, primarily because an image set can provide more comprehensive information to describe a subject. However, existing ISC methods face two problems: (1) their high computational cost prohibits them from being applied to medium- or large-scale applications; (2) the consensus information between the feature and semantic representations of an image set is largely ignored. To overcome these issues, in this paper, we propose a novel ISC method, termed Feature and Semantic Views Consensus Hashing (FSVCH). Specifically, a kernelized bipartite graph is constructed to capture the nonlinear structure of the data, and then two-view (i.e., feature and semantic) consensus hashing learning (TCHL) is proposed to obtain shared hidden consensus information. Meanwhile, for robust out-of-sample prediction, we further propose TCHL-guided optimal hash function inversion (TGHI) to learn a high-quality general hash function. Afterwards, hashing rotating (HR) is employed to better approximate the real-valued hash solution. A large number of experiments show that FSVCH remarkably outperforms comparison methods on three benchmark datasets in terms of running time and classification performance. Experimental results also indicate that FSVCH can scale to medium- or large-scale ISC tasks.

Evidential Reasoning for Video Anomaly Detection

  • Che Sun
  • Yunde Jia
  • Yuwei Wu

Video anomaly detection aims to discriminate events that deviate from normal patterns in a video. Modeling the decision boundaries of anomalies is challenging due to the uncertainty in the probability of deviating from normal patterns. In this paper, we propose a deep evidential reasoning method that explicitly learns this uncertainty to model the boundaries. Our method encodes various visual cues as evidence representing potential deviations, assigns beliefs to the predicted probability of deviating from normal patterns based on the evidence, and estimates the uncertainty from the remaining beliefs to model the boundaries. To do this, we build a deep evidential reasoning network that encodes evidence vectors and estimates uncertainty by learning evidence distributions and deriving beliefs from those distributions. We introduce an unsupervised strategy to train our network by minimizing an energy function of a deep Gaussian mixture model (GMM). Experimental results show that our uncertainty score is beneficial for modeling the boundaries of video anomalies on three benchmark datasets.
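The evidence-to-belief-to-uncertainty mapping is standard in evidential deep learning (subjective logic); we show that standard formulation as a sketch, which may differ in detail from the paper's network:

```python
import numpy as np

def belief_and_uncertainty(evidence):
    """Subjective-logic style mapping: non-negative evidence e_k over K
    classes gives beliefs b_k = e_k / S and uncertainty u = K / S with
    S = sum_k (e_k + 1), so that sum(b) + u = 1."""
    evidence = np.asarray(evidence, dtype=float)
    strength = np.sum(evidence + 1.0)
    return evidence / strength, evidence.size / strength

# little evidence -> high uncertainty; strong evidence -> low uncertainty
b_weak, u_weak = belief_and_uncertainty([0.1, 0.2])
b_strong, u_strong = belief_and_uncertainty([40.0, 2.0])
```

Uncertainty here is exactly the belief mass left over after the evidence is accounted for, which is what makes it usable as a boundary score.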

Gaze- and Spacing-flow Unveil Intentions: Hidden Follower Discovery

  • Danni Xu
  • Ruimin Hu
  • Zheng Wang
  • Linbo Luo
  • Dengshi Li
  • Wenjun Zeng

We raise a new and challenging multimedia application for video surveillance systems, i.e., Hidden Follower Discovery (HFD). In contrast to common abnormal behaviors, which are detected while occurring, hidden following is not an ongoing activity but a preparatory action. Hidden following behavior lacks salient features, making it hard to discover. Fortunately, from a socio-cognitive perspective, we found and verified the phenomenon that the gaze-flow and spacing-flow patterns of hidden and normal followers differ. To promote HFD research, we construct two pioneering datasets and devise an HFD baseline network based on the recognition of both gaze-flow and spacing-flow patterns from surveillance videos. Extensive experiments demonstrate their effectiveness.

Semi-supervised Learning for Multi-label Video Action Detection

  • Hongcheng Zhang
  • Xu Zhao
  • Dongqi Wang

Semi-supervised multi-label video action detection aims to locate all persons and recognize their multiple action labels by leveraging both labeled and unlabeled videos. Compared to the single-label scenario, semi-supervised learning in multi-label video action detection is more challenging due to two significant issues: the generation of multiple pseudo labels and the class-imbalanced data distribution. In this paper, we propose an effective semi-supervised learning method to tackle these challenges. Firstly, to make full use of the informative unlabeled data for better training, we design an effective multiple pseudo labeling strategy that sets a dynamic learnable threshold for each class. Secondly, to handle the long-tailed distribution of each class, we propose an unlabeled class-balancing strategy. We select training samples according to the multiple pseudo labels generated during the training iteration, instead of the usual data re-sampling that requires label information before training. Balanced re-weighting is then leveraged to mitigate the class imbalance caused by multi-label co-occurrence. Extensive experiments conducted on two challenging benchmarks, AVA and UCF101-24, demonstrate the effectiveness of our proposed designs. By using the unlabeled data effectively, our method achieves state-of-the-art performance in video action detection on both the AVA and UCF101-24 datasets. Besides, it can still achieve competitive performance compared with fully-supervised methods when using limited annotations on the AVA dataset.
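The per-class dynamic threshold idea can be sketched with a simple exponential-moving-average update; the EMA rule and all names below are our assumption, not the paper's exact learnable-threshold mechanism:

```python
import numpy as np

class DynamicThresholds:
    """Sketch of per-class dynamic thresholds for multi-label pseudo
    labels: each class threshold tracks that class's recent mean
    confidence, so rare, low-confidence classes get a lower bar."""

    def __init__(self, num_classes, init=0.5, momentum=0.9):
        self.tau = np.full(num_classes, init)
        self.momentum = momentum

    def pseudo_labels(self, probs):
        # probs: (batch, num_classes) sigmoid outputs of the model
        labels = (probs >= self.tau[None, :]).astype(float)
        # move each class threshold toward the batch-mean confidence
        self.tau = self.momentum * self.tau + (1 - self.momentum) * probs.mean(axis=0)
        return labels

thr = DynamicThresholds(num_classes=3)
probs = np.array([[0.9, 0.2, 0.6],
                  [0.8, 0.1, 0.4]])
labels = thr.pseudo_labels(probs)
```

After the update, the confident class 0 has a raised threshold while the low-confidence class 1 has a lowered one, which is the intended asymmetry for long-tailed classes.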

Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification

  • Bo Zhang
  • Jiakang Yuan
  • Baopu Li
  • Tao Chen
  • Jiayuan Fan
  • Botian Shi

Few-shot fine-grained learning aims to classify a query image into one of a set of support categories with fine-grained differences. Although learning different objects' local differences via deep neural networks has achieved success, how to exploit query-support cross-image object semantic relations in a Transformer-based architecture remains under-explored in the few-shot fine-grained scenario. In this work, we propose a Transformer-based double-helix model, namely HelixFormer, to achieve cross-image object semantic relation mining in a bidirectional and symmetrical manner. HelixFormer consists of two steps: 1) a Relation Mining Process (RMP) across the two branches, and 2) a Representation Enhancement Process (REP) within each individual branch. Through the designed RMP, each branch can extract fine-grained object-level Cross-image Semantic Relation Maps (CSRMs) using information from the other branch, ensuring better cross-image interaction in semantically related local object regions. Further, with the aid of CSRMs, the developed REP can strengthen the extracted features for the discovered semantically related local regions in each branch, boosting the model's ability to distinguish subtle feature differences of fine-grained objects. Extensive experiments conducted on five public fine-grained benchmarks demonstrate that HelixFormer can effectively enhance cross-image object semantic relation matching for recognizing fine-grained objects, achieving much better performance than most state-of-the-art methods under 1-shot and 5-shot scenarios.
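The bidirectional relation mining can be illustrated with plain cross-attention run in both directions; this is a generic sketch of the pattern, not HelixFormer's exact RMP:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(tokens_a, tokens_b):
    """One direction of (sketched) relation mining: tokens of one image
    attend to the other image's tokens, producing a cross-image
    relation map plus cross-attended features."""
    scale = 1.0 / np.sqrt(tokens_a.shape[-1])
    relation = softmax(tokens_a @ tokens_b.T * scale)   # (Na, Nb)
    return relation, relation @ tokens_b

rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(5, 16))   # query-image tokens
s_tokens = rng.normal(size=(7, 16))   # support-image tokens

# double-helix style: both directions, symmetrically
rel_qs, q_enhanced = cross_attend(q_tokens, s_tokens)
rel_sq, s_enhanced = cross_attend(s_tokens, q_tokens)
```

Each relation map row is a distribution over the other image's tokens, which is what lets semantically matching local regions reinforce each other.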

Progressive Spatial-temporal Collaborative Network for Video Frame Interpolation

  • Mengshun Hu
  • Kui Jiang
  • Liang Liao
  • Zhixiang Nie
  • Jing Xiao
  • Zheng Wang

Most video frame interpolation (VFI) algorithms infer the intermediate frame from adjacent frames through cascaded motion estimation and content refinement. However, the intrinsic correlations between motion and content are barely investigated, commonly producing interpolated results with inconsistent and blurry contents. We first discover a simple yet essential piece of domain knowledge, namely that the content and motion characteristics of the same object should be homogeneous to a certain degree, and we formulate this consistency into the loss function for model optimization. Based on this, we propose to learn a collaborative representation of motions and contents, and construct a novel Progressive Spatial-temporal Collaborative Network (Prost-Net) for video frame interpolation. Specifically, we develop a content-guided motion module (CGMM) and a motion-guided content module (MGCM) for individual content and motion representation. In particular, the motion predicted by the CGMM is used to guide the fusion and distillation of contents for intermediate frame interpolation, and vice versa. Furthermore, by applying this collaborative strategy in a multi-scale framework, our Prost-Net progressively optimizes motions and contents in a coarse-to-fine manner, making it robust to various challenging scenarios (occlusion and large motions) in VFI. Extensive experiments on benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods.

Best of Both Worlds: See and Understand Clearly in the Dark

  • Xinwei Xue
  • Jia He
  • Long Ma
  • Yi Wang
  • Xin Fan
  • Risheng Liu

Recently, with the development of intelligent technology, the perception of low-light scenes has been gaining widespread attention. However, existing techniques usually focus on only one task (e.g., enhancement) and lose sight of the others (e.g., detection), making it difficult to perform all of them well at the same time. To overcome this limitation, we propose a new method that can handle visual quality enhancement and semantic-related tasks (e.g., detection, segmentation) simultaneously in a unified framework. Specifically, we build a cascaded architecture to meet the task requirements. To better exploit the entanglement between the two tasks and achieve mutual guidance, we develop a new contrastive-alternative learning strategy for learning the model parameters, which largely improves the representational capacity of the cascaded architecture. Notably, the contrastive learning mechanism establishes communication between the two objective tasks in essence, which actually extends the capability of contrastive learning to some extent. Finally, extensive experiments are performed to fully validate the advantages of our method over other state-of-the-art works in enhancement, detection, and segmentation. A series of analytical evaluations are also conducted to reveal our effectiveness. The code is available at

Meta Clustering Learning for Large-scale Unsupervised Person Re-identification

  • Xin Jin
  • Tianyu He
  • Xu Shen
  • Tongliang Liu
  • Xinchao Wang
  • Jianqiang Huang
  • Zhibo Chen
  • Xian-Sheng Hua

Unsupervised Person Re-identification (U-ReID) with pseudo labeling has recently reached competitive performance compared to fully-supervised ReID methods, thanks to modern clustering algorithms. However, such a clustering-based scheme becomes computationally prohibitive for large-scale datasets, making it infeasible in real-world applications. How to efficiently leverage endless unlabeled data with limited computing resources for better U-ReID is under-explored. In this paper, we make the first attempt at large-scale U-ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL). MCL pseudo-labels only a subset of the entire unlabeled data via clustering, to save computation in the first-phase training. After that, the learned cluster centroids, termed meta-prototypes in our MCL, are regarded as a proxy annotator that softly annotates the remaining unlabeled data for further polishing the model. To alleviate the potential noisy-labeling issue in the polishing phase, we enforce two well-designed loss constraints to ensure intra-identity consistency and strong inter-identity correlation. On multiple widely used U-ReID benchmarks, our method significantly saves computational cost while achieving comparable or even better performance than prior works.
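The proxy-annotator step can be sketched as soft labeling by centroid distance; the temperature softmax below is our assumption, not MCL's actual annotator:

```python
import numpy as np

def proxy_annotate(feats, meta_prototypes, temperature=0.1):
    """Sketch of the second MCL phase: meta-prototypes (centroids
    learned on the small clustered subset) softly annotate the
    remaining unlabeled features via distance-based softmax."""
    # squared distance of every feature to every meta-prototype
    d2 = ((feats[:, None, :] - meta_prototypes[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    soft = np.exp(logits)
    return soft / soft.sum(axis=1, keepdims=True)

meta_prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])   # from first-phase clustering
unlabeled = np.array([[0.2, -0.1], [4.8, 5.1]])        # remaining data
soft_labels = proxy_annotate(unlabeled, meta_prototypes)
```

Because only the subset is ever clustered, the expensive step scales with the subset size rather than with the full dataset.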

Adjustable Memory-efficient Image Super-resolution via Individual Kernel Sparsity

  • Xiaotong Luo
  • Mingliang Dai
  • Yulun Zhang
  • Yuan Xie
  • Ding Liu
  • Yanyun Qu
  • Yun Fu
  • Junping Zhang

Though single image super-resolution (SR) has witnessed incredible progress, the increasing model complexity impairs its application in memory-limited devices. To solve this problem, prior works have aimed to reduce the number of model parameters by exploiting sparsity, usually enforcing a group sparsity constraint at the filter level, which is therefore not arbitrarily adjustable to satisfy customized memory requirements. In this paper, we propose an individual kernel sparsity (IKS) method for memory-efficient and sparsity-adjustable image SR to aid deep network deployment on memory-limited devices. IKS enforces model sparsity at the weight level, implicitly allocating the user-defined target sparsity to each individual kernel. To induce the kernel sparsity, a soft thresholding operation is used as a gating constraint that filters out trivial weights. To achieve adjustable sparsity, a dynamic threshold learning algorithm is proposed, in which the threshold is updated jointly with the network weights and is adaptively decayed under the guidance of the desired sparsity. This work essentially provides a dynamic parameter reassignment scheme under a given resource budget for an off-the-shelf SR model. Extensive experimental results demonstrate that IKS imparts considerable sparsity with negligible effect on SR quality. The code is available at:
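The soft-thresholding gate is a standard operation; a minimal sketch showing how a larger threshold yields higher sparsity (the threshold values are illustrative, not learned as in IKS):

```python
import numpy as np

def soft_threshold(weights, tau):
    """Soft-thresholding gate: weights whose magnitude falls below tau
    are zeroed and the rest are shrunk toward zero, inducing sparsity
    on individual kernel weights."""
    return np.sign(weights) * np.maximum(np.abs(weights) - tau, 0.0)

def sparsity(weights):
    return float(np.mean(weights == 0.0))

rng = np.random.default_rng(0)
w = rng.normal(size=1000)   # weights of one kernel

# a larger (learned) threshold silences more weights -> higher sparsity
s_small = sparsity(soft_threshold(w, tau=0.1))
s_large = sparsity(soft_threshold(w, tau=1.0))
```

Adjusting `tau` continuously adjusts the achieved sparsity, which is exactly why a learned, sparsity-guided threshold makes the memory footprint tunable.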

GT-MUST: Gated Try-on by Learning the Mannequin-Specific Transformation

  • Ning Wang
  • Jing Zhang
  • Lefei Zhang
  • Dacheng Tao

Given a mannequin (i.e., a reference person) and a target garment, the virtual try-on (VTON) task aims at automatically dressing the mannequin in the provided garment, and has attracted increasing attention in recent years. Previous works usually conduct garment deformation under the guidance of "shape". However, such shape-only transformation ignores local structures and results in unnatural distortions. To address this issue, we propose a Gated Try-on method that learns the ManneqUin-Specific Transformation (GT-MUST). Technically, we implement GT-MUST as a three-stage deep neural model. First, GT-MUST learns the mannequin-specific transformation with a "take-off" mechanism, which recovers the warped clothes on the mannequin to their original in-shop state. Then, the learned mannequin-specific transformation is inverted and utilized to help generate the mannequin-specific warped state of a target garment. Finally, a special gate is employed to better combine the mannequin-specific warped garment with the mannequin. GT-MUST benefits from learning to solve the much easier "take-off" task, rather than the common "try-on" task, to obtain the mannequin-specific information, since flat in-shop garments usually have less shape variation than those clothed on the body. Experiments on the fashion dataset demonstrate that GT-MUST outperforms state-of-the-art virtual try-on methods. The code is available at

PC2-PU: Patch Correlation and Point Correlation for Effective Point Cloud Upsampling

  • Chen Long
  • WenXiao Zhang
  • Ruihui Li
  • Hao Wang
  • Zhen Dong
  • Bisheng Yang

Point cloud upsampling densifies a sparse point set acquired from 3D sensors, providing a denser representation of the underlying surface. Existing methods divide the input points into small patches and upsample each patch separately; however, this ignores the global spatial consistency between patches. In this paper, we present a novel method, PC2-PU, which explores patch-to-patch and point-to-point correlations for more effective and robust point cloud upsampling. Specifically, our network has two appealing designs: (i) We take adjacent patches as supplementary inputs to compensate for the structural information lost within a single patch, and introduce a Patch Correlation Module to capture the differences and similarities between patches. (ii) After augmenting each patch's geometry, we further introduce a Point Correlation Module to reveal the relationships of points inside each patch, maintaining local spatial consistency. Extensive experiments on both synthetic and real scanned datasets demonstrate that our method surpasses previous upsampling methods, particularly on noisy inputs. The code and data are at:

Self-Supervised Multi-view Stereo via Adjacent Geometry Guided Volume Completion

  • Luoyuan Xu
  • Tao Guan
  • Yuesong Wang
  • Yawei Luo
  • Zhuo Chen
  • Wenkai Liu
  • Wei Yang

Existing self-supervised multi-view stereo (MVS) approaches largely rely on photometric consistency for geometry inference, and hence suffer from low-texture or non-Lambertian appearances. In this paper, we observe that adjacent geometry shares certain commonality that can help to infer the correct geometry of challenging or low-confidence regions. Yet exploiting this property in a non-supervised MVS approach remains challenging due to the lack of training data and the necessity of ensuring consistency between views. To address these issues, we propose a novel geometry inference training scheme that selectively masks regions with rich textures, where geometry can be well recovered and used as a supervisory signal, and then trains a deliberately designed cost volume completion network to learn how to recover the geometry of the masked regions. During inference, we instead mask the low-confidence regions and use the cost volume completion network for geometry correction. To deal with the different depth hypotheses of the cost volume pyramid, we design a three-branch volume inference structure for the completion network. Further, considering planes as a special geometry, we first identify planar regions from pseudo labels and then correct low-confidence pixels using high-confidence labels through plane normal consistency. Extensive experiments on DTU and Tanks & Temples demonstrate the effectiveness of the proposed framework and its state-of-the-art performance.

AtHom: Two Divergent Attentions Stimulated By Homomorphic Training in Text-to-Image Synthesis

  • Zhenbo Shi
  • Zhi Chen
  • Zhenbo Xu
  • Wei Yang
  • Liusheng Huang

Image generation from text is a challenging and ill-posed task. Images generated by previous methods usually have low semantic consistency with the texts, and the achieved resolution is limited. To generate semantically consistent high-resolution images, we propose a novel method named AtHom, in which two attention modules extract relationships from both the independent modalities and a unified modality. The first is a novel Independent Modality Attention Module (IAM), which identifies semantically important areas in generated images and extracts the informative context in texts. The second is a new module named the Unified Semantic Space Attention Module (UAM), which uncovers the relationships between the extracted text context and essential areas in generated images. In particular, to bring the semantic features of texts and images closer in a unified semantic space, AtHom incorporates a homomorphic training mode that exploits an extra discriminator to distinguish between the two modalities. Extensive experiments show that AtHom surpasses previous methods by large margins.

One-step Low-Rank Representation for Clustering

  • Zhiqiang Fu
  • Yao Zhao
  • Dongxia Chang
  • Yiming Wang
  • Jie Wen
  • Xingxing Zhang
  • Guodong Guo

Existing low-rank representation-based methods adopt a two-step framework, which must employ an extra clustering method to obtain labels after representation learning. In this paper, a novel one-step representation-based method, One-step Low-Rank Representation (OLRR), is proposed to capture multi-subspace structures for clustering. OLRR integrates the low-rank representation model and clustering into a unified framework, so it can jointly learn the low-rank subspace structure embedded in the database and obtain the clustering results. In particular, by approximating the representation matrix with the product of two identical clustering indicator matrices, OLRR directly yields the probability of each sample belonging to each cluster. Further, a probability penalty is introduced to ensure that samples with smaller distances are more inclined to fall in the same cluster, enhancing the discrimination of the clustering indicator matrix and resulting in more favorable clustering performance. Moreover, to enhance robustness against noise, OLRR uses the probabilities to guide denoising and then performs representation learning and clustering in the recovered clean space. Extensive experiments demonstrate the robustness and effectiveness of OLRR. Our code is publicly available at:
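
The core one-step idea, approximating a representation matrix Z by H @ H.T where the rows of H act as soft cluster indicators, can be caricatured in a few lines of NumPy. The eigen-decomposition shortcut below is an illustrative stand-in, not OLRR's actual optimization:

```python
import numpy as np

def indicator_from_representation(Z, n_clusters):
    # Symmetrize the learned representation into an affinity matrix.
    W = (np.abs(Z) + np.abs(Z).T) / 2
    # Top eigenvectors of W play the role of the indicator matrix H in
    # Z ~ H @ H.T; squaring and row-normalizing yields cluster probabilities.
    vals, vecs = np.linalg.eigh(W)            # eigenvalues in ascending order
    H = vecs[:, -n_clusters:] ** 2
    return H / H.sum(axis=1, keepdims=True)

# Toy data: a block representation matrix with two subspaces of different
# strength, so the top two eigenvectors are the block indicators.
Z = np.zeros((6, 6))
Z[:3, :3] = 0.9
Z[3:, 3:] = 0.6
H = indicator_from_representation(Z, 2)
labels = H.argmax(axis=1)   # samples 0-2 share one label, 3-5 the other
```

Each row of H sums to one, so it reads directly as the probability of the sample belonging to each cluster, which is the property the abstract highlights.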

Customizing GAN Using Few-shot Sketches

  • Syed Muhammad Israr
  • Feng Zhao

Generative adversarial networks (GANs) have demonstrated remarkable success in image synthesis applications, but their performance deteriorates under limited-data regimes. The fundamental challenge is that it is extremely difficult to synthesize photo-realistic and highly diversified images while capturing meaningful attributes of the targets under minimal supervision. Previous methods either fine-tune or rewrite the model weights to adapt to few-shot datasets; however, this either overfits or requires access to the large-scale data on which the models were trained. To tackle the problem, we propose a framework that repurposes existing pre-trained generative models using only a few samples (e.g., <30) of sketches. Unlike previous works, we transfer the sample diversity and quality without accessing the source data by using inter-domain distance consistency. By employing cross-domain adversarial learning, we encourage the model output to closely resemble the input sketches in both shape and pose. Extensive experiments show that our method significantly outperforms existing approaches in terms of sample quality and diversity. The qualitative and quantitative results on various standard datasets also demonstrate its efficacy. On the most widely used dataset, Gabled church, we achieve a Fréchet inception distance (FID) score of 15.63.

Video Coding using Learned Latent GAN Compression

  • Mustafa Shukor
  • Bharath Bhushan Damodaran
  • Xu Yao
  • Pierre Hellier

We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flows model, where an entropy model can be optimized for image coding. In addition, we propose a new perceptual loss that is more efficient than other counterparts. Finally, an entropy model for video inter coding with residual is also learned in the previously constructed latent representation. Our method (SGANC) is simple, faster to train, and achieves better results for image and video coding compared to state-of-the-art codecs such as VTM, AV1, and recent deep learning techniques. In particular, it drastically minimizes perceptual distortion at low bit rates.

Action-conditioned On-demand Motion Generation

  • Qiujing Lu
  • Yipeng Zhang
  • Mingjian Lu
  • Vwani Roychowdhury

We propose a novel framework, On-Demand MOtion Generation (ODMO), for generating realistic and diverse long-term 3D human motion sequences conditioned only on action types, with an additional capability of customization. ODMO shows improvements over SOTA approaches on all traditional motion evaluation metrics when evaluated on three public datasets (HumanAct12, UESTC, and MoCap). Furthermore, we provide both qualitative evaluations and quantitative metrics demonstrating several first-known customization capabilities afforded by our framework, including mode discovery, interpolation, and trajectory customization. These capabilities significantly widen the spectrum of potential applications of such motion generation models. The novel on-demand generative capabilities are enabled by innovations in both the encoder and decoder architectures: (i) Encoder: utilizing contrastive learning in a low-dimensional latent space to create a hierarchical embedding of motion sequences, where not only do the codes of different action types form different groups, but within an action type, codes of similar inherent patterns (motion styles) cluster together, making them readily discoverable; (ii) Decoder: using a hierarchical decoding strategy where the motion trajectory is reconstructed first and then used to reconstruct the whole motion sequence. Such an architecture enables effective trajectory control. Our code is released on the GitHub page:

Universal Domain Adaptive Object Detector

  • Wenxu Shi
  • Lei Zhang
  • Weijie Chen
  • Shiliang Pu

Universal domain adaptive object detection (UniDAOD) is more challenging than domain adaptive object detection (DAOD), since the label space of the source domain may not be the same as that of the target, and the scale of objects in universal scenarios can vary dramatically (i.e., category shift and scale shift). To this end, we propose US-DAF, namely Universal Scale-Aware Domain Adaptive Faster RCNN with Multi-Label Learning, to reduce the negative transfer effect during training while maximizing transferability as well as discriminability in both domains under a variety of scales. Specifically, our method is implemented by two modules: 1) we facilitate the feature alignment of common classes and suppress the interference of private classes by designing a Filter Mechanism module to overcome the negative transfer caused by category shift; 2) we fill the blank of scale-aware adaptation in object detection by introducing a new Multi-Label Scale-Aware Adapter that performs individual alignment between corresponding scales of the two domains. Experiments show that US-DAF achieves state-of-the-art results in three scenarios (i.e., Open-Set, Partial-Set, and Closed-Set) and, in particular, yields 7.1% and 5.9% relative improvements on the benchmark datasets Clipart1k and Watercolor, respectively.

PIMoG: An Effective Screen-shooting Noise-Layer Simulation for Deep-Learning-Based Watermarking Network

  • Han Fang
  • Zhaoyang Jia
  • Zehua Ma
  • Ee-Chien Chang
  • Weiming Zhang

With the omnipresence of camera phones and digital displays, capturing digitally displayed images with a camera phone has become widely practiced. In the context of watermarking, this brings forth the issue of screen-shooting robustness. The key to acquiring screen-shooting robustness is designing a good noise layer that represents screen-shooting distortions in a deep-learning-based watermarking framework. However, it is very difficult to quantitatively formulate screen-shooting distortion, since the screen-shooting process is too complex. In order to design an effective noise layer for screen-shooting robustness, we propose a new insight in this paper: it is not necessary to quantitatively simulate the overall procedure in the screen-shooting noise layer; including only the most influential distortions is enough to generate an effective noise layer with strong robustness. To verify this insight, we propose a screen-shooting noise layer dubbed PIMoG. Specifically, we summarize the most influential distortions of the screen-shooting process into three parts (perspective distortion, illumination distortion, and moiré distortion) and simulate them in a differentiable way. For the remaining distortions, we utilize Gaussian noise to approximate their main part. As a result, the whole network can be trained end-to-end with such a noise layer. Extensive experiments illustrate the superior performance of the proposed PIMoG noise layer. In addition to the noise layer design, we also propose a gradient mask-guided image loss and an edge mask-guided image loss to further improve the robustness and invisibility of the whole network, respectively. Based on the proposed losses and the PIMoG noise layer, the whole framework outperforms the SOTA watermarking methods by at least 5% in extraction accuracy and achieves more than 97% accuracy under different screen-shooting conditions.
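
A crude, non-differentiable stand-in for such a noise layer might combine an illumination gradient, a sinusoidal moiré pattern, and Gaussian residual noise, as sketched below (perspective warping omitted; all magnitudes are made-up, and the real PIMoG layer is differentiable so it can sit inside an end-to-end network):

```python
import numpy as np

def screen_shoot_noise(img, rng):
    """Apply toy screen-shooting distortions to a grayscale image in [0, 1]."""
    h, w = img.shape
    # Illumination distortion: linear brightness falloff across the screen.
    illum = np.linspace(0.8, 1.2, w)[None, :]
    # Moire distortion: interference between display grid and camera sensor.
    yy, xx = np.mgrid[0:h, 0:w]
    moire = 0.05 * np.sin(2 * np.pi * (0.13 * xx + 0.07 * yy))
    # Remaining distortions approximated by Gaussian noise, as in the insight.
    noise = rng.normal(0.0, 0.02, size=(h, w))
    return np.clip(img * illum + moire + noise, 0.0, 1.0)
```

A watermark decoder trained on images passed through such a layer sees distortions shaped like real screen shots, which is the point the abstract makes.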

MONOPOLY: Financial Prediction from MONetary POLicY Conference Videos Using Multimodal Cues

  • Puneet Mathur
  • Atula Neerkaje
  • Malika Chhibber
  • Ramit Sawhney
  • Fuming Guo
  • Franck Dernoncourt
  • Sanghamitra Dutta
  • Dinesh Manocha

Risk prediction and price movement classification are essential tasks in financial markets. Monetary policy calls (MPC) provide important insights into the actions taken by a country's central bank on economic goals related to inflation, employment, prices, and interest rates. Analyzing visual, vocal, and textual cues from MPC calls can help analysts and policymakers evaluate economic risks and make sound investment decisions. To aid the analysis of MPC calls, we curate the Monopoly dataset, a collection of public conference call videos along with their corresponding audio recordings and text transcripts released by six international banks between 2009 and 2022. Our dataset is the first attempt to explore the benefits of visual cues in addition to audio and textual signals for financial prediction tasks. We introduce MPCNet, a competitive baseline architecture that takes advantage of cross-modal transformer blocks and modality-specific attention fusion to forecast the financial risk and price movement associated with MPC calls. Empirical results show that the task is challenging, with the proposed architecture performing 5-18% better than strong Transformer-based baselines. We release the MPC dataset and benchmark models to motivate future research in this new and challenging domain.

Structure-Inferred Bi-level Model for Underwater Image Enhancement

  • Pan Mu
  • Haotian Qian
  • Cong Bai

Very recently, with the development of underwater robots, underwater image enhancement has attracted growing interest in the computer vision community. However, because light is scattered and absorbed while traveling through water, underwater captured images often suffer from color cast and low visibility. Existing methods depend on specific prior knowledge and training data to enhance underwater images in the absence of structure information, which results in poor and unnatural performance. To this end, we propose a Structure-Inferred Bi-level Model (SIBM) that incorporates different modalities of knowledge (i.e., the semantic domain, gradient domain, and pixel domain) to hierarchically enhance underwater images. In particular, by introducing a semantic mask, we individually optimize the foreground branch, which avoids unnecessary interference arising from the background region. We design a gradient-based high-frequency branch to exploit gradient-space guidance for preserving texture structures. Moreover, we construct a pixel-based branch that is fed semantic and gradient information to enhance underwater images. To exploit the different modalities, we introduce a hyper-parameter optimization scheme to fuse the above domain information. Experimental results illustrate that the developed method not only outperforms previous methods in quantitative scores but also generalizes well to real-world underwater datasets. Source code is available at:

Composite Photograph Harmonization with Complete Background Cues

  • Yazhou Xing
  • Yu Li
  • Xintao Wang
  • Ye Zhu
  • Qifeng Chen

Compositing portrait photographs or videos onto novel backgrounds is an important application in computational photography. Seamless blending along boundaries and globally harmonious colors are two desired properties of a photo-realistic composition of foregrounds and new backgrounds. Existing works are dedicated to either foreground alpha matte generation or after-blending harmonization, leading to sub-optimal background replacement when putting foregrounds and backgrounds together. In this work, we unify the two objectives in a single framework to obtain realistic portrait image composites. Specifically, we investigate the usage of the target background and find that a complete background plays a vital role in both seamless blending and harmonization. We develop a network that learns the composition process given an imperfect alpha matte, with appearance features extracted from the complete background to adjust the color distribution. Our dedicated usage of a complete background enables realistic portrait image composition as well as temporally stable results on videos. Extensive quantitative and qualitative experiments on both synthetic and real-world data demonstrate that our method achieves state-of-the-art performance.

Self-supervised Multi-view Stereo via Inter and Intra Network Pseudo Depth

  • Ke Qiu
  • Yawen Lai
  • Shiyi Liu
  • Ronggang Wang

Recent self-supervised learning-based multi-view stereo (MVS) approaches have shown promising results. However, previous methods primarily utilize view synthesis as a replacement for costly ground-truth depth data to guide network learning, still leaving a performance gap with recent supervised methods. In this paper, we propose a self-supervised dual-network MVS framework with inter- and intra-network pseudo depth labels for more powerful supervision guidance. Specifically, the inter-network pseudo depth labels are estimated by an unsupervised network, filtered by multi-view geometry consistency, updated iteratively by a pseudo-depth-supervised network, and finally refined by our efficient geometry-priority sampling strategy. We also dynamically generate multi-scale intra-network pseudo labels inside our cascade unsupervised network during training to provide additional reliable supervision. Experimental results on the DTU and Tanks & Temples datasets demonstrate that our proposed methods achieve state-of-the-art performance among unsupervised methods and even achieve comparable performance and generalization ability to supervised counterparts.

Delegate-based Utility Preserving Synthesis for Pedestrian Image Anonymization

  • Zhenzhong Kuang
  • Longbin Teng
  • Zhou Yu
  • Jun Yu
  • Jianping Fan
  • Mingliang Xu

The rapidly growing use of pedestrian images has raised wide concern about visual privacy protection, because personal information is at risk of disclosure. Anonymization through identity obfuscation is regarded as an effective solution. Most recent methods focus on the face, but this is not enough when the human body itself carries a great deal of identifiable information. This paper presents a new delegate-based utility-preserving synthesis (DUPS) approach for pedestrian image anonymization. This is challenging because one may expect the anonymized image to remain useful in various computer vision tasks. We model DUPS as an adaptive translation process from source to target. To provide comprehensive identity protection, we first perform anonymous delegate sampling based on image-level differential privacy. To synthesize anonymous images, we then introduce an adaptive translation network and optimize it with a multi-task loss function. Our approach is theoretically sound and can generate diverse results while preserving data utility. Experiments on multiple datasets show that DUPS not only achieves superior anonymization performance against deep pedestrian recognizers, but also obtains a better tradeoff between privacy protection and utility preservation than state-of-the-art methods.

Video Instance Lane Detection via Deep Temporal and Geometry Consistency Constraints

  • Mingqian Wang
  • Yujun Zhang
  • Wei Feng
  • Lei Zhu
  • Song Wang

Video instance lane detection is one of the most important tasks in autonomous driving. Due to the very sparse regions and weak context in lane annotations, accurately detecting instance-level lanes in real-world traffic scenarios is challenging, especially in scenes with occlusion, bad weather, or dim or dazzling lights. Current methods mainly address this problem by integrating features of adjacent video frames to encourage temporal constancy in image-level lane detectors. However, most of them ignore the lane shape constraints of adjacent frames and the geometry consistency of individual lanes, thereby harming the performance of video instance lane detection. In this paper, we propose TGC-Net, which exploits temporal and geometry consistency constraints for reliable video instance lane detection. Specifically, we devise a temporal recurrent feature-shift aggregation module (T-RESA) to learn spatio-temporal lane features along the horizontal, vertical, and temporal directions of the feature tensor. We further impose a temporal consistency constraint by encouraging spatial distribution consistency among the lane features of adjacent frames. Besides, we devise two effective geometry constraints that ensure the integrity and continuity of lane predictions by leveraging a pairwise point affinity loss and vanishing-point-guided geometric context, respectively. Extensive experiments on a public benchmark dataset show that TGC-Net quantitatively and qualitatively outperforms state-of-the-art video instance lane detectors and video object segmentation competitors. Our code and results have been released at

Learning Visible Surface Area Estimation for Irregular Objects

  • Xu Liu
  • Jianing Li
  • Xianqi Zhang
  • Jingyuan Sun
  • Xiaopeng Fan
  • Yonghong Tian

Visible surface area estimation for irregular objects, one of the most fundamental and challenging topics in mathematics, supports a wide range of applications. Existing techniques usually estimate the visible surface area via mathematical modeling from 3D point clouds. However, 3D scanners are expensive, and the corresponding evaluation methods are complex. In this paper, we propose a novel problem setting, deep learning for visible surface area estimation, which is the first attempt to estimate the visible surface area of irregular objects from monocular images. Technically, we first build a novel visible surface area estimation dataset including 9099 real annotations. Then, we design a learning-based architecture to predict the visible surface area with two core modules (i.e., a classification module and an area-bins module). The classification module predicts the interval of the visible surface area distribution and assists network training for more accurate estimation. Meanwhile, the area-bins module, built on a transformer encoder, is proposed to distinguish differences in visible surface area between irregular objects of the same category. The experimental results demonstrate that our approach can effectively estimate the visible surface area of irregular objects of various categories and sizes. We hope that this work will attract further research into this newly identified yet crucial research direction. Our source code and data are available at:

Blind Robust Video Watermarking Based on Adaptive Region Selection and Channel Reference

  • Qinwei Chang
  • Leichao Huang
  • Shaoteng Liu
  • Hualuo Liu
  • Tianshu Yang
  • Yexin Wang

Digital watermarking technology has a wide range of applications in video distribution and copyright protection due to its excellent invisibility and convenient traceability. This paper proposes a robust blind watermarking algorithm using adaptive region selection and channel reference. By designing a combinatorial selection algorithm based on texture information and feature points, the method automatically selects stable blocks that avoid being destroyed during video encoding and complex attacks. In addition, considering human insensitivity to some specific color components, a channel-referenced watermark embedding method is designed to reduce the impact on video quality. Moreover, unlike other methods that embed the watermark only at low frequencies, our method tends to modify low-frequency coefficients close to the mid frequencies, further ensuring stable retention of the watermark information during video encoding. Experimental results show that the proposed method achieves excellent video quality and high robustness against geometric attacks, compression, transcoding, and camcorder recording attacks.
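
A minimal sketch of texture-driven block selection, with local variance standing in for the paper's combination of texture information and feature points (the scoring rule and block size here are assumptions for illustration):

```python
import numpy as np

def select_stable_blocks(img, block=8, top_k=4):
    """Return (y, x) corners of the top_k most textured blocks.

    Highly textured blocks are preferred as embedding regions because
    their coefficients survive encoding better and the watermark is
    less visible there.
    """
    h, w = img.shape
    scores = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            # Local variance as a simple texture score.
            scores.append((img[y:y + block, x:x + block].var(), (y, x)))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [pos for _, pos in scores[:top_k]]
```

The actual method combines such texture cues with feature-point stability so that the chosen blocks can be relocated after geometric attacks.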

Disparity-based Stereo Image Compression with Aligned Cross-View Priors

  • Yongqi Zhai
  • Luyang Tang
  • Yi Ma
  • Rui Peng
  • Ronggang Wang

With the wide application of stereo images in various fields, research on stereo image compression (SIC) has attracted extensive attention from academia and industry. The core of SIC is to fully exploit the mutual information between the left and right images and reduce redundancy between views as much as possible. In this paper, we propose DispSIC, an end-to-end trainable deep neural network, in which we jointly train a stereo matching model to assist the image compression task. Based on the stereo matching results (i.e., disparity), the right image can easily be warped to the left view, and only the residuals between the left and right views are encoded for the left image. A three-branch auto-encoder architecture is adopted in DispSIC, which encodes the right image, the disparity map, and the residuals, respectively. During training, the whole network learns how to adaptively allocate bitrates to these three parts, achieving better rate-distortion performance at the cost of only a small disparity-map bitrate. Moreover, we propose a conditional entropy model with aligned cross-view priors for SIC, which takes the warped latents of the right image as priors to improve the accuracy of the probability estimation for the left image. Experimental results demonstrate that our proposed method achieves superior performance compared to other existing SIC methods on the KITTI and InStereo2K datasets, both quantitatively and qualitatively.
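
The disparity-based warping at the heart of this pipeline can be sketched as follows (a toy loop version with integer disparities; the real system uses differentiable warping inside the network):

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Warp the right image to the left view using per-pixel disparity.

    For rectified stereo, the left pixel at column x corresponds to the
    right pixel at column x - d.  The left image can then be coded as
    the residual left - warped, which is what DispSIC encodes.
    """
    h, w = right.shape
    warped = np.zeros_like(right)
    for y in range(h):
        for x in range(w):
            xs = x - disparity[y, x]
            if 0 <= xs < w:           # out-of-range pixels stay zero
                warped[y, x] = right[y, xs]
    return warped
```

Where the disparity is accurate, the residual is near zero and costs few bits; bitrate then concentrates on occlusions and matching errors.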

Label-Efficient Domain Generalization via Collaborative Exploration and Generalization

  • Junkun Yuan
  • Xu Ma
  • Defang Chen
  • Kun Kuang
  • Fei Wu
  • Lanfen Lin

Considerable progress has been made in domain generalization (DG), which aims to learn a generalizable model from multiple well-annotated source domains for unknown target domains. However, it can be prohibitively expensive to obtain sufficient annotation for source datasets in many real scenarios. To escape the dilemma between domain generalization and annotation costs, in this paper we introduce a novel task named label-efficient domain generalization (LEDG) to enable model generalization with label-limited source domains. To address this challenging task, we propose a novel framework called Collaborative Exploration and Generalization (CEG), which jointly optimizes active exploration and semi-supervised generalization. Specifically, in active exploration, to explore class and domain discriminability while avoiding information divergence and redundancy, we query the labels of the samples with the highest overall ranking of class uncertainty, domain representativeness, and information diversity. In semi-supervised generalization, we design MixUp-based intra- and inter-domain knowledge augmentation to expand domain knowledge and generalize domain invariance. We unify active exploration and semi-supervised generalization in a collaborative way and promote mutual enhancement between them, boosting model generalization with limited annotation. Extensive experiments show that CEG yields superior generalization performance. In particular, CEG can use only a 5% annotation budget to achieve results competitive with previous DG methods trained on fully labeled data on the PACS dataset.
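
The overall-ranking query rule described above can be sketched as below, with predictive entropy standing in for class uncertainty; the domain-representativeness and diversity scores are assumed to be given, and the exact scoring functions are the paper's, not this sketch's:

```python
import numpy as np

def query_ranking(probs, domain_scores, diversity_scores, k):
    """Pick k samples with the best combined rank across three criteria.

    probs: (N, C) predicted class probabilities for unlabeled samples.
    domain_scores, diversity_scores: (N,) scores, higher = better.
    """
    eps = 1e-12
    # Class uncertainty: predictive entropy of the current model.
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)

    def ranks(v):
        # Rank position per sample: 0 = best (largest value).
        return np.argsort(np.argsort(-v))

    overall = ranks(entropy) + ranks(domain_scores) + ranks(diversity_scores)
    return np.argsort(overall)[:k]   # indices to send for annotation
```

Summing rank positions rather than raw scores sidesteps the need to put the three criteria on a common scale.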

Progressive Unsupervised Learning of Local Descriptors

  • Wufan Wang
  • Lei Zhang
  • Hua Huang

Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.

Graph Reasoning Transformer for Image Parsing

  • Dong Zhang
  • Jinhui Tang
  • Kwang-Ting Cheng

Capturing the long-range dependencies has empirically proven to be effective on a wide range of computer vision tasks. The progressive advances on this topic have been made through the employment of the transformer framework with the help of the multi-head attention mechanism. However, the attention-based image patch interaction potentially suffers from problems of redundant interactions of intra-class patches and unoriented interactions of inter-class patches. In this paper, we propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern. Specifically, the linearly embedded image patches are first projected into the graph space, where each node represents the implicit visual center for a cluster of image patches and each edge reflects the relation weight between two adjacent nodes. After that, global relation reasoning is performed on this graph accordingly. Finally, all nodes including the relation information are mapped back into the original space for subsequent processes. Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern. Experiments are carried out on the challenging Cityscapes and ADE20K datasets. Results show that GReaT achieves consistent performance gains with slight computational overheads on the state-of-the-art transformer baselines.

Opportunistic Backdoor Attacks: Exploring Human-imperceptible Vulnerabilities on Speech Recognition Systems

  • Qiang Liu
  • Tongqing Zhou
  • Zhiping Cai
  • Yonghao Tang

Speech recognition systems, trained and updated on large-scale audio data, are vulnerable to backdoor attacks that inject dedicated triggers during system training. The triggers used are generally human-inaudible audio, such as ultrasonic waves. However, we note that such a design is not practical, as it can be easily filtered out by pre-processing. In this work, we propose the first audible backdoor attack paradigm for speech recognition, characterized by passive triggering and opportunistic invocation. Traditional device-synthesized triggers are replaced with ambient noise from daily scenarios. To adapt triggers to the application dynamics of speech interaction, we exploit the knowledge a trained model inherits from its context and accommodate injection and poisoning with certainty-based trigger selection, performance-oblivious sample binding, and trigger late-augmentation. Experiments on two datasets under various environments evaluate the proposal's effectiveness in maintaining a high benign rate and achieving an outstanding attack success rate (99.27%, ~4% higher than BadNets), robustness (bounded infectious triggers), and feasibility in real-world scenarios. It requires less than 1% of the data to be poisoned and is demonstrated to resist typical speech enhancement techniques and general countermeasures (e.g., dedicated fine-tuning). The code and data will be made available at

Certifying Better Robust Generalization for Unsupervised Domain Adaptation

  • Zhiqiang Gao
  • Shufei Zhang
  • Kaizhu Huang
  • Qiufeng Wang
  • Rui Zhang
  • Chaoliang Zhong

Recent studies explore how to achieve adversarial robustness for unsupervised domain adaptation (UDA). These efforts are, however, dedicated to achieving an optimal trade-off between accuracy and robustness on a given or seen target domain, while ignoring the robust generalization issue over unseen adversarial data. Consequently, degraded performance is often observed when existing robust UDA methods are applied to future adversarial data. In this work, we make a first attempt to address the robust generalization issue of UDA. We conjecture that the poor robust generalization of present robust UDA methods may be caused by the large distribution gap among adversarial examples. We then provide an empirical and theoretical analysis showing that this large distribution gap is mainly due to the discrepancy between feature-shift distributions. To reduce this discrepancy, a novel Anchored Feature-Shift Regularization (AFSR) method is designed with a certified robust generalization bound. We conduct a series of experiments on benchmark UDA datasets; the results validate the effectiveness of our proposed AFSR over many existing robust UDA methods.

Multimodal In-bed Pose and Shape Estimation under the Blankets

  • Yu Yin
  • Joseph P. Robinson
  • Yun Fu

Advancing technology to monitor our bodies and behavior while sleeping and resting is essential for healthcare. However, keen challenges arise from our tendency to rest under blankets. We present a multimodal approach to uncover subjects and view bodies at rest without the blankets obscuring the view. For this, we introduce a channel-based fusion scheme that effectively fuses different modalities in a way that best leverages the knowledge captured by the multimodal sensors, both visual and non-visual. The channel-based fusion scheme enhances the model's flexibility at inference: anywhere from one to all of the input modalities may be supplied at test time. Nonetheless, multimodal data or not, detecting humans at rest in bed remains a challenge due to the extreme occlusion when covered by a blanket. To mitigate the negative effects of blanket occlusion, we use an attention-based reconstruction module to explicitly reduce the uncertainty of occluded parts by generating uncovered modalities, which further update the current estimation in a cyclic fashion. Extensive experiments validate the proposed model's superiority over others.
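
A toy version of channel-based fusion with a fixed channel layout and zero-filled missing modalities (the modality names and zero-filling are assumptions for illustration; the paper's scheme fuses with learned weights rather than plain stacking):

```python
import numpy as np

def channel_fusion(slots, available):
    """Stack modalities along the channel axis in a fixed order.

    slots: ordered modality names defining the channel layout.
    available: dict mapping a subset of those names to (H, W) arrays.
    Missing modalities are zero-filled, so the same network input shape
    works whether one or all modalities are present at test time.
    """
    h, w = next(iter(available.values())).shape
    return np.stack([available.get(name, np.zeros((h, w))) for name in slots])
```

Because the layout is fixed, the downstream network never needs to know which sensors were actually present for a given sample.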

Progressive Limb-Aware Virtual Try-On

  • Xiaoyu Han
  • Shengping Zhang
  • Qinglin Liu
  • Zonglin Li
  • Chenyang Wang

Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively.

Text Style Transfer based on Multi-factor Disentanglement and Mixture

  • Anna Zhu
  • Zhanhui Yin
  • Brian Kenji Iwana
  • Xinyu Zhou
  • Shengwu Xiong

Text style transfer aims to transfer the reference style of one text image to another text image. Previous works have only been able to transfer the style to a binary text image. In this paper, we propose a framework to disentangle the text images into three factors: text content, font, and style features, and then remix the factors of different images to transfer a new style. Both the reference and input text images have no style restrictions. Adversarial training through multi-factor cross recognition is adopted in the network for better feature disentanglement and representation. To decompose the input text images into a disentangled representation with swappable factors, the network is trained using similarity mining within pairs of exemplars. To train our model, we synthesized a new dataset with various text styles in both English and Chinese. Several ablation studies and extensive experiments on our designed and public datasets demonstrate the effectiveness of our approach for text style transfer.

Cloud2Sketch: Augmenting Clouds with Imaginary Sketches

  • Zhaoyi Wan
  • Dejia Xu
  • Zhangyang Wang
  • Jian Wang
  • Jiebo Luo

Have you ever looked up at the sky and imagined what the clouds look like? In this work, we present an interesting task that augments clouds in the sky with imagined sketches. Different from generic image-to-sketch translation tasks, it introduces unique challenges: real-world clouds vary in how closely they resemble recognizable objects; generating a sketch without retrieval can produce something unrecognizable; a sketch retrieved from a dataset cannot be used directly because its shape does not match the cloud; and what counts as an optimal sketch imagination is subjective. We propose Cloud2Sketch, a novel self-supervised pipeline to tackle these challenges. First, we pre-process cloud images with a cloud detector and a thresholding algorithm to obtain cloud contours. Then, the cloud contours are passed through a retrieval module to retrieve sketches with similar geometric shapes. Finally, we adopt a novel sketch translation model with built-in free-form deformation to align the sketches with the cloud contours. To facilitate training, an icon-based sketch collection named Sketchy Zoo is proposed. Extensive experiments validate the effectiveness of our method both qualitatively and quantitatively.

CycleHand: Increasing 3D Pose Estimation Ability on In-the-wild Monocular Image through Cyclic Flow

  • Daiheng Gao
  • Xindi Zhang
  • Xingyu Chen
  • Andong Tan
  • Bang Zhang
  • Pan Pan
  • Ping Tan

Current methods for 3D hand pose estimation fail to generalize well to new in-the-wild scenarios due to varying camera viewpoints, self-occlusions, and complex environments. To address this problem, we propose CycleHand to improve the generalization ability of the model in a self-supervised manner. Our motivation is based on an observation: if one globally rotates the whole hand and then reversely rotates it back, the estimated 3D poses of the fingers should remain consistent before and after the rotation, because the wrist-relative hand pose stays unchanged under global 3D rotation. Hence, we propose arbitrary-rotation self-supervised consistency learning to improve the model's robustness to varying viewpoints. Another innovation of CycleHand is a high-fidelity texture map used to render the rotated hand photorealistically with different lighting conditions, backgrounds, and skin tones, further enhancing the effectiveness of our self-supervised task. To reduce the potential negative effects of the domain shift of synthetic images, we use the idea of contrastive learning to train a synthetic-real consistent feature extractor that produces domain-irrelevant hand representations. Experiments show that CycleHand largely improves hand pose estimation performance on both canonical datasets and real-world applications.
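
The rotate-and-rotate-back consistency can be illustrated with a minimal geometric sketch (a toy check on raw 3D joints, not the CycleHand network): applying a global rotation followed by its inverse must return every joint to its original position, which is the invariant the self-supervised loss exploits.

```python
import math

# Toy cyclic consistency: rotate hand joints by R, then by R's inverse,
# and measure how far each joint is from where it started.

def rot_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(R, pts):
    """Apply a 3x3 rotation to a list of 3D points."""
    return [[sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]
            for p in pts]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

joints = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.2], [1.5, 0.9, 0.4]]  # wrist + fingers
R = rot_z(math.pi / 3)
round_trip = apply(transpose(R), apply(R, joints))  # R orthonormal: R^T = R^-1
err = max(abs(a - b) for p, q in zip(joints, round_trip)
          for a, b in zip(p, q))
```

A pose estimator that violates this round-trip identity on rendered rotations incurs a consistency penalty in such a scheme.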

Defeating DeepFakes via Adversarial Visual Reconstruction

  • Ziwen He
  • Wei Wang
  • Weinan Guan
  • Jing Dong
  • Tieniu Tan

Existing DeepFake detection methods focus on passive detection, i.e., they detect fake face images by exploiting the artifacts produced during DeepFake manipulation. These detection-based methods have a limitation: they only work for ex-post forensics and cannot erase the negative influence of DeepFakes. In this work, we propose a proactive framework for combating DeepFakes before the data manipulation occurs. The key idea is to find a well-defined substitute latent representation with which to reconstruct the target facial data, such that the reconstructed face disables DeepFake generation. To this end, we invert face images into latent codes with a well-trained auto-encoder and search for adversarial face embeddings in their neighborhood via gradient descent. Extensive experiments on three typical DeepFake manipulation methods (facial attribute editing, facial expression manipulation, and face swapping) demonstrate the effectiveness of our method in different settings.

Content based User Preference Modeling in Music Generation

  • Xichu Ma
  • Yuchen Wang
  • Ye Wang

Automatic music generation (AMG) has been an emerging research topic in AI in recent years. However, generating user-preferred music remains an unsolved problem. To address this challenge, we propose a hierarchical convolutional recurrent neural network with self-attention (CRNN-SA) to extract user music preference (UMP) and map it into an embedding space where the common UMPs are in the center and uncommon UMPs are scattered towards the edge. We then propose an explainable music distance measure as a bridge between the UMP and AMG; this measure computes the distance between a seed song and the user's UMP. That distance is then employed to adjust the AMG's parameters which control the music generation process in an iterative manner, so that the generated song will be closer to the user's UMP in every iteration. Experiments demonstrate that the proposed UMP embedding model successfully captures individual UMPs and that our proposed system is capable of generating user-preferred songs.
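
The iterative adjust-and-regenerate loop can be sketched as follows (hypothetical names and a toy 2-D embedding space; the real system adjusts the AMG's generation parameters rather than moving an embedding directly): each iteration nudges the generated song toward the user's UMP, so the measured distance shrinks.

```python
# Toy version of the closed loop: measure distance from the generated song's
# embedding to the UMP center, then take one adjustment step toward it.

def distance(song, center):
    """Euclidean distance between a song embedding and the UMP center."""
    return sum((a - b) ** 2 for a, b in zip(song, center)) ** 0.5

def adjust(song, center, step=0.3):
    """One generation iteration, moving the song a step toward the center."""
    return [s + step * (c - s) for s, c in zip(song, center)]

song, center = [0.0, 4.0], [2.0, 1.0]
dists = [distance(song, center)]
for _ in range(10):
    song = adjust(song, center)
    dists.append(distance(song, center))
```

The monotone shrinkage of `dists` mirrors the abstract's claim that each iteration brings the generated song closer to the user's UMP.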

CrossHuman: Learning Cross-guidance from Multi-frame Images for Human Reconstruction

  • Liliang Chen
  • Jiaqi Li
  • Han Huang
  • Yandong Guo

We propose CrossHuman, a novel method that learns cross-guidance from a parametric human model and multi-frame RGB images to achieve high-quality 3D human reconstruction. To recover geometry details and texture even in invisible regions, we design a reconstruction pipeline that combines tracking-based and tracking-free methods. Given a monocular RGB sequence, we track the parametric human model throughout the sequence, and the points (voxels) corresponding to the target frame are warped to reference frames by the parametric body motion. Guided by the geometry priors of the parametric body and spatially aligned features from the RGB sequence, a robust implicit surface is fused. Moreover, a multi-frame transformer (MFT) and a self-supervised warp refinement module are integrated into the framework to relax the requirements on the parametric body and to help handle very loose clothing. Compared with previous works, CrossHuman produces high-fidelity geometry details and texture in both visible and invisible regions and improves the accuracy of human reconstruction even under inaccurately estimated parametric human models. The experiments demonstrate that our method achieves state-of-the-art (SOTA) performance.

High-Quality 3D Face Reconstruction with Affine Convolutional Networks

  • Zhiqian Lin
  • Jiangke Lin
  • Lincheng Li
  • Yi Yuan
  • Zhengxia Zou

Recent works based on convolutional encoder-decoder architectures and 3DMM parameterization have shown great potential for canonical view reconstruction from a single input image. Conventional CNN architectures benefit from exploiting the spatial correspondence between input and output pixels. However, in 3D face reconstruction, the spatial misalignment between the input image (e.g., a face) and the canonical/UV output makes the feature encoding-decoding process quite challenging. In this paper, to tackle this problem, we propose a new network architecture, namely Affine Convolution Networks, which enables CNN-based approaches to handle spatially non-corresponding input and output images while maintaining high-fidelity output quality. In our method, an affine transformation matrix is learned by the affine convolution layer for each spatial location of the feature maps. In addition, we represent 3D human heads in UV space with multiple components, including diffuse maps for texture, position maps for geometry, and light maps for recovering the more complex lighting conditions of the real world. All components can be trained without any manual annotations. Our method is parametric-free and can generate high-quality UV maps at a resolution of 512 x 512 pixels, while previous approaches normally generate 256 x 256 pixels or smaller. Our code will be released once the paper is accepted.

xCloth: Extracting Template-free Textured 3D Clothes from a Monocular Image

  • Astitva Srivastava
  • Chandradeep Pokhariya
  • Sai Sagar Jinka
  • Avinash Sharma

Existing approaches for 3D garment reconstruction either assume a predefined template for the garment geometry (restricting them to fixed clothing styles) or yield vertex-colored meshes (lacking high-frequency textural details). Our novel framework co-learns geometric and semantic information of garment surface from the input monocular image for template-free textured 3D garment digitization. More specifically, we propose to extend PeeledHuman representation to predict the pixel-aligned, layered depth and semantic maps to extract 3D garments. The layered representation is further exploited to UV parametrize the arbitrary surface of the extracted garment without any human intervention to form a UV atlas. The texture is then imparted on the UV atlas in a hybrid fashion by first projecting pixels from the input image to UV space for the visible region, followed by inpainting the occluded regions. Thus, we are able to digitize arbitrarily loose clothing styles while retaining high-frequency textural details from a monocular image. We achieve high-fidelity 3D garment reconstruction results on three publicly available datasets and generalization on internet images.

SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute

  • Kangneng Zhou
  • Xiaobin Zhu
  • Daiheng Gao
  • Kai Lee
  • Xinjie Li
  • Xu-cheng Yin

Manipulating latent codes in generative adversarial networks (GANs) for facial image synthesis mainly focuses on continuous attribute synthesis (e.g., age, pose, and emotion), while discrete attribute synthesis (such as face masks and eyeglasses) receives less attention. Directly applying existing works to facial discrete attributes may cause inaccurate results. In this work, we propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposition, dubbed SD-GAN. To be concrete, we explicitly decompose the discrete attribute representation into two components, i.e., the semantic prior basis and an offset latent representation. The semantic prior basis provides an initial direction for manipulating the face representation in the latent space. The offset latent representation, obtained by a 3D-aware semantic fusion network, adjusts the prior basis. In addition, the fusion network integrates 3D embeddings for better identity preservation and discrete attribute synthesis. The combination of the prior basis and the offset latent representation enables our method to synthesize photo-realistic face images with discrete attributes. Notably, we construct a large and valuable dataset, MEGN (Face Mask and Eyeglasses images crawled from Google and Naver), to remedy the lack of discrete attributes in existing datasets. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at an anonymous website:

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

  • Rongjie Huang
  • Chenye Cui
  • Feiyang Chen
  • Yi Ren
  • Jinglin Liu
  • Zhou Zhao
  • Baoxing Huai
  • Zhefeng Wang

Deep generative models have achieved significant progress in speech synthesis to date, while high-fidelity singing voice synthesis is still an open problem due to its long continuous pronunciation, rich high-frequency content, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot be directly applied to singing voice synthesis because they produce glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically: 1) to alleviate the glitch problem in the generated samples, we propose source excitation with adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and promote high-frequency reconstruction; and 3) to improve training efficiency, SingGAN includes auxiliary spectrogram losses and a sub-band feature matching penalty loss. To the best of our knowledge, SingGAN is the first work designed for high-fidelity singing voice vocoding. Our evaluation of SingGAN demonstrates state-of-the-art results with higher-quality (MOS 4.05) samples. Also, SingGAN enables a sampling speed 50x faster than real-time on a single NVIDIA 2080Ti GPU. We further show that SingGAN generalizes well to mel-spectrogram inversion of unseen singers, and that the end-to-end singing voice synthesis system SingGAN-SVS uses a two-stage pipeline to transform music scores into expressive singing voices.

Design What You Desire: Icon Generation from Orthogonal Application and Theme Labels

  • Yinpeng Chen
  • Zhiyu Pan
  • Min Shi
  • Hao Lu
  • Zhiguo Cao
  • Weicai Zhong

Generative adversarial networks (GANs) have been trained to be professional artists, able to create stunning artworks through tasks such as face generation and image style transfer. In this paper, we focus on a realistic business scenario: automated generation of customizable icons given desired mobile applications and theme styles. We first introduce a theme-application icon dataset, termed AppIcon, where each icon has two orthogonal theme and app labels. By investigating a strong baseline, StyleGAN2, we observe mode collapse caused by the entanglement of the orthogonal labels. To solve this challenge, we propose IconGAN, composed of a conditional generator and dual discriminators with orthogonal augmentations, and further design a contrastive feature disentanglement strategy to regularize the feature space of the two discriminators. Compared with other approaches, IconGAN shows a clear advantage on the AppIcon benchmark. Further analysis also justifies the effectiveness of disentangling app and theme representations. Our project will be released at:

Semantically-Consistent Dynamic Blurry Image Generation for Image Deblurring

  • Zhaohui Jing
  • Youjian Zhang
  • Chaoyue Wang
  • Daqing Liu
  • Yong Xia

The training of deep learning-based image deblurring models relies heavily on paired sharp/blurry image datasets. Although many works have verified that synthesized blurry-sharp pairs help improve deblurring performance, how to synthesize realistic and diverse dynamic blurry images remains an open problem. Instead of directly synthesizing blurry images, in this paper we propose a novel method to generate semantic-aware dense dynamic motion and employ the generated motion to synthesize blurry images. Specifically, for each sharp image, both the global motion (camera shake) and local motion (object movement) are considered, with the depth information as a condition. Then, a blur creation module takes the spatially variant motion information and the sharp image as input to synthesize a motion-blurred image. A relativistic GAN loss is employed to ensure the synthesized blurry image is as realistic as possible. Experiments show that our method can generate diverse dynamic motion and visually realistic blurry images. Moreover, the generated image pairs further improve the quantitative performance and generalization ability of an existing deblurring method on several test sets.

RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization

  • Xintao Wang
  • Chao Dong
  • Ying Shan

This paper explores training efficient VGG-style super-resolution (SR) networks with the structural re-parameterization technique. The general re-parameterization pipeline is to first train networks with multi-branch topology and then merge them into standard 3x3 convolutions for efficient inference. In this work, we revisit those primary designs and investigate the essential components for re-parameterizing SR networks. First of all, we find that batch normalization (BN) is important for bringing non-linearity into training and improving the final performance. However, BN is typically omitted in SR, as it usually degrades performance and introduces unpleasant artifacts. We carefully analyze the cause of the BN issue and then propose a straightforward yet effective solution. In particular, we first train SR networks with mini-batch statistics as usual, and then switch to using population statistics in the later training period. Having successfully re-introduced BN into SR, we further design a new re-parameterizable block tailored for SR, namely RepSR. It consists of a clean residual path and two expand-and-squeeze convolution paths with the modified BN. Extensive experiments demonstrate that our simple RepSR achieves performance superior to previous SR re-parameterization methods across different model sizes. In addition, RepSR achieves a better trade-off between performance and actual running time (throughput) than previous SR methods. Codes are available at
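
The merge step of re-parameterization rests on the linearity of convolution; a minimal 1-D sketch (illustrative kernels of my own choosing, not RepSR's actual block) shows two parallel branches collapsing into one kernel:

```python
# Convolution is linear in its kernel, so parallel branches
# conv(x, w1) + conv(x, w2) equal a single branch conv(x, w1 + w2).

def conv1d(x, w):
    """'Valid' 1-D correlation of signal x with kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w1, w2 = [0.5, -1.0, 0.5], [1.0, 0.0, -1.0]

multi_branch = [a + b for a, b in zip(conv1d(x, w1), conv1d(x, w2))]
single_branch = conv1d(x, [a + b for a, b in zip(w1, w2)])
```

The same identity is what lets a multi-branch training topology collapse into a plain 3x3 convolution at inference; BN layers must first be folded into their neighboring convolutions, which switching to fixed population statistics makes well-defined.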

Rotation Invariant Transformer for Recognizing Object in UAVs

  • Shuoyi Chen
  • Mang Ye
  • Bo Du

Recognizing a target of interest from UAVs is much more challenging than existing object re-identification tasks across multiple city cameras. Images taken by UAVs usually suffer from significant size differences when generating object bounding boxes, as well as uncertain rotation variations. Existing methods are usually designed for city cameras and are incapable of handling the rotation issue in UAV scenarios. A straightforward solution is to perform image-level rotation augmentation, but it causes a loss of useful information when the image is fed to the powerful vision transformer as patches. This motivates us to simulate the rotation operation at the patch feature level, proposing a novel rotation-invariant vision transformer (RotTrans). This strategy builds on high-level features with the help of the specificity of the vision transformer structure, which enhances robustness against large rotation differences. In addition, we design an invariance constraint to establish the relationship between the original feature and the rotated features, achieving stronger rotation invariance. Tested on the latest UAV datasets, our proposed transformer greatly outperforms the current state of the art, with mAP and Rank-1 that are 5.9% and 4.8% higher than the previous best, respectively. Notably, our model also performs competitively on the person re-identification task with traditional city cameras. In particular, our solution won first place in the UAV-based person re-recognition track of the Multi-Modal Video Reasoning and Analyzing Competition held at ICCV 2021. Code is available at
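
Simulating rotation at the patch level can be pictured with a toy sketch (a 2x2 grid of scalar "patch features" and a canonicalization over 90-degree rotations; RotTrans itself operates on transformer feature maps with a learned invariance constraint): rotating the input corresponds to permuting patch positions, and pooling over all rotated orderings yields a rotation-invariant descriptor.

```python
# Rotating the image permutes the positions of its patches, so a descriptor
# pooled over the rotation group does not change when the input is rotated.

def rotate90(grid):
    """Rotate a square patch grid by 90 degrees clockwise."""
    n = len(grid)
    return [[grid[n - 1 - r][c] for r in range(n)] for c in range(n)]

def descriptor(grid):
    """Canonical (minimum) flattening over the four 90-degree rotations."""
    views, g = [], grid
    for _ in range(4):
        views.append(tuple(v for row in g for v in row))
        g = rotate90(g)
    return min(views)

grid = [[1.0, 2.0], [3.0, 4.0]]
invariant = descriptor(grid) == descriptor(rotate90(grid))
```

Working at the patch level this way avoids the interpolation loss that image-level rotation augmentation introduces before patch embedding.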

Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning

  • Feifei Shao
  • Yawei Luo
  • Ping Liu
  • Jie Chen
  • Yi Yang
  • Yulei Lu
  • Jun Xiao

The expensive annotation cost is notoriously known as the main constraint on the development of point cloud semantic segmentation techniques. Active learning methods endeavor to reduce this cost by selecting and labeling only a subset of the point clouds, yet previous attempts ignore the spatial-structural diversity of the selected samples, inducing the model to select clustered candidates with similar shapes in a local area while missing other representative ones in the global environment. In this paper, we propose a new 3D region-based active learning method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and incrementally selects the most informative and representative ones for label acquisition. We achieve the selection mechanism via a graph reasoning network that considers both the spatial and structural diversity of superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise-aware iterative labeling strategy to confront the "noisy annotation" problem introduced by the previous "dominant labeling" strategy for superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. In particular, SSDR-AL significantly outperforms the baseline method, reducing the annotation cost by up to 63.0% and 24.0% on the two benchmarks, respectively, when achieving 90% of the performance of fully supervised learning. Code is available at
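
A simple stand-in for diversity-aware selection (greedy farthest-point selection on toy 2-D points, not SSDR-AL's graph reasoning network) shows how enforcing diversity avoids picking clustered near-duplicates:

```python
# Greedy farthest-point selection: each new pick maximizes its distance to
# everything already selected, so near-duplicates of chosen samples are
# skipped in favor of globally representative ones.

def farthest_point_selection(points, k):
    """Greedily select k indices, each farthest from the current selection."""
    selected = [0]  # arbitrary seed
    while len(selected) < k:
        best_i, best_d = -1, -1.0
        for i, p in enumerate(points):
            if i in selected:
                continue
            d = min(sum((a - b) ** 2 for a, b in zip(p, points[j]))
                    for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return selected

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.1]]
picked = farthest_point_selection(points, 2)
```

With the first point seeded, the second pick jumps to the far cluster rather than to the two near-duplicates of the seed.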

Free-Lunch for Cross-Domain Few-Shot Learning: Style-Aware Episodic Training with Robust Contrastive Learning

  • Ji Zhang
  • Jingkuan Song
  • Lianli Gao
  • Hengtao Shen

Cross-Domain Few-Shot Learning (CDFSL) aims to train an adaptable model that can learn out-of-domain classes with a handful of samples. Compared to the well-studied few-shot learning problem, the difficulty of CDFSL lies in that the available training data from test tasks is not only extremely limited but also presents severe class differences from the training tasks. To tackle this challenge, we propose Style-aware Episodic Training with Robust Contrastive Learning (SET-RCL), which is motivated by the key observation that a remarkable style shift between tasks from the source and target domains plays a negative role in cross-domain generalization. SET-RCL addresses the style shift from two perspectives: 1) simulating the style distributions of unknown target domains (data perspective); and 2) learning a style-invariant representation (model perspective). Specifically, Style-aware Episodic Training (SET) focuses on manipulating the style distribution of training tasks in the source domain, such that the learned model achieves better adaptation on test tasks with domain-specific styles. To further improve cross-domain generalization under style shift, we develop Robust Contrastive Learning (RCL) to capture style-invariant and discriminative representations from the manipulated tasks. Notably, our SET-RCL is orthogonal to existing FSL approaches and can thus be adopted as a "free lunch" for boosting their CDFSL performance. Extensive experiments on nine benchmark datasets and six baseline methods demonstrate the effectiveness of our method.

ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech

  • Rongjie Huang
  • Zhou Zhao
  • Huadai Liu
  • Jinglin Liu
  • Chenye Cui
  • Yi Ren

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherited iterative sampling process hinders their deployment for text-to-speech. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, avoiding distinct quality degradation when accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target site via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting.
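
The progressive halving of sampling steps can be sketched schematically (illustrative step indices, not ProDiff's actual noise schedule or losses): each distillation round trains a student whose N/2 steps each span two consecutive teacher steps, repeated until only a few sampling iterations remain.

```python
# Schematic step-halving chain: an 8-step teacher distills into a 4-step
# student, which in turn distills into a 2-step student.

def halve_schedule(steps):
    """Keep every other step: one student step spans two teacher steps."""
    assert len(steps) % 2 == 0
    return steps[::2]

schedules = [list(range(8))]          # 8-step DDIM-style teacher
while len(schedules[-1]) > 2:
    schedules.append(halve_schedule(schedules[-1]))
```

In the actual method, each round trains the half-step model against the teacher's generated mel-spectrograms, which is what reduces the target-side variance.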

Joint Learning Content and Degradation Aware Feature for Blind Super-Resolution

  • Yifeng Zhou
  • Chuming Lin
  • Donghao Luo
  • Yong Liu
  • Ying Tai
  • Chengjie Wang
  • Mingang Chen

To achieve promising results on blind image super-resolution (SR), some attempts leverage the low-resolution (LR) images to predict the blur kernel and improve SR performance. However, these Supervised Kernel Prediction (SKP) methods are impractical because real-world blur kernels are unavailable. Although some Unsupervised Degradation Prediction (UDP) methods have been proposed to bypass this problem, the inconsistency between the degradation embedding and the SR feature is still challenging. By exploring the correlations between the degradation embedding and the SR feature, we observe that jointly learning the content- and degradation-aware feature is optimal. Based on this observation, a Content and Degradation aware SR Network dubbed CDSR is proposed. Specifically, CDSR contains three newly established modules: (1) a Lightweight Patch-based Encoder (LPE) is applied to jointly extract content and degradation features; (2) a Domain Query Attention based module (DQA) is employed to adaptively reduce the inconsistency; and (3) a Codebook-based Space Compress module (CSC) suppresses redundant information. Extensive experiments on several benchmarks demonstrate that the proposed CDSR outperforms existing UDP models and achieves competitive PSNR and SSIM performance even when compared with state-of-the-art SKP methods.

Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation

  • Wenjing Wang
  • Zhengbo Xu
  • Haofeng Huang
  • Jiaying Liu

Low-light conditions not only degrade the human visual experience but also reduce the performance of downstream machine analytics. Although many works have been designed for low-light enhancement or domain-adaptive machine analytics, the former gives little consideration to high-level vision, while the latter neglects the potential of image-level signal adjustment. How to restore underexposed images/videos from the perspective of machine vision has long been overlooked. In this paper, we are the first to propose a learnable illumination enhancement model for high-level vision. Inspired by real camera response functions, we assume that the illumination enhancement function should be a concave curve, and we propose to satisfy this concavity through a discrete integral. With the intention of adapting illumination from the perspective of machine vision without task-specific annotated data, we design an asymmetric cross-domain self-supervised training strategy. Our model architecture and training designs mutually benefit each other, forming a powerful unsupervised normal-to-low-light adaptation framework. Comprehensive experiments demonstrate that our method surpasses existing low-light enhancement and adaptation methods and shows superior generalization on various low-light vision tasks, including classification, detection, action recognition, and optical flow estimation. All of our data, code, and results will be available online upon publication of the paper.
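
The "concavity through discrete integral" idea can be sketched in a few lines (arbitrary illustrative slope values and a simple sorting construction, not the paper's learned parameterization): if the curve's discrete derivative is kept non-negative and non-increasing, its running sum is automatically an increasing concave curve.

```python
# Build a concave, monotonically increasing curve as the discrete integral
# (running sum) of non-negative slopes arranged in non-increasing order.

def concave_curve(slopes):
    """Discrete integral of non-negative, descending-sorted slopes."""
    slopes = sorted((max(s, 0.0) for s in slopes), reverse=True)
    curve, total = [0.0], 0.0
    for s in slopes:
        total += s
        curve.append(total)
    return curve

curve = concave_curve([0.2, 0.9, 0.5, 0.05, 0.35])  # e.g. raw network outputs
increasing = all(b >= a for a, b in zip(curve, curve[1:]))
second_diff = [curve[i + 2] - 2 * curve[i + 1] + curve[i]
               for i in range(len(curve) - 2)]
concave = all(d <= 1e-12 for d in second_diff)
```

Enforcing the constraint on the increments rather than on the curve itself lets an unconstrained network output still produce a valid camera-response-like enhancement function.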

Photorealistic Style Transfer via Adaptive Filtering and Channel Seperation

  • Hong Ding
  • Fei Luo
  • Caoqing Jiang
  • Gang Fu
  • Zipei Chen
  • Shenghong Hu
  • Chunxia Xiao

The problem of color and texture distortion remains unsolved in the photorealistic style transfer task. It is mainly caused by interference between color and texture during transfer. To address this problem, we propose an end-to-end network based on adaptive filtering and channel separation. Given a pair of content and reference images, we first decompose them into two structure layers through an adaptive weighted least squares filter (AWLSF), which better perceives the color structure and illumination. Then, we carry out RGB transfer in a channel-separated way on the two generated structure layers. To deal with texture in a relatively independent manner, we use a module and a subtraction operation to obtain more complete and clear content features. Finally, we merge the color structure and texture detail into the final result. We conduct solid quantitative experiments on four metrics (NIQE, AG, SSIM, and PSNR) and a user study. The experimental results demonstrate that our method produces better results than previous state-of-the-art methods, validating its effectiveness and superiority.

Recurrent Meta-Learning against Generalized Cold-start Problem in CTR Prediction

  • Junyu Chen
  • Qianqian Xu
  • Zhiyong Yang
  • Ke Ma
  • Xiaochun Cao
  • Qingming Huang

Over the last decades, great success has been achieved in accurate Click-Through-Rate (CTR) prediction models for online advertising. However, the cold-start problem, which refers to the issue that standard models can hardly draw accurate inferences for unseen users/ads, is yet to be fully understood. Most recently, some studies have been proposed to tackle this problem, but they consider only new users/ads. We argue that such new users/ads are not the only sources of cold-start. From another perspective, since users might shift their interests over time, one's recent behaviors might vary greatly from the records of long ago. In this sense, we believe that the cold-start problem should also exist along the temporal dimension. Motivated by this, we provide a generalized definition of the cold-start problem in which both new users/ads and recent behavioral data from known users are considered. To attack this problem, we propose a recursive meta-learning model with the user's behavior sequence prediction as a separate training task. Specifically, a time-series CTR model with a MAML (Model-Agnostic Meta-Learning)-like meta-learning method is proposed to make our model adapt to new tasks rapidly. Besides, we propose a parallel structure for extracting feature interactions that efficiently fuses attention mechanisms and the RNN layer. Finally, experiments on three public datasets demonstrate the effectiveness of the proposed approaches.
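
The MAML-like inner/outer loop can be illustrated on deliberately tiny 1-D tasks (toy quadratic losses, not the paper's CTR model): each task adapts the shared initialization with one inner gradient step, and the outer update differentiates through that step so the initialization becomes easy to adapt.

```python
# Tiny MAML-style sketch: task i has loss (w - t_i)^2. The inner loop takes
# one gradient step per task; the outer loop updates the shared w using the
# gradient of the post-adaptation loss, differentiated through the inner step.

def maml_step(w, targets, inner_lr=0.1, outer_lr=0.05):
    meta_grad, meta_loss = 0.0, 0.0
    for t in targets:
        w_adapted = w - inner_lr * 2.0 * (w - t)   # inner gradient step
        meta_loss += (w_adapted - t) ** 2
        # d/dw of (w_adapted - t)^2, through the inner step:
        meta_grad += 2.0 * (w_adapted - t) * (1.0 - 2.0 * inner_lr)
    return w - outer_lr * meta_grad, meta_loss / len(targets)

w, targets = 5.0, [1.0, 2.0, 3.0]
losses = []
for _ in range(30):
    w, loss = maml_step(w, targets)
    losses.append(loss)
```

The post-adaptation loss shrinks over meta-iterations, mirroring how a MAML-style CTR model would adapt rapidly to a new user's or a drifted user's recent behavior.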

Learning Projection Views for Sparse-View CT Reconstruction

  • Liutao Yang
  • Rongjun Ge
  • Shichang Feng
  • Daoqiang Zhang

Sparse-View CT (SVCT), which provides low-dose and high-speed CT imaging, plays an important role in the medical imaging area. As the number of projection views decreases, the reconstructed image suffers from severe artifacts. To this end, recent works utilize deep learning methods to improve the imaging quality of SVCT and achieve promising performance. However, these methods mainly focus on network design and modeling but overlook the importance of choosing projection views. To address this issue, this paper proposes a Projection-view LeArning Network (PLANet), which can estimate the importance of different view angles through reconstruction network training and select the projection views for high-quality image restoration. Specifically, we generate synthesized sparse-view sinograms by subsampling projections from full-view sinograms based on a learnable distribution, which is learned through reconstruction network training. Thus, important projection views can be selected to acquire sparse-view projections on imaging equipment. Furthermore, effective data augmentation is provided by the online generation of sparse-view sinograms, which improves the stability and performance of reconstruction networks. In short, our method can select the important projection views and learn high-performance reconstruction networks in one unified deep-learning framework. Comprehensive experiments show that the proposed method achieves promising results compared to state-of-the-art methods, and ablation studies further confirm the effectiveness and robustness of the proposed PLANet.

Unsupervised Textured Terrain Generation via Differentiable Rendering

  • Peichi Zhou
  • Dingbo Lu
  • Chen Li
  • Jian Zhang
  • Long Liu
  • Changbo Wang

Constructing large-scale realistic terrains using modern modeling tools is an extremely challenging task even for professional users, undermining the effectiveness of video games, virtual reality, and other applications. In this paper, we present a step towards unsupervised and realistic modeling of textured terrains from DEM and satellite imagery, built upon two-stage illumination and texture optimization via differentiable rendering. First, a differentiable renderer for satellite imagery is established based on the Lambert diffuse model, allowing inverse optimization of material and lighting parameters toward a specific objective. Second, the original illumination direction of satellite imagery is recovered by reducing the difference between the shadow distribution generated by the renderer and that of the satellite image in YCrCb color space, leveraging the abundant geometric information of the DEM. Third, we propose to generate the original texture of the shadowed region by introducing visual consistency and smoothness constraints via differentiable rendering, arriving at an end-to-end unsupervised architecture. Comprehensive experiments demonstrate the effectiveness and efficiency of our proposed method as a potential tool for virtual terrain modeling in widespread graphics applications.

MegaPortraits: One-shot Megapixel Neural Head Avatars

  • Nikita Drobyshev
  • Jenya Chelishev
  • Taras Khakhulin
  • Aleksei Ivakhnenko
  • Victor Lempitsky
  • Egor Zakharov

In this work, we advance the neural head avatar technology to the megapixel resolution while focusing on the particularly challenging task of cross-driving synthesis, i.e., when the appearance of the driving image is substantially different from the animated source image. We propose a set of new neural architectures and training methods that can leverage both medium-resolution video data and high-resolution image data to achieve the desired levels of rendered image quality and generalization to novel views and motion. We demonstrate that the suggested architectures and methods produce convincing high-resolution neural avatars, outperforming the competitors in the cross-driving scenario. Lastly, we show how a trained high-resolution neural avatar model can be distilled into a lightweight student model which runs in real-time and locks the identities of neural avatars to several dozen pre-defined source images. Real-time operation and identity lock are essential for many practical applications of head avatar systems.

Event-guided Video Clip Generation from Blurry Images

  • Xin Ding
  • Tsuyoshi Takatani
  • Zhongyuan Wang
  • Ying Fu
  • Yinqiang Zheng

Dynamic and active pixel vision sensors (DAVIS) can simultaneously produce streams of asynchronous events captured by the dynamic vision sensor (DVS) and intensity frames from the active pixel sensor (APS). Event sequences show high temporal resolution and high dynamic range, while intensity images easily suffer from motion blur due to the low frame rate of APS. In this paper, we present an end-to-end convolutional neural network based method under the local and global constraints of events to restore clear, sharp intensity frames through collaborative learning from a blurry image and its associated event streams. Specifically, we first learn a function of the relationship between the sharp intensity frame and the corresponding blurry image with its event data. Then we propose a generation module to realize it with a supervision module to constrain the restoration in the motion process. We also capture the first realistic dataset with paired blurry frame/events and sharp frames by synchronizing a DAVIS camera and a high-speed camera. Experimental results show that our method can reconstruct high-quality sharp video clips, and outperform the state-of-the-art on both simulated and real-world data.
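The blurry-frame/event relationship that the network learns to invert can be illustrated numerically with a toy version of the classical event-based double-integral idea (an assumption for illustration; the paper learns this mapping with a CNN rather than inverting it analytically): a blurry frame is the temporal average of latent sharp frames, and events record thresholded log-intensity changes, so the sharp frame is recoverable from the blurry frame plus the event stream.

```python
import numpy as np

c = 0.2                                    # event contrast threshold
T = 8                                      # latent frames per exposure
L0 = np.array([[0.5, 0.8], [0.3, 0.6]])    # latent sharp frame at t = 0
n_events = np.array([[1, -2], [0, 3]])     # net event count per pixel per step

# latent sequence: log-intensity changes by c per event
L = [L0 * np.exp(c * n_events * t) for t in range(T)]
B = np.mean(L, axis=0)                     # blurry frame = temporal average

# invert the relation: divide out the event-derived exposure factor
denom = np.mean([np.exp(c * n_events * t) for t in range(T)], axis=0)
L0_hat = B / denom
print(np.max(np.abs(L0_hat - L0)))         # ~0: sharp frame recovered exactly
```

In real data the event counts are noisy and the threshold c is only approximately known, which is why a learned model with local and global event constraints, as in the abstract, is preferable to this closed-form inversion.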

Consistency-Contrast Learning for Conceptual Coding

  • Jianhui Chang
  • Jian Zhang
  • Youmin Xu
  • Jiguo Li
  • Siwei Ma
  • Wen Gao

As an emerging compression scheme, conceptual coding usually encodes images into structural and textural representations and decodes them in a deep synthesis fashion. However, existing conceptual coding schemes ignore the structure of the deep texture representation space, making it challenging to establish efficient and faithful conceptual representations. In this paper, we first introduce contrastive learning into conceptual coding and propose Consistency-Contrast Learning (CCL), which optimizes the representation space by a consistency-contrast regularization. By modeling original and reconstructed images as "positive'' pairs and random images in a batch as "negative'' samples, CCL aims to align the texture representation space with the source image space in a relative manner. Extensive experiments on diverse datasets demonstrate that: (1) the proposed CCL achieves the best compression performance on the conceptual coding task; (2) CCL is superior to other popular regularization methods at improving reconstruction quality; (3) CCL is general and can be applied to other tasks related to representation optimization and image reconstruction, such as GAN inversion.
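The positive-pair / in-batch-negatives construction can be sketched with a minimal InfoNCE-style loss; the exact CCL formulation, temperature, and embedding spaces are assumptions here, not the paper's definition.

```python
import numpy as np

def consistency_contrast_loss(orig, recon, tau=0.1):
    """Batch contrastive regularization sketch: each reconstructed
    embedding should match its own source (positive, the diagonal)
    and repel the other images in the batch (negatives)."""
    o = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    r = recon / np.linalg.norm(recon, axis=1, keepdims=True)
    logits = r @ o.T / tau                         # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))        # diagonal = own source

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))                      # 4 source embeddings
aligned = consistency_contrast_loss(src, src + 0.01 * rng.normal(size=(4, 8)))
shuffled = consistency_contrast_loss(src, np.roll(src, 1, axis=0))
print(aligned < shuffled)  # faithful reconstructions incur a lower loss
```

The loss is minimized when each reconstruction sits next to its own source and away from the rest of the batch, which is the relative alignment of texture space with image space that the abstract describes.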

Order-aware Human Interaction Manipulation

  • Mandi Luo
  • Jie Cao
  • Ran He

The majority of current techniques for pose transfer disregard the interactions between the transferred person and surrounding instances, resulting in context inconsistency in complicated situations. To tackle this issue, we propose InterOrderNet, a novel framework for order-aware interaction learning. The proposed InterOrderNet learns the relative order along the z-axis among instances to describe instance-level occlusions. Not only does learning this order guarantee the context consistency of human pose transfer, but it also enhances its generalization to natural scenes. Additionally, we present a novel unsupervised method, named Imitative Contrastive Learning, which sidesteps the requirement for order annotations. Existing pose transfer methods can easily be integrated into the proposed InterOrderNet. Extensive experiments demonstrate that InterOrderNet enables these methods to perform interaction manipulation.

Semi-supervised Video Shadow Detection via Image-assisted Pseudo-label Generation

  • Zipei Chen
  • Xiao Lu
  • Ling Zhang
  • Chunxia Xiao

Although learning-based methods have shown their potential for image shadow detection, video shadow detection is still a challenging problem, due to the absence of a large-scale, temporally consistent annotated video shadow detection dataset. To this end, we propose a semi-supervised video shadow detection method that seeks the assistance of existing labeled image datasets to generate pseudo-labels as additional supervision signals. Specifically, we first introduce a novel image-assisted video pseudo-label generator with a spatio-temporally aligned network (STANet), which generates high-quality and temporally consistent pseudo-labels. Then, with these pseudo-labels, we propose an uncertainty-guided semi-supervised learning strategy to reduce the impact of the noise they contain. Moreover, we also design a memory-propagated long-term network (MPLNet), which produces video shadow detection results with long-term consistency in a lightweight way by using a memory mechanism. Extensive experiments on ViSha and our collected real-world video shadow detection dataset RVSD show that our approach not only achieves superior performance on the benchmark dataset but also generalizes well to more practical applications, demonstrating the effectiveness of our method.

Towards Robust Video Object Segmentation with Adaptive Object Calibration

  • Xiaohao Xu
  • Jinglu Wang
  • Xiang Ming
  • Yan Lu

In the booming video era, video segmentation attracts increasing research attention in the multimedia community. Semi-supervised video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. By neglecting object-level cues, such pixel-level approaches make the tracking vulnerable to perturbations and even unable to discriminate among similar objects. Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object to be expressive and discriminative. Accordingly, we propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness. First, we construct the object representations by applying an adaptive object proxy (AOP) aggregation method, where the proxies represent arbitrary-shaped segments at multiple levels for reference. Then, prototype masks are initially generated from the reference-target correlations based on AOP. Afterwards, such proto-masks are further calibrated through network modulation, conditioned on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks iteratively evolve to be discriminative. Extensive experiments are conducted on the standard VOS benchmarks, YouTube-VOS-18/19 and DAVIS-17. Our model achieves state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.

Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning

  • Chengming Xu
  • Chen Liu
  • Siqian Yang
  • Yabiao Wang
  • Shijie Zhang
  • Lijie Jia
  • Yanwei Fu

Positive-Unlabeled (PU) learning aims to learn a model from rare positive samples and abundant unlabeled samples. Compared with classical binary classification, PU learning is much more challenging due to the existence of many incompletely-annotated data instances: only part of the most confident positive samples are labeled, and since the evidence is insufficient to categorize the rest, many of the unlabeled data may also be positive samples. Research on this topic is particularly useful and essential for many real-world tasks with very expensive labeling costs. For example, recognition tasks in disease diagnosis, recommendation systems, and satellite image recognition may have only a few positive samples that can be annotated by experts. While this problem is receiving increasing attention, most efforts have been dedicated to the design of trustworthy risk estimators such as uPU and nnPU, or to direct knowledge distillation, e.g., Self-PU. These methods largely overlook the intrinsic hardness of some unlabeled data, which can result in sub-optimal performance as a consequence of fitting the easy noisy data while not sufficiently utilizing the hard data. In this paper, we focus on improving the commonly-used nnPU with a novel training pipeline. We highlight the intrinsic difference in hardness among samples in the dataset and the proper learning strategies for easy and hard data. Accordingly, we propose to first split the unlabeled dataset with an early-stop strategy: samples that receive inconsistent predictions from the temporary and base models are considered hard. The model then applies a noise-tolerant Jensen-Shannon divergence loss to easy data, and a dual-source consistency regularization to hard data, which includes a cross-consistency between the student and base models for low-level features and a self-consistency for high-level features and predictions.
Our method achieves much better results than existing methods on CIFAR10 and on two medical datasets, for liver cancer survival time prediction and for diagnosis of low blood pressure in pregnancy, respectively. The experimental results validate the efficacy of our proposed method.
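The noise tolerance of the Jensen-Shannon loss applied to easy data follows from its boundedness. A minimal sketch (the paper's exact weighting and label representation may differ):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    Unlike cross-entropy, it is bounded above by log 2, so a single
    mislabeled sample cannot contribute an unbounded penalty."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))   # Kullback-Leibler term
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pred = [0.9, 0.1]    # model prediction (P(positive), P(negative))
clean = [1.0, 0.0]   # label that agrees with the prediction
noisy = [0.0, 1.0]   # label that disagrees (possibly mislabeled "easy" data)
print(js_divergence(pred, clean), js_divergence(pred, noisy))
```

Because even the fully contradictory label yields a loss below log 2, gradient magnitudes on noisy easy samples stay controlled, which is the property the abstract appeals to.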

Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation

  • Jialei Xu
  • Xianming Liu
  • Yuanchao Bai
  • Junjun Jiang
  • Kaixuan Wang
  • Xiaozhi Chen
  • Xiangyang Ji

Depth map estimation from images is an important task in robotic systems. Existing methods fall into two groups: multi-view stereo and monocular depth estimation. The former requires cameras to have large overlapping areas and a sufficient baseline between cameras, while the latter, which processes each image independently, can hardly guarantee structure consistency between cameras. In this paper, we propose a novel multi-camera collaborative depth prediction method that does not require large overlapping areas while maintaining structure consistency between cameras. Specifically, we formulate depth estimation as a weighted combination of depth basis, in which the weights are updated iteratively by a refinement network driven by the proposed consistency loss. During the iterative update, the results of depth estimation are compared across cameras, and the information from overlapping areas is propagated to the whole depth maps with the help of the basis formulation. Experimental results on the DDAD and NuScenes datasets demonstrate the superior performance of our method.
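Why a basis formulation propagates overlap information to the whole map can be seen in a small numeric sketch (the exact parameterization and the learned refinement network are assumptions here): once a depth map is a weighted sum of global basis maps, correcting the weights from a handful of overlap-region pixels updates every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 16, 16, 4
basis = rng.normal(size=(K, H, W))             # K global depth basis maps
w_true = np.array([1.0, -0.5, 0.3, 2.0])       # per-camera combination weights
depth_true = np.tensordot(w_true, basis, axes=1)

# suppose only a small overlap region provides cross-camera-consistent depths
overlap = (slice(0, 4), slice(0, 4))
A = basis[:, overlap[0], overlap[1]].reshape(K, -1).T   # (16, K) constraints
b = depth_true[overlap].ravel()
w_est, *_ = np.linalg.lstsq(A, b, rcond=None)  # fit weights from overlap only

depth_est = np.tensordot(w_est, basis, axes=1)
print(np.max(np.abs(depth_est - depth_true)))  # small: full map recovered
```

In the paper this correction is performed iteratively by a network under a consistency loss rather than by a closed-form least-squares fit, but the propagation mechanism is the same: the weights are global, so local evidence constrains the entire depth map.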

Fast Hierarchical Deep Unfolding Network for Image Compressed Sensing

  • Wenxue Cui
  • Shaohui Liu
  • Debin Zhao

By integrating certain optimization solvers with deep neural networks, the deep unfolding network (DUN) has attracted much attention in recent years for image compressed sensing (CS). However, several issues remain in existing DUNs: 1) For each iteration, a simple stacked convolutional network is usually adopted, which clearly limits the expressiveness of these models. 2) Once training is completed, most hyperparameters of existing DUNs are fixed for any input content, which significantly weakens their adaptability. In this paper, by unfolding the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a novel fast hierarchical DUN, dubbed FHDUN, is proposed for image compressed sensing, in which a well-designed hierarchical unfolding architecture cooperatively explores richer contextual prior information in multi-scale spaces. To further enhance adaptability, a series of hyperparameter generation networks is developed in our framework to dynamically produce the optimal hyperparameters for the input content. Furthermore, thanks to the accelerated policy of FISTA, the newly embedded acceleration module allows the proposed FHDUN to use more than 50% fewer iterative loops than recent DUNs. Extensive CS experiments show that the proposed FHDUN outperforms existing state-of-the-art CS methods while requiring fewer iterations.
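The optimization that FHDUN unfolds is plain FISTA; one sketch iteration is shown below on a toy sparse-recovery problem. In a DUN, the soft-thresholding (proximal) step would be replaced by a learned network and the step size / threshold would come from the hyperparameter generation networks; everything here is the textbook algorithm, not the paper's architecture.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(A, y, lam=0.05, n_iter=300):
    """FISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        g = A.T @ (A @ z - y)                # gradient at the momentum point
        x_new = soft_threshold(z - g / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = x_new + (t - 1) / t_new * (x_new - x)   # Nesterov acceleration
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60))                # 30 measurements, 60 unknowns
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 0.8]       # sparse signal
x_hat = fista(A, A @ x_true)
print(np.flatnonzero(np.abs(x_hat) > 0.3))   # recovered sparse support
```

The momentum update on z is the "accelerated policy" the abstract refers to: it is what lets an unfolded network reach a given accuracy with far fewer iterations (stages) than unfolding plain ISTA.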

Restoration of User Videos Shared on Social Media

  • Hongming Luo
  • Fei Zhou
  • Kin-man Lam
  • Guoping Qiu

User videos shared on social media platforms usually suffer from degradations caused by unknown proprietary processing procedures, which means that their visual quality is poorer than that of the originals. This paper presents a new general video restoration framework for the restoration of user videos shared on social media platforms. Most deep learning-based video restoration methods perform end-to-end mapping and treat feature extraction as a black box, in the sense that what role a feature plays is often unknown. In contrast, our new method, termed Video restOration through adapTive dEgradation Sensing (VOTES), introduces the concept of a degradation feature map (DFM) to explicitly guide the video restoration process. Specifically, for each video frame, we first adaptively estimate its DFM to extract features representing the difficulty of restoring its different regions. We then feed the DFM to a convolutional neural network (CNN) to compute hierarchical degradation features that modulate an end-to-end video restoration backbone network, such that more attention is explicitly paid to areas that are potentially more difficult to restore, which in turn enhances restoration performance. We explain the design rationale of the VOTES framework and present extensive experimental results showing that the new VOTES method outperforms various state-of-the-art techniques both quantitatively and qualitatively. In addition, we contribute a large-scale real-world database of user videos shared on different social media platforms. Codes and datasets are available at

Real-time Streaming Video Denoising with Bidirectional Buffers

  • Chenyang Qi
  • Junming Chen
  • Xin Yang
  • Qifeng Chen

Video streams are delivered continuously to save the cost of storage and device memory. Real-time denoising algorithms are typically adopted on the user device to remove the noise introduced during the shooting and transmission of video streams. However, sliding-window-based methods feed multiple input frames for a single output and lack computational efficiency. Recent multi-output inference works propagate bidirectional temporal features with a parallel or recurrent framework, which either suffers from performance drops on the temporal edges of clips or cannot achieve online inference. In this paper, we propose a Bidirectional Streaming Video Denoising (BSVD) framework to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields. Bidirectional temporal fusion has been considered inapplicable to online inference in MoViNet. However, we introduce a novel Bidirectional Buffer Block as the core module of our BSVD, which makes bidirectional fusion possible during our pipeline-style inference. In addition, our method is concise and flexible enough to be utilized in both non-blind and blind video denoising. We compare our model with various state-of-the-art video denoising models qualitatively and quantitatively on synthetic and real noise. Our method outperforms previous methods in terms of restoration fidelity and runtime.
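The pipeline-style idea of trading a fixed small delay for a future receptive field can be sketched with a toy buffer. This is an illustrative assumption about the mechanism only: the real Bidirectional Buffer Block fuses learned features through a network, not raw frames through an average.

```python
from collections import deque

class BidirectionalBuffer:
    """Toy one-frame-lookahead pipeline: the output for frame t is emitted
    once frame t+1 arrives, so every output sees a past and a future
    neighbor while the stream is still processed online."""
    def __init__(self):
        self.buf = deque(maxlen=3)   # holds (prev, cur, next)

    def push(self, frame):
        self.buf.append(frame)
        if len(self.buf) == 3:
            prev, cur, nxt = self.buf
            return (prev + cur + nxt) / 3   # stand-in for the fusion network
        return None                          # pipeline still filling

buf = BidirectionalBuffer()
stream = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [y for f in stream if (y := buf.push(f)) is not None]
print(outputs)  # [2.0, 3.0, 4.0] — each output fuses past and future frames
```

The constant per-frame cost (one push, one fusion) is what distinguishes this from sliding-window methods, which would re-feed three frames for every output.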

Learning Hierarchical Dynamics with Spatial Adjacency for Image Enhancement

  • Yudong Liang
  • Bin Wang
  • Wenqi Ren
  • Jiaying Liu
  • Wenjian Wang
  • Wangmeng Zuo

In various real-world image enhancement applications, the degradations are often non-uniform or non-homogeneous and diverse, which challenges most deep networks with fixed parameters during the inference phase. Inspired by dynamic deep networks that adapt their model structures or parameters conditioned on the inputs, we propose a DCP-guided hierarchical dynamic mechanism for image enhancement that adapts the model parameters and features from local to global while keeping spatial adjacency within each region. Specifically, channel-spatial-level, structure-level, and region-level dynamic components are sequentially applied. Channel-spatial-level dynamics obtain channel- and spatial-wise representation variations, while structure-level dynamics enable modeling geometric transformations and augment sampling locations for the varying local features to better describe the structures. In addition, a novel region-level dynamic is proposed to generate spatially continuous masks for dynamic features, capitalizing on the Dark Channel Prior (DCP). The proposed region-level dynamics benefit from exploiting the statistical differences between distorted and undistorted images. Moreover, the DCP-guided region generations are inherently spatially coherent, which facilitates capturing the local coherence of images. The proposed method achieves state-of-the-art performance and generates visually pleasing images for multiple enhancement tasks, i.e., image dehazing, image deraining, and low-light image enhancement. The codes are available at

Text's Armor: Optimized Local Adversarial Perturbation Against Scene Text Editing Attacks

  • Tao Xiang
  • Hangcheng Liu
  • Shangwei Guo
  • Hantao Liu
  • Tianwei Zhang

Deep neural networks (DNNs) have shown their powerful capability in scene text editing (STE). With carefully designed DNNs, one can alter texts in a source image with other ones while maintaining their realistic look. However, such editing tools provide great convenience for criminals to falsify documents or modify texts without authorization. In this paper, we propose to actively defeat text editing attacks by designing invisible "armors" for texts in the scene. We turn the adversarial vulnerability of DNN-based STE into a strength and design local perturbations (i.e., "armors") specifically for texts using an optimized normalization strategy. Such local perturbations can effectively mislead STE attacks without affecting the perceptibility of the scene background. To strengthen our defense capabilities, we systematically analyze and model STE attacks and provide a precise defense method to defeat attacks at different editing stages. We conduct both subjective and objective experiments to show the superiority of our optimized local adversarial perturbation against state-of-the-art STE attacks. We also evaluate the portrait and landscape transferability of our perturbations.

ChartStamp: Robust Chart Embedding for Real-World Applications

  • Jiayun Fu
  • Bin B. Zhu
  • Haidong Zhang
  • Yayi Zou
  • Song Ge
  • Weiwei Cui
  • Yun Wang
  • Dongmei Zhang
  • Xiaojing Ma
  • Hai Jin

Deep learning-based image embedding methods are typically designed for natural images and may not work for chart images due to their homogeneous regions, which lack the variations needed to hide data both robustly and imperceptibly. In this paper, we propose ChartStamp, the first chart embedding method that is robust to real-world printing and displaying (printed on paper and displayed on screen, respectively, and then captured with a camera) while maintaining good perceptual quality. ChartStamp hides 100, 1,000, or 10,000 raw bits into a chart image, depending on the designated robustness to printing, displaying, or JPEG. To ensure perceptual quality, it introduces a new perceptual model to guide embedding toward insensitive regions of a chart image and a smoothness loss to ensure smoothness of the embedding residual in homogeneous regions. ChartStamp applies a distortion layer approximating the designated real-world manipulations to train a model robust to these manipulations. Our experimental evaluation indicates that ChartStamp achieves robustness and embedding capacity on chart images similar to those of its state-of-the-art counterparts on natural images. Our user studies indicate that ChartStamp achieves better perceptual quality than existing robust chart embedding methods and that our perceptual model outperforms the existing one.

Few-shot Image Generation Using Discrete Content Representation

  • Yan Hong
  • Li Niu
  • Jianfu Zhang
  • Liqing Zhang

Few-shot image generation and few-shot image translation are two related tasks, both of which aim to generate new images for an unseen category with only a few images. In this work, we make the first attempt to adapt a few-shot image translation method to the few-shot image generation task. Few-shot image translation disentangles an image into a style vector and a content map. An unseen style vector can be combined with different seen content maps to produce different images. However, it needs to store seen images to provide content maps, and the unseen style vector may be incompatible with seen content maps. To adapt it to the few-shot image generation task, we learn a compact dictionary of local content vectors by quantizing continuous content maps into discrete content maps, instead of storing seen images. Furthermore, we model the autoregressive distribution of the discrete content map conditioned on the style vector, which alleviates the incompatibility between content map and style vector. Qualitative and quantitative results on three real datasets demonstrate that our model produces images of higher diversity and fidelity for unseen categories than previous methods.
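The quantization of continuous content maps into a compact discrete dictionary can be sketched with a nearest-neighbor vector quantizer. Codebook learning, the style pathway, and the autoregressive prior from the abstract are all omitted; the dictionary here is random purely for illustration.

```python
import numpy as np

def quantize(content_map, codebook):
    """Replace each local content vector by its nearest codebook entry,
    returning the discrete index map and the quantized content map."""
    H, W, C = content_map.shape
    flat = content_map.reshape(-1, C)
    # squared distance from every cell to every codebook vector: (H*W, K)
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx.reshape(H, W), codebook[idx].reshape(H, W, C)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))      # dictionary of 8 local content vectors
# a content map whose cells are (slightly noisy) codebook entries
content = codebook[rng.integers(0, 8, size=(5, 5))]
content = content + 0.01 * rng.normal(size=content.shape)
indices, recon = quantize(content, codebook)
print(indices.shape, recon.shape)  # (5, 5) (5, 5, 4)
```

Storing the small index map plus the shared codebook, instead of seen images, is what makes the representation compact, and the discrete indices are what an autoregressive prior over content can then be fit to.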

Marior: Margin Removal and Iterative Content Rectification for Document Dewarping in the Wild

  • Jiaxin Zhang
  • Canjie Luo
  • Lianwen Jin
  • Fengjun Guo
  • Kai Ding

Camera-captured document images usually suffer from perspective and geometric deformations. Rectifying them is of great value, considering both poor visual aesthetics and the deteriorated performance of OCR systems. Recent learning-based methods intensively focus on accurately cropped document images. However, this might not be sufficient for overcoming practical challenges, which include document images either with large marginal regions or without margins: users struggle to crop documents precisely when they encounter large marginal regions, while dewarping images without margins remains a largely unsolved problem. To the best of our knowledge, there is still no complete and effective pipeline for rectifying document images in the wild. To address this issue, we propose a novel approach called Marior (Margin Removal and Iterative Content Rectification). Marior follows a progressive strategy to iteratively improve the dewarping quality and readability in a coarse-to-fine manner. Specifically, we divide the pipeline into two modules: a margin removal module (MRM) and an iterative content rectification module (ICRM). First, we predict the segmentation mask of the input image to remove the margin, thereby obtaining a preliminary result. Then we refine the image further by producing dense displacement flows to achieve content-aware rectification, determining the number of refinement iterations adaptively. Experiments demonstrate the state-of-the-art performance of our method on public benchmarks. The resources are available at for further comparison.

Image Inpainting Detection via Enriched Attentive Pattern with Near Original Image Augmentation

  • Wenhan Yang
  • Rizhao Cai
  • Alex Kot

As deep learning-based inpainting methods achieve increasingly better results, their malicious use, e.g., removing objects to report fake news or to provide fake evidence, is becoming a threat. Previous works have provided rich discussions on network architectures, even performing Neural Architecture Search to obtain the optimal model architecture. However, there is room for improvement in other aspects. In this work, we make comprehensive efforts from the data and feature perspectives. From the data perspective, as harder samples in the training data usually lead to stronger detection models, we propose near-original image augmentation, which pushes the inpainted input images closer to the original ones (without distortion and inpainting); this is shown to improve detection accuracy. From the feature perspective, we propose to extract an attentive pattern, with which the knowledge of different inpainting methods can be better exploited during the training phase. Finally, extensive experiments are conducted. In our evaluation, we consider scenarios where the inpainting masks used to generate the testing set have a distribution gap from those used to produce the training set. The comparisons are thus conducted on a newly proposed dataset in which testing masks are inconsistent with the training ones. The experimental results show the superiority of the proposed method and the effectiveness of each component. All our code and data will be made available online.

Searching Lightweight Neural Network for Image Signal Processing

  • Haojia Lin
  • Lijiang Li
  • Xiawu Zheng
  • Fei Chao
  • Rongrong Ji

Recently, it has been shown that the traditional Image Signal Processing (ISP) pipeline can be replaced by deep neural networks due to their superior performance. However, most of these networks impose a heavy computation burden and are thus far from sufficient for deployment on resource-limited platforms, including but not limited to mobile devices and FPGAs. To tackle this challenge, we propose an automated search framework that derives ISP models with high image quality while satisfying the low-computation requirement. To reduce the search cost, we adopt a weight-sharing strategy by introducing a supernet and decouple the architecture search into two stages: supernet training and hardware-aware evolutionary search. With the proposed framework, we can train the ISP model once and quickly find high-performance but low-computation models for multiple devices. Experiments demonstrate that the searched ISP models have an excellent trade-off between image quality and model complexity, i.e., they achieve compelling reconstruction quality with more than a 90% reduction in FLOPs compared to state-of-the-art networks.

Image Generation Network for Covert Transmission in Online Social Network

  • Zhengxin You
  • Qichao Ying
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang

Online social networks have stimulated communications over the Internet more than ever, making it possible to transmit secret messages over such noisy channels. In this paper, we propose a Coverless Image Steganography Network, called CIS-Net, that synthesizes a high-quality image directly conditioned on the secret message to transfer. CIS-Net is composed of four modules, namely, the Generation, Adversarial, Extraction, and Noise modules. The receiver can extract the hidden message without any loss even when the images have been distorted by JPEG compression attacks. To disguise the steganographic behaviour, we collected images in the context of profile photos and stickers and trained our network accordingly. As such, the generated images are more likely to escape malicious detection and attack. The main distinctions from previous image steganography methods are the robustness and losslessness against diverse attacks. Experiments over diverse public datasets have demonstrated its superior anti-steganalysis ability.

Augmented Dual-Contrastive Aggregation Learning for Unsupervised Visible-Infrared Person Re-Identification

  • Bin Yang
  • Mang Ye
  • Jun Chen
  • Zesen Wu

Visible-infrared person re-identification (VI-ReID) aims at retrieving the corresponding infrared (visible) images from a gallery set captured by cameras of the other spectrum. Recent works mainly focus on supervised VI-ReID methods that require plenty of cross-modality (visible-infrared) identity labels, which are more expensive than the annotations in single-modality person ReID. For unsupervised visible-infrared re-identification (USL-VI-ReID), the large cross-modality discrepancies lead to difficulties in generating reliable cross-modality labels and learning modality-invariant features without any annotations. To address this problem, we propose a novel Augmented Dual-Contrastive Aggregation (ADCA) learning framework. Specifically, a dual-path contrastive learning framework with two modality-specific memories is proposed to learn the intra-modality person representation. To associate positive cross-modality identities, we design a cross-modality memory aggregation module with count priority to select highly associated positive samples and aggregate their corresponding memory features at the cluster level, ensuring that the optimization is explicitly concentrated on the modality-irrelevant perspective. Extensive experiments demonstrate that our proposed ADCA significantly outperforms existing unsupervised methods under various settings, and even surpasses some supervised counterparts, facilitating the real-world deployment of VI-ReID. Code is available at

DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games

  • Nikhil Bansal
  • Kartik Gupta
  • Kiruthika Kannan
  • Sivani Pentapati
  • Ravi Kiran Sarvadevabhatla

Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared-goal cooperative game play in restricted communication settings. However, some players occasionally draw atypical sketch content. While such content is sometimes relevant in the game context, it can also represent a rule violation and impair the game experience. To address such situations in a timely and scalable manner, we introduce DrawMon, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to collect game session data and annotate atypical sketch content, resulting in AtyPict, the first-ever atypical sketch content dataset. We use AtyPict to train CanvasNet, a deep neural atypical content detection network, and utilize CanvasNet as a core component of DrawMon. Our analysis of post-deployment game session data indicates DrawMon's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards. Code and datasets are available at

Approximate Shifted Laplacian Reconstruction for Multiple Kernel Clustering

  • Jiali You
  • Zhenwen Ren
  • Quansen Sun
  • Yuan Sun
  • Xingfeng Li

Multiple kernel clustering (MKC) has demonstrated promising performance for handling non-linear data clustering. On the positive side, it can integrate complementary information from multiple base kernels and avoid kernel function selection. On the negative side, however, the main challenge is that the n x n kernel matrix leads to O(n^2) memory complexity and O(n^3) computational complexity. To mitigate this challenge, taking the graph Laplacian as the point of breakthrough, this paper proposes a novel and simple MKC method, dubbed approximate shifted Laplacian reconstruction (ASLR). For each base kernel, we propose an r-rank shifted Laplacian reconstruction scheme that simultaneously considers the energy loss of Laplacian reconstruction and the clustering information preserved by Laplacian decomposition. Then, by analyzing the eigenvectors of the reconstructed Laplacian, we impose constraints that confine its solution within a Fantope. Accordingly, the byproduct (i.e., the most informative eigenvectors) contains the main clustering information, such that the clustering assignments can be obtained with the simple k-means algorithm. Owing to the Laplacian reconstruction scheme, the memory and computational complexity can be reduced to O(n) and O(n^2), respectively. Experiments on eight challenging MKC benchmark datasets verify the effectiveness and efficiency of ASLR.
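The pipeline the abstract describes (shifted Laplacian, top-r eigenvectors as the informative byproduct, k-means on that embedding) can be sketched in a few lines. This is only our illustrative reading, not the authors' code; the shifted-Laplacian form I + D^{-1/2} K D^{-1/2} and the toy two-block kernel are assumptions:

```python
import numpy as np

def shifted_laplacian_embedding(K, r):
    # Shifted Laplacian L_s = I + D^{-1/2} K D^{-1/2}; its top-r eigenvectors
    # (the "r-rank" part) carry the main clustering information.
    d = K.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_s = np.eye(len(K)) + d_is[:, None] * K * d_is[None, :]
    _, vecs = np.linalg.eigh(L_s)          # eigenvalues in ascending order
    U = vecs[:, -r:]                       # keep the top-r eigenvectors
    return U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize

def two_means(X, iters=50):
    # Minimal deterministic 2-means on the spectral embedding.
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(1))]  # farthest point from X[0]
    C = np.stack([c0, c1])
    for _ in range(iters):
        lbl = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.stack([X[lbl == j].mean(0) for j in (0, 1)])
    return lbl

# Two well-separated kernel blocks should cluster cleanly.
K = np.full((10, 10), 0.01)
K[:5, :5] = 1.0
K[5:, 5:] = 1.0
labels = two_means(shifted_laplacian_embedding(K, r=2))
```

Working on the r eigenvectors instead of the full n x n matrix is what makes the downstream clustering cheap.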

Towards Continual Adaptation in Industrial Anomaly Detection

  • Wujin Li
  • Jiawei Zhan
  • Jinbao Wang
  • Bizhong Xia
  • Bin-Bin Gao
  • Jun Liu
  • Chengjie Wang
  • Feng Zheng

Anomaly detection (AD) has gained widespread attention due to its ability to identify defects in industrial scenarios using only normal samples. Although traditional AD methods achieve acceptable performance, they focus solely on the current set of examples, leading to catastrophic forgetting of previously learned tasks when trained on a new one. Given this limited flexibility and the requirements of realistic industrial scenarios, it is urgent to enhance the continual adaptation ability of AD models. Therefore, this paper proposes a unified framework that incorporates continual learning (CL) to achieve our newly designed task of continual anomaly detection (CAD). We observe that a data augmentation strategy can adapt AD methods well to supervised CL (SCL) by constructing anomaly samples. Based on this, we propose a novel method named Distribution of Normal Embeddings (DNE), which utilizes the feature distribution of normal training samples from past tasks. It not only effectively alleviates catastrophic forgetting in CAD but can also be integrated with SCL methods to further improve their performance. Extensive experiments and visualization results on the popular benchmark dataset MVTec AD demonstrate the advanced performance and excellent continual adaptation ability of our proposed method compared to other AD methods. To the best of our knowledge, we are the first to introduce and tackle the task of CAD. We believe that the proposed task and benchmark will be beneficial to the field of AD. Our code is available in the supplementary material.
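As we read the abstract, the key idea is to summarize each past task's normal embeddings by their distribution rather than replaying raw samples. A minimal sketch under that assumption (the Gaussian summary and Mahalanobis scoring are our choices, not necessarily the paper's exact formulation):

```python
import numpy as np

class NormalEmbeddingMemory:
    def __init__(self, eps=1e-3):
        self.tasks = []          # one (mean, inverse covariance) per past task
        self.eps = eps           # covariance regularizer for invertibility

    def add_task(self, feats):
        # Summarize a task's normal embeddings as a Gaussian; the raw samples
        # can then be discarded, which is what sidesteps forgetting.
        mu = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + self.eps * np.eye(feats.shape[1])
        self.tasks.append((mu, np.linalg.inv(cov)))

    def score(self, x):
        # Anomaly score: squared Mahalanobis distance to the closest task
        # distribution (low = looks like some task's normal data).
        return min(float((x - mu) @ icov @ (x - mu)) for mu, icov in self.tasks)

rng = np.random.default_rng(0)
mem = NormalEmbeddingMemory()
mem.add_task(rng.normal(0.0, 0.1, size=(200, 2)))   # task 1 normal features
mem.add_task(rng.normal(5.0, 0.1, size=(200, 2)))   # task 2 normal features
```

Because only sufficient statistics are kept per task, memory stays constant in the number of samples while every past task remains scoreable.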

Neural Network Model Protection with Piracy Identification and Tampering Localization Capability

  • Cheng Xiong
  • Guorui Feng
  • Xinran Li
  • Xinpeng Zhang
  • Chuan Qin

With the rapid development of neural networks, a vast number of models have been developed in recent years, embodying substantial manpower and hardware resources. However, the original models are at risk of being pirated by adversaries for illegal profit. On the other hand, malicious tampering with models, such as implanting vulnerabilities and backdoors, may cause catastrophic consequences. We propose a model hash generation method to protect neural network models. Specifically, our model hash sequence is composed of two parts: one is the model piracy identification hash, which is based on dynamic convolution and a dual-branch network; the other is the model tampering localization hash, which helps the model owner accurately detect the tampered locations for further recovery. Experimental results demonstrate the effectiveness of the proposed method for neural network model protection.

SDRTV-to-HDRTV via Hierarchical Dynamic Context Feature Mapping

  • Gang He
  • Kepeng Xu
  • Li Xu
  • Chang Wu
  • Ming Sun
  • Xing Wen
  • Yu-Wing Tai

In this work, we address the task of converting SDR videos to HDR videos (SDRTV-to-HDRTV). Previous approaches use global feature modulation for SDRTV-to-HDRTV conversion. Feature modulation scales and shifts features in the original feature space, which has limited mapping capability. In addition, global image mapping cannot restore detail in HDR frames due to the luminance differences across regions of SDR frames. To resolve these issues, we propose a two-stage solution. The first stage is a hierarchical dynamic context feature mapping (HDCFM) model. HDCFM learns the SDR-frame-to-HDR-frame mapping function via a hierarchical feature modulation module (HME and HM) and a dynamic context feature transformation (DYCT) module. HME estimates the feature modulation vector; HM performs hierarchical feature modulation, consisting of global feature modulation in series with local feature modulation, and can adaptively map local image features. The DYCT module constructs a feature transformation module in conjunction with the context, adaptively generating a feature transformation matrix for feature mapping. Compared with simple feature scaling and shifting, the DYCT module can map features into a new feature space and thus has greater feature mapping capability. In the second stage, we introduce a patch-discriminator-based context generation model, PDCG, to obtain subjective quality enhancement of over-exposed regions. The proposed method achieves state-of-the-art objective and subjective quality results. Specifically, HDCFM achieves a PSNR gain of 0.81 dB with about 100K parameters, 1/14th the parameter count of the previous state-of-the-art methods. The test code will be released on

Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach

  • Chen Tang
  • Haoyu Zhai
  • Kai Ouyang
  • Zhi Wang
  • Yifei Zhu
  • Wenwu Zhu

Conventional model quantization methods apply a fixed quantization scheme to all data samples, ignoring the inherent "recognition difficulty" differences between samples. We propose to feed different data samples through varying quantization schemes to achieve data-dependent dynamic inference at a fine-grained layer level. However, enabling this adaptive inference with changeable layer-wise quantization schemes is challenging because the number of combinations of bit-widths and layers grows exponentially, making it extremely difficult to train a single model over such a vast search space and use it in practice. To solve this problem, we present the Arbitrary Bit-width Network (ABN), where the bit-widths of a single deep network can change at runtime for different data samples with layer-wise granularity. Specifically, we first build a weight-shared, layer-wise quantizable "super-network" in which each layer can be allocated multiple bit-widths and thus quantized differently on demand. The super-network provides a considerably large number of combinations of bit-widths and layers, each of which can be used during inference without retraining or storing myriad models. Second, based on the well-trained super-network, each layer's runtime bit-width selection is modeled as a Markov Decision Process (MDP) and solved by an adaptive inference strategy. Experiments show that the super-network can be built without accuracy degradation, and that the bit-width allocation of each layer can be adjusted to deal with various inputs on the fly. On ImageNet classification, we achieve a 1.1% top-1 accuracy improvement while saving 36.2% of BitOps.
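The layer-wise idea, each layer quantized to a runtime-chosen bit-width, can be illustrated with a toy quantizer. The uniform symmetric scheme and the tiny MLP below are our assumptions for illustration, not the paper's super-network:

```python
import numpy as np

def quantize(w, bits):
    # Uniform symmetric quantization of a weight tensor to `bits` bits.
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(w).max() / max(levels, 1)
    return np.round(w / scale).clip(-levels, levels) * scale

def forward(x, layers, bitwidths):
    # Toy MLP whose per-layer bit-widths are chosen at runtime, per sample.
    for W, bits in zip(layers, bitwidths):
        x = np.maximum(quantize(W, bits) @ x, 0.0)  # quantized linear + ReLU
    return x

w = np.linspace(-1.0, 1.0, 101)
coarse = np.abs(quantize(w, 2) - w).max()   # 2-bit: large rounding error
fine = np.abs(quantize(w, 8) - w).max()     # 8-bit: small rounding error
y = forward(np.ones(4), [np.eye(4) * 0.5, np.eye(4) * 2.0], bitwidths=[8, 4])
```

In the actual ABN setting, the `bitwidths` list would be produced per input by the learned MDP policy rather than fixed by hand.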

Privacy-preserving Reflection Rendering for Augmented Reality

  • Yiqin Zhao
  • Sheng Wei
  • Tian Guo

When virtual objects are made of reflective materials, the lighting information required to render them can include privacy-sensitive content from outside the current camera view. In this paper, we show, for the first time, that accuracy-driven multi-view environment lighting can reveal out-of-camera scene information and compromise privacy. We present a simple yet effective privacy attack that extracts sensitive scene information such as human faces and text from rendered objects under several application scenarios.

To defend against such attacks, we develop a novel IPC2S defense and a conditional R2 defense. Our IPC2S defense, combined with a generic lighting reconstruction method, preserves the scene geometry while obfuscating the privacy-sensitive information. As a proof-of-concept, we leverage existing OCR and face detection models to identify text and human faces from past camera observations and blur the color pixels associated with detected regions. We evaluate the visual quality impact of our defense by comparing rendered virtual objects to ones rendered with a generic multi-lighting reconstruction technique, ARKit, and R2 defense. Our visual and quantitative results demonstrate that our defense leads to structurally similar reflections with up to 0.98 SSIM score across various rendering scenarios while preserving sensitive information by reducing the automatic extraction success rate to at most 8.8%.
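The proof-of-concept step above (blurring the color pixels in detector-reported text/face regions before lighting reconstruction) reduces to masking plus a local blur. A hedged numpy sketch, assuming grayscale frames and detector-supplied boxes in (y0, y1, x0, x1) form:

```python
import numpy as np

def blur_regions(img, boxes, k=3):
    # Box-blur each detected region; pixels outside the boxes are untouched.
    out = img.astype(float).copy()
    pad = k // 2
    for (y0, y1, x0, x1) in boxes:
        region = out[y0:y1, x0:x1]
        padded = np.pad(region, pad, mode='edge')
        blurred = np.zeros_like(region)
        for dy in range(k):                 # accumulate the k x k window
            for dx in range(k):
                blurred += padded[dy:dy + region.shape[0],
                                  dx:dx + region.shape[1]]
        out[y0:y1, x0:x1] = blurred / (k * k)
    return out

img = np.zeros((10, 10))
img[4, 4] = 90.0                            # a "sensitive" bright pixel
out = blur_regions(img, [(2, 7, 2, 7)], k=3)
```

Scene geometry outside the boxes is left intact, which is what keeps the rendered reflections structurally similar to the unprotected ones.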

SESSION: Oral Session VII: Multimedia Systems -- Systems and Middleware

Confederated Learning: Going Beyond Centralization

  • Zitai Wang
  • Qianqian Xu
  • Ke Ma
  • Xiaochun Cao
  • Qingming Huang

Traditional machine learning implicitly assumes that a single entity (e.g., a person or an organization) completes all the jobs of the whole learning process: data collection, algorithm design, parameter selection, and model evaluation. However, many practical scenarios require cooperation among entities, and existing paradigms fail to meet cost, privacy, or security requirements. In this paper, we consider a generalized paradigm in which different roles are granted multiple permissions to complete their corresponding jobs, called Confederated Learning. Systematic analysis shows that confederated learning generalizes both traditional machine learning and existing distributed paradigms such as federated learning. We then study an application scenario of confederated learning that could inspire future research on cooperation between different entities. Three methods are proposed as a first attempt at confederated learning under restricted conditions. Empirical results on three datasets validate the effectiveness of the proposed methods.

R-FEC: RL-based FEC Adjustment for Better QoE in WebRTC

  • Insoo Lee
  • Seyeon Kim
  • Sandesh Sathyanarayana
  • Kyungmin Bin
  • Song Chong
  • Kyunghan Lee
  • Dirk Grunwald
  • Sangtae Ha

The demand for video conferencing applications has seen explosive growth, while users still often face unsatisfactory quality of experience (QoE). Video conferencing applications adopt Forward Error Correction (FEC) as a recovery mechanism to meet tight latency requirements and overcome the packet losses prevalent in networks. However, most studies focus on video rate control while neglecting the complex interactions of this recovery mechanism with rate control and its impact on user QoE. Deciding the right amount of FEC for the current video rate under a dynamically changing network environment is not straightforward. For instance, more FEC may enhance tolerance to packet losses, but it may also increase latency due to FEC processing overhead and hurt video quality due to the additional bandwidth used for FEC. To address this issue, we propose R-FEC, a reinforcement learning (RL) based framework for video and FEC bitrate decisions in video conferencing. R-FEC aims to improve overall QoE by automatically learning from the results of past decisions and adjusting video and FEC bitrates to maximize user QoE while minimizing congestion in the network. Our experiments show that R-FEC outperforms state-of-the-art solutions in video conferencing, with up to a 27% improvement in video rate and a 6 dB PSNR improvement in video quality over the default WebRTC.
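The tradeoff the abstract describes, where more FEC improves loss tolerance but costs bandwidth and latency, is exactly what an RL reward must encode. A toy reward of that kind, where the functional form and weights are entirely our assumptions and not R-FEC's actual QoE model:

```python
def qoe_reward(video_kbps, fec_ratio, loss_rate, latency_ms,
               w_rate=1.0, w_lat=0.05, w_loss=50.0):
    # Residual loss = the part of the packet loss that FEC fails to cover
    # (a crude model: an FEC ratio r can recover up to a loss fraction r).
    residual = max(loss_rate - fec_ratio, 0.0)
    goodput = video_kbps * (1.0 - residual)
    # Reward delivered rate; penalize latency and unrecovered loss.
    return (w_rate * goodput / 1000.0
            - w_lat * latency_ms / 100.0
            - w_loss * residual)

# Under 10% loss, spending bandwidth on FEC beats spending none.
with_fec = qoe_reward(video_kbps=2000, fec_ratio=0.1, loss_rate=0.1, latency_ms=60)
no_fec = qoe_reward(video_kbps=2000, fec_ratio=0.0, loss_rate=0.1, latency_ms=60)
```

An agent maximizing such a reward learns to raise the FEC ratio only when the loss penalty outweighs the bandwidth it diverts from video.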

SESSION: Poster Session VII: Multimedia Systems -- Systems and Middleware

Physical Backdoor Attacks to Lane Detection Systems in Autonomous Driving

  • Xingshuo Han
  • Guowen Xu
  • Yuan Zhou
  • Xuehuan Yang
  • Jiwei Li
  • Tianwei Zhang

Modern autonomous vehicles adopt state-of-the-art DNN models to interpret sensor data and perceive the environment. However, DNN models are vulnerable to different types of adversarial attacks, which pose significant risks to the security and safety of vehicles and passengers. One prominent threat is the backdoor attack, where the adversary can compromise the DNN model by poisoning the training samples. Although much effort has been devoted to investigating backdoor attacks on conventional computer vision tasks, their practicality and applicability to the autonomous driving scenario are rarely explored, especially in the physical world.

In this paper, we target the lane detection system, an indispensable module for many autonomous driving tasks, e.g., navigation and lane switching. We design and realize the first physical backdoor attacks on such a system. Our attacks are comprehensively effective against different types of lane detection algorithms. Specifically, we introduce two attack methodologies (poison-annotation and clean-annotation) to generate poisoned samples. With these samples, the trained lane detection model is infected with the backdoor, which can be activated by common objects (e.g., traffic cones) to cause wrong detections, leading the vehicle to drive off the road or onto the opposite lane. Extensive evaluations on public datasets and physical autonomous vehicles demonstrate that our backdoor attacks are effective, stealthy, and robust against various defense solutions. Our code and experimental videos can be found at

Dynamic Transformer for Few-shot Instance Segmentation

  • Haochen Wang
  • Jie Liu
  • Yongtuo Liu
  • Subhransu Maji
  • Jan-Jakob Sonke
  • Efstratios Gavves

Few-shot instance segmentation aims to train an instance segmentation model that can quickly adapt to novel classes with only a few reference images. Existing methods are usually derived from standard detection models and tackle few-shot instance segmentation indirectly by conducting classification, box regression, and mask prediction on a large set of redundant proposals, followed by indispensable post-processing, e.g., Non-Maximum Suppression. Such complicated hand-crafted procedures and hyperparameters lead to degraded optimization and insufficient generalization ability. In this work, we propose an end-to-end Dynamic Transformer Network, DTN for short, to directly segment all target object instances of arbitrary categories given by reference images, removing the need for dense proposal generation and post-processing. Specifically, a small set of Dynamic Queries, conditioned on reference images, is exclusively assigned to target object instances and generates all the instance segmentation masks of the reference categories simultaneously. Moreover, a Semantic-induced Transformer Decoder is introduced to constrain the cross-attention between dynamic queries and target images to the pixels of the reference category, suppressing noisy interaction with the background and irrelevant categories. Extensive experiments are conducted on the COCO-20 dataset. The experimental results demonstrate that our proposed Dynamic Transformer Network significantly outperforms the state-of-the-art methods.

OISSR: Optical Image Stabilization Based Super Resolution on Smartphone Cameras

  • Hao Pan
  • Feitong Tan
  • Wenhao Li
  • Yi-Chao Chen
  • Guangtao Xue

Multi-frame super-resolution methods can generate high-resolution images by combining multiple captures of the same scene; however, the quality of the merged result is susceptible to degradation due to a lack of precision in image registration. In this study, we develop a robust multi-frame super-resolution method (called OISSR) for smartphone cameras with an optical image stabilizer (OIS). Acoustic injection is used to alter the readings of the built-in MEMS gyroscope, thereby controlling the lens motion in the OIS module (note that the image sensor is fixed). We employ a priori knowledge of the induced lens motion to facilitate optical flow estimation with sub-pixel accuracy, and the resulting high-precision pixel alignment vectors are utilized to merge the multiple frames into the final super-resolution image. Extensive experiments on an OISSR prototype implemented on a Xiaomi 10 Ultra demonstrate the high performance and effectiveness of the proposed system in obtaining quadruple-resolution images.

Improving Scalability, Sustainability and Availability via Workload Distribution in Edge-Cloud Gaming

  • Iryanto Jaya
  • Yusen Li
  • Wentong Cai

The recent prevalence of heterogeneous mobile and lightweight devices encourages computation to be offloaded to remote black-box systems. The same concept applies to cloud gaming, in which computer games are located and run inside remote rendering servers (RSes). While cloud gaming enables lightweight devices with sufficient input capabilities and a network connection to play desktop games, latency and cost issues become significant hindrances in practical deployments. In this paper, we present an edge-cloud gaming architecture that reduces the overall workload on RSes while increasing playerbase coverage by using edge RSes. Furthermore, we propose an allocation algorithm to assign incoming players to RSes. Our experiments show that the proposed architecture achieves higher playerbase coverage, while our allocation algorithm significantly reduces cost under both single and batch player arrival patterns.

Display of 3D Illuminations using Flying Light Specks

  • Shahram Ghandeharizadeh

This paper presents techniques to display 3D illuminations using Flying Light Specks (FLSs). Each FLS is a miniature (hundreds of micrometers) drone with one or more light sources that generate different colors and textures with adjustable brightness. It is network-enabled, with a processor and local storage. Synchronized swarms of cooperating FLSs render illuminations of virtual objects in a pre-specified 3D volume, an FLS display. We present techniques to display both static and motion illuminations. Our display techniques consider the limited flight time of an FLS on a fully charged battery and the time required to recharge the FLS battery. Moreover, our techniques assume that failure of FLSs is the norm rather than the exception. We present a hardware and a software architecture for an FLS display, along with a family of techniques to compute flight paths of FLSs for illuminations. For motion illuminations, one technique (ICF) minimizes the overall distance traveled by the FLSs significantly compared with the other techniques.

SESSION: Oral Session VIII: Multimedia Systems -- Transport and Delivery

Improving Generalization for Neural Adaptive Video Streaming via Meta Reinforcement Learning

  • Nuowen Kan
  • Yuankun Jiang
  • Chenglin Li
  • Wenrui Dai
  • Junni Zou
  • Hongkai Xiong

In this paper, we present a meta reinforcement learning (Meta-RL) based neural adaptive bitrate streaming (ABR) algorithm that can rapidly adapt its control policy to changing network throughput dynamics. Specifically, to allow rapid adaptation, we discuss the necessity of detaching the inference of throughput dynamics from the universal control mechanism that is, in essence, shared by all potential throughput dynamics in neural ABR algorithms. To meta-learn the ABR policy, we then build a model-free system framework composed of a probabilistic latent encoder that infers the underlying dynamics from the recent throughput context, and a policy network that is conditioned on the latent variable and learns to quickly adapt to new environments. Additionally, to address the difficulties of training the policy on mixed dynamics, on-policy RL (or imitation learning) algorithms are adopted for policy training, with a mutual-information-based regularization to make the latent variable more informative about the policy. Finally, we implement our algorithm's meta-training and meta-adaptation procedures under a variety of throughput dynamics. Empirical evaluations on different QoE metrics and multiple datasets containing real-world network traces demonstrate that our algorithm outperforms state-of-the-art ABR algorithms in average chunk QoE, consistency, and fast adaptation across a wide range of throughput patterns.

DAO: Dynamic Adaptive Offloading for Video Analytics

  • Taslim Murad
  • Anh Nguyen
  • Zhisheng Yan

Offloading videos from end devices to edge or cloud servers is the key to enabling computation-intensive video analytics. To ensure analytics accuracy at the server, the video quality for offloading must be configured based on the specific content and the available network bandwidth. While adaptive video streaming for human viewing has been widely studied, none of the existing works can guarantee analytics accuracy at the server in a bandwidth- and content-adaptive way. To fill this gap, this paper presents DAO, a dynamic adaptive offloading framework for video analytics that jointly considers the dynamics of network bandwidth and video content. DAO maximizes analytics accuracy at the server by adapting the video bitrate and resolution dynamically. In essence, we shift the context of adaptive video transport from traditional DASH systems to a new dynamic adaptive offloading framework tailored for video analytics. DAO is empowered by new findings about the inherent relationship between analytics accuracy, video content, bitrate, and resolution, as well as by an optimization formulation that adapts the bitrate and resolution dynamically. Results from a real-world implementation of object detection tasks show that DAO's performance is close to the theoretical bound, achieving 20% bandwidth savings and a 59% category-wise mAP improvement compared to conventional DASH schemes.
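At its core, the adaptation step picks the bitrate/resolution configuration with the highest predicted accuracy that fits the current bandwidth. A minimal sketch, where the accuracy profile table is a made-up stand-in for the learned content-dependent relationships the abstract mentions:

```python
def choose_config(configs, bandwidth_kbps):
    # Feasible = configurations whose bitrate fits the measured bandwidth.
    feasible = [c for c in configs if c["bitrate_kbps"] <= bandwidth_kbps]
    if not feasible:
        # Nothing fits: degrade gracefully to the cheapest configuration.
        return min(configs, key=lambda c: c["bitrate_kbps"])
    # Otherwise pick the feasible config with the best predicted accuracy.
    return max(feasible, key=lambda c: c["accuracy"])

# Hypothetical accuracy profile for one video segment.
profile = [
    {"bitrate_kbps": 500,  "resolution": "480p",  "accuracy": 0.61},
    {"bitrate_kbps": 1500, "resolution": "720p",  "accuracy": 0.78},
    {"bitrate_kbps": 4000, "resolution": "1080p", "accuracy": 0.83},
]
```

In a content-adaptive system the `accuracy` entries would be re-estimated per segment, since the same bitrate yields different detection accuracy on different content.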

AggCast: Practical Cost-effective Scheduling for Large-scale Cloud-edge Crowdsourced Live Streaming

  • Rui-Xiao Zhang
  • Changpeng Yang
  • Xiaochan Wang
  • Tianchi Huang
  • Chenglei Wu
  • Jiangchuan Liu
  • Lifeng Sun

Conventional wisdom holds that, to improve viewer engagement, cloud-edge providers should serve viewers from the nearest edge nodes. However, we show that doing so for crowdsourced live streaming (CLS) services can introduce significant cost inefficiency. We observe that the massive number of channels has greatly burdened the operating expenditure of cloud-edge providers, and, most importantly, the unbalanced viewer distribution makes edge nodes suffer significant cost inefficiency. To tackle these concerns, we propose AggCast, a novel CLS scheduling framework that optimizes edge node utilization for the cloud-edge provider. The core idea of AggCast is to aggregate viewers who are initially scattered across different regions and assign them to fewer pre-selected nodes, thereby reducing bandwidth costs. In particular, by leveraging insights obtained from our large-scale measurement, AggCast not only ensures quality of service (QoS) but also satisfies the systematic requirements of CLS services. AggCast has been A/B tested and fully deployed at a top cloud-edge provider in China for over eight months. Online and trace-driven experiments show that, compared to common practice, AggCast saves over 15% of back-to-source (BTS) bandwidth costs while having no negative impact on QoS.
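Aggregating scattered viewers onto fewer pre-selected nodes is essentially a bin-packing problem. A first-fit-decreasing sketch of that idea (the capacity units and greedy policy are our simplifications; the real scheduler also accounts for QoS and CLS-specific constraints):

```python
def aggregate(viewers_per_region, capacity):
    # First-fit decreasing: largest regions first, packed onto open nodes.
    # Assumes each region's viewer count fits within one node's capacity.
    nodes = []                       # remaining capacity per opened node
    assignment = {}
    for region, n in sorted(viewers_per_region.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(nodes):
            if free >= n:            # fits on an already-open node
                nodes[i] -= n
                assignment[region] = i
                break
        else:                        # no open node fits: open a new one
            nodes.append(capacity - n)
            assignment[region] = len(nodes) - 1
    return assignment, len(nodes)

# Four regions packed onto two nodes instead of four nearest ones.
assignment, n_nodes = aggregate({"a": 3, "b": 3, "c": 2, "d": 2}, capacity=5)
```

Each opened node fetches the channel from the source once, so fewer opened nodes directly translates into less back-to-source bandwidth.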

AdaMask: Enabling Machine-Centric Video Streaming with Adaptive Frame Masking for DNN Inference Offloading

  • Shengzhong Liu
  • Tianshi Wang
  • Jinyang Li
  • Dachun Sun
  • Mani Srivastava
  • Tarek Abdelzaher

This paper presents AdaMask, a machine-centric video streaming framework for remote deep neural network (DNN) inference. The objective is to optimize the accuracy of downstream DNNs, offloaded to a remote machine, by adaptively changing video compression control knobs at runtime. Our main contributions are twofold. First, we propose frame masking as an effective mechanism to reduce the bandwidth consumption of a video stream by preserving only the regions that potentially contain objects of interest. Second, we design a new adaptation algorithm that achieves a Pareto-optimal tradeoff between accuracy and bandwidth by controlling the masked portions of frames together with conventional H.264 control knobs (e.g., resolution). In extensive evaluations on three sensing scenarios (dash camera, traffic surveillance, and drone), frame masking saves up to 65% of bandwidth with <1% accuracy degradation, and AdaMask improves accuracy by up to 14% over the baselines under network dynamics.
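Frame masking itself is simple to illustrate: keep the regions that may contain objects of interest and flatten everything else to a constant, which a standard codec then compresses almost for free. A numpy sketch under those assumptions (single-channel frames, ROI boxes given as (y0, y1, x0, x1)):

```python
import numpy as np

def mask_frame(frame, rois, background=0):
    # Start from a constant background; copy back only the regions of
    # interest. Large uniform areas cost almost no bits in H.264-style coding.
    out = np.full_like(frame, background)
    for (y0, y1, x0, x1) in rois:
        out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return out

frame = np.arange(64, dtype=float).reshape(8, 8)
masked = mask_frame(frame, [(0, 2, 0, 2)])      # keep one 2x2 region
```

The adaptation algorithm then tunes how aggressively to mask (alongside resolution and other knobs) so that detector accuracy on the preserved regions stays high while bandwidth shrinks.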

SESSION: Poster Session VIII: Multimedia Systems -- Transport and Delivery

Learning-Based Video Coding with Joint Deep Compression and Enhancement

  • Tiesong Zhao
  • Weize Feng
  • HongJi Zeng
  • Yiwen Xu
  • Yuzhen Niu
  • Jiaying Liu

End-to-end learning-based video coding has attracted substantial attention by compressing video signals as stacked visual features. This paper proposes an end-to-end deep video codec with jointly optimized compression and enhancement modules (JCEVC). First, we propose a dual-path generative adversarial network (DPEG) to reconstruct video details after compression: an α-path and a β-path concurrently reconstruct the structure information and local textures. Second, we reuse the DPEG network in both the motion compensation and quality enhancement modules, which are combined with other necessary modules to form our JCEVC framework. Third, we employ joint training of deep video compression and enhancement, which further improves the rate-distortion (RD) performance. Compared with the x265 LDP very fast mode, JCEVC reduces the average bits-per-pixel (bpp) by 39.39%/54.92% at the same PSNR/MS-SSIM, outperforming state-of-the-art deep video codecs by a considerable margin. Source code is available at:

Structure-Preserving Motion Estimation for Learned Video Compression

  • Han Gao
  • Jinzhong Cui
  • Mao Ye
  • Shuai Li
  • Yu Zhao
  • Xiatian Zhu

Following the conventional hybrid video coding framework, existing learned video compression methods rely on the decoded previous frame as the reference for motion estimation, since it is available to the decoder. Despite the strong representation capability of CNNs, however, we find this strategy suboptimal for two reasons: (1) motion estimation based on the decoded (often distorted) frame damages both the spatial structure of the inferred motion information and the corresponding residual for each frame, making them difficult to encode spatially on a whole-image basis using CNNs; (2) it breaks consistency across frames, since the estimated motion information no longer matches the movement in the original video due to the distortion in the decoded video, lowering overall temporal coding efficiency. To overcome these problems, we propose a novel asymmetric Structure-Preserving Motion Estimation (SPME) method that fully exploits the otherwise ignored original previous frame at the encoder side while complying with the decoded previous frame at the decoder side. Concretely, SPME estimates a spatially structure-preserving and temporally consistent motion field by aggregating the motion predictions of both the original and the decoded reference frames w.r.t. the current frame. Critically, our method can be universally applied to existing feature-prediction-based video compression methods. Extensive experiments on several standard test datasets show that SPME can significantly enhance the state-of-the-art methods.

Learned Internet Congestion Control for Short Video Uploading

  • Tianchi Huang
  • Chao Zhou
  • Lianchen Jia
  • Rui-Xiao Zhang
  • Lifeng Sun

Short video uploading services have become increasingly important, as at least 30 million videos are uploaded per day. However, we find that existing congestion control (CC) algorithms, whether heuristic or learning-based, are not well suited to video uploading: they lack the fundamental mechanisms required and fall short of leveraging network modeling. We present DuGu, a novel learning-based CC algorithm designed around the unique properties of video uploading, via a probing phase, and of Internet transmission, via a control phase. During the probing phase, DuGu leverages the transmission gaps between uploads of short videos to actively measure network metrics and better understand network dynamics. During the control phase, DuGu uses a neural network (NN) to avoid congestion. Here, instead of using handcrafted reward functions, the NN is trained by imitating the expert policy given by an optimal solver, improving both performance and learning efficiency. To build this system, we construct an omniscient-like network emulator, implement an optimal solver, and collect a large corpus of real-world network traces to learn expert strategies. Trace-driven and real-world A/B tests reveal that DuGu supports multiple objectives and rivals or outperforms existing CC algorithms across all considered scenarios.

PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification

  • Wenhao Tang
  • Sheng Huang
  • Xiaoxian Zhang
  • Luwen Huangfu

Automatic pavement distress classification facilitates improving the efficiency of pavement maintenance and reducing the cost of labor and resources. A recently influential branch of this task divides the pavement image into patches and infers the patch labels for addressing these issues from the perspective of multi-instance learning. However, these methods neglect the correlation between patches and suffer from a low efficiency in the model optimization and inference. As a representative approach of vision Transformer, Swin Transformer is able to address both of these issues. It first provides a succinct and efficient framework for encoding the divided patches as visual tokens, then employs self-attention to model their relations. Built upon Swin Transformer, we present a novel vision Transformer named Pavement Image Classification Transformer (PicT) for pavement distress classification. In order to better exploit the discriminative information of pavement images at the patch level, the Patch Labeling Teacher is proposed to leverage a teacher model to dynamically generate pseudo labels of patches from image labels during each iteration, and guides the model to learn the discriminative features of patches via patch label inference in a weakly supervised manner. The broad classification head of Swin Transformer may dilute the discriminative features of distressed patches in the feature aggregation step due to the small distressed area ratio of the pavement image. To overcome this drawback, we present a Patch Refiner to cluster patches into different groups and only select the highest distress-risk group to yield a slim head for the final image classification. We evaluate our method on a large-scale bituminous pavement distress dataset named CQU-BPDD. 
Extensive results demonstrate the superiority of our method over baselines and show that PicT outperforms the second-best performing model by a large margin of +2.4% in P@R on the detection task and +3.9% in F1 on the recognition task, with 1.8x the throughput and 7x faster training speed on the same computing resources. Our codes and models have been released on

Rate-Distortion-Guided Learning Approach with Cross-Projection Information for V-PCC Fast CU Decision

  • Hang Yuan
  • Wei Gao
  • Ge Li
  • Zhu Li

In video-based point cloud compression (V-PCC), a 3D dynamic point cloud sequence is projected into 2D sequences, which are compressed with a mature 2D video encoder. Notably, the encoding of the attribute sequence is extremely time-consuming, and applicable fast algorithms are still lacking because of the unique video content and coding structure of V-PCC. This paper proposes a novel rate-distortion-guided fast attribute coding unit (CU) partitioning approach with cross-projection information for V-PCC all-intra (AI) coding. By analyzing how effectively cross-projection information guides attribute CU partitioning, we first propose to combine occupancy, geometry, and attribute features for the CU division decision. Afterward, considering that different CUs have different rate-distortion costs and that inaccurate predictions for different CUs affect coding performance differently, we devise a rate-distortion-guided learning approach to reduce the coding loss caused by mispredicted CU partitions. Moreover, we carefully design an overall decision framework for CU partitioning in the V-PCC AI coding structure. Experimental results confirm the advantages of our approach: coding time is reduced by 62.41%, while the End-to-End BD-TotalRate loss is only 0.27%. To the best of our knowledge, the proposed fast attribute CU decision approach achieves state-of-the-art performance in V-PCC AI coding.

Evaluating the Impact of Tiled User-Adaptive Real-Time Point Cloud Streaming on VR Remote Communication

  • Shishir Subramanyam
  • Irene Viola
  • Jack Jansen
  • Evangelos Alexiou
  • Alan Hanjalic
  • Pablo Cesar

Remote communication has rapidly become a part of everyday life in both professional and personal contexts. However, popular video conferencing applications present limitations in terms of quality of communication, immersion and social meaning. VR remote communication applications offer a greater sense of co-presence and mutual sensing of emotions between remote users. Previous research on these applications has shown that realistic point cloud user reconstructions offer better immersion and communication as compared to synthetic user avatars. However, photorealistic point clouds require a large volume of data per frame and are challenging to transmit over bandwidth-limited networks. Recent research has demonstrated significant improvements to perceived quality by optimizing the usage of bandwidth based on the position and orientation of the user's viewport with user-adaptive streaming. In this work, we developed a real-time VR communication application with an adaptation engine that features tiled user-adaptive streaming based on user behaviour. The application also supports traditional network adaptive streaming. The contribution of this work is to evaluate the impact of tiled user-adaptive streaming on quality of communication, visual quality, system performance and task completion in a functional live VR remote communication system. We performed a subjective evaluation with 33 users to compare the different streaming conditions with a neck exercise training task. As a baseline, we use uncompressed streaming requiring approximately 300 megabits per second and our solution achieves similar visual quality with tiled adaptive streaming at 14 megabits per second. We also demonstrate statistically significant gains in the quality of interaction and improvements to system performance and CPU consumption with tiled adaptive streaming as compared to the more traditional network adaptive streaming.
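
The core of tiled user-adaptive streaming is spending the bit budget where the viewer is likely to look. A minimal greedy allocator along those lines is sketched below; the quality levels, per-tile costs, and utility model are illustrative assumptions, not the paper's actual adaptation engine.

```python
def allocate_tile_bitrates(view_probs, budget, levels=(0.5, 2.0, 8.0)):
    """Greedy utility-per-bit allocation: every tile starts at the
    lowest quality level; repeatedly upgrade the tile whose
    (view probability / extra bits) ratio is largest, while the
    bandwidth budget allows."""
    n = len(view_probs)
    choice = [0] * n          # index into `levels` per tile
    spent = levels[0] * n
    if spent > budget:
        raise ValueError("budget below minimum quality")
    while True:
        best, best_gain = None, 0.0
        for i in range(n):
            if choice[i] + 1 < len(levels):
                cost = levels[choice[i] + 1] - levels[choice[i]]
                if spent + cost <= budget:
                    gain = view_probs[i] / cost
                    if gain > best_gain:
                        best, best_gain = i, gain
        if best is None:
            break
        spent += levels[choice[best] + 1] - levels[choice[best]]
        choice[best] += 1
    return choice
```

For example, with view probabilities [0.9, 0.1, 0.0] and a budget of 5.0 over levels costing 0.5/2.0/8.0, the allocator upgrades the likely-viewport tile first and leaves the never-viewed tile at the base layer.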

Prism: Handling Packet Loss for Ultra-low Latency Video

  • Devdeep Ray
  • Vicente Bobadilla Riquelme
  • Srinivasan Seshan

Real-time interactive video streaming applications like cloud-based video games, AR, and VR require high quality video streams and extremely low end-to-end interaction delays. These requirements cause the QoE to be extremely sensitive to packet losses. Due to the inter-dependency between compressed frames, packet losses stall the video decode pipeline until the lost packets are retransmitted (resulting in stutters and higher delays), or the decoder state is reset using IDR-frames (lower video quality for given bandwidth). Prism is a hybrid predictive-reactive packet loss recovery scheme that uses a split-stream video coding technique to meet the needs of ultra-low latency video streaming applications. Prism's approach enables aggressive loss prediction, rapid loss recovery, and high video quality post-recovery, with zero overhead during normal operation - avoiding the pitfalls of existing approaches. Our evaluation on real video game footage shows that Prism reduces the penalty of using I-frames for recovery by 81%, while achieving 30% lower delay than pure retransmission-based recovery.

Exploring Spherical Autoencoder for Spherical Video Content Processing

  • Jin Zhou
  • Na Li
  • Yao Liu
  • Shuochao Yao
  • Songqing Chen

3D spherical content is increasingly presented in various applications (e.g., AR/MR/VR) to offer users a more immersive experience, yet processing such spherical 3D content today still mainly relies on traditional 2D approaches applied after projection, leading to distortion and/or loss of critical information. This study sets out to explore methods to process spherical 3D content directly and more effectively. Using 360-degree videos as an example, we propose a novel approach called the Spherical Autoencoder (SAE) for spherical video processing. Instead of projecting to a 2D space, SAE represents the 360-degree video content as a spherical object and performs encoding and decoding on the 360-degree video directly. Furthermore, to support the adoption of SAE on pervasive mobile devices that often have resource constraints, we propose two optimizations on top of SAE. First, since FoV (Field of View) prediction is widely studied and leveraged to transport only a portion of the content to the mobile device, saving bandwidth and battery, we design p-SAE, an SAE scheme with partial-view support that can utilize such FoV prediction. Second, since machine learning models are often compressed when running on mobile devices to reduce the processing load, which usually degrades the output (e.g., video quality in SAE), we propose c-SAE, which applies compressive sensing theory within SAE to maintain video quality when the model is compressed. Our extensive experiments show that directly incorporating and processing spherical signals is promising, and it outperforms the traditional approaches by a large margin. Both p-SAE and c-SAE are effective in delivering high-quality videos (e.g., in PSNR) whether used alone or combined with model compression.

Sophon: Super-Resolution Enhanced 360° Video Streaming with Visual Saliency-aware Prefetch

  • Jianxin Shi
  • Lingjun Pu
  • Xinjing Yuan
  • Qianyun Gong
  • Jingdong Xu

360° video streaming requires ultra-high bandwidth to provide an excellent immersive experience. Traditional viewport-aware streaming methods are theoretically effective but unreliable in practice due to the adverse effects of time-varying available bandwidth on the small playback buffer. To this end, we consider the complementarity between the large-buffer approach and the viewport-aware strategy for 360° video streaming. In this work, we present Sophon, a buffer-based and neural-enhanced streaming framework that exploits a double-buffer design, super-resolution, and a viewport-aware strategy to improve user experience. Furthermore, we propose two well-suited ideas, visual saliency-aware prefetch and a super-resolution model selection scheme, to address the challenges of insufficient computing resources and dynamic user preferences. Correspondingly, we introduce a prefetch metric and a model selection metric, and develop a lightweight buffer occupancy-based prefetch algorithm and a deep reinforcement learning method to trade off bandwidth consumption, computing resource utilization, and content quality enhancement. We implement a prototype of Sophon, and extensive evaluations corroborate its superior performance over state-of-the-art works.

Error Concealment of Dynamic 3D Point Cloud Streaming

  • Tzu-Kuan Hung
  • I-Chun Huang
  • Samuel Rhys Cox
  • Wei Tsang Ooi
  • Cheng-Hsin Hsu

The recently standardized MPEG Video-based Point Cloud Compression (V-PCC) codec has shown promise in achieving a good rate-distortion trade-off for dynamic 3D point cloud compression. Current error concealment methods of V-PCC, however, lead to significantly distorted 3D point cloud frames under imperfect network conditions. To address this problem, we propose a general framework for concealing 3D point cloud frames distorted or lost due to packet loss. We also design, implement, and evaluate a suite of tools for each stage of our framework, which can be combined into multiple variants of error concealment algorithms. We conduct extensive experiments using seven dynamic 3D point cloud sequences with diverse characteristics to understand the strengths and limitations of our proposed error concealment algorithms. Our experiment results show that our algorithms outperform: (i) the method employed by V-PCC by at least 3.58 dB in Geometry Peak Signal-to-Noise Ratio (GPSNR) and 10.68 in Video Multi-Method Assessment Fusion (VMAF), and (ii) the point cloud frame copy method by up to 5.8 dB in (3D) GPSNR and 12.0 in (2D) VMAF. Further, the proposed error concealment framework and algorithms work in the 3D domain, and are thus codec-agnostic and applicable to future point cloud compression standards.

Personalized 360-Degree Video Streaming: A Meta-Learning Approach

  • Yiyun Lu
  • Yifei Zhu
  • Zhi Wang

Over the past decades, 360-degree videos have attracted wide interest for the immersive experience they bring to viewers. The rise of high-resolution 360-degree videos greatly challenges traditional video streaming systems in limited network environments. Given the limited bandwidth, tile-based video streaming with adaptive bitrate selection has been widely studied to improve viewers' Quality of Experience (QoE) by tiling the video frames and allocating different bitrates to tiles inside and outside viewers' viewports. Existing solutions for viewport prediction and bitrate selection train general models without catering to the intrinsic need for personalization. In this paper, we present the first meta-learning-based personalized 360-degree video streaming framework. The commonality among viewers with different viewing patterns and QoE preferences is captured by efficient meta-network designs. Specifically, we design a meta-based long short-term memory model for viewport prediction and a meta-based reinforcement learning model for bitrate selection. Extensive experiments on real-world datasets demonstrate that our framework not only outperforms state-of-the-art data-driven approaches in prediction accuracy by 11% on average and improves QoE by 27% on average, but also quickly adapts to users with new preferences, requiring on average 67%-88% fewer training epochs.
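
The meta-learning recipe this line of work builds on (learn an initialization that adapts to a new user in a few gradient steps) can be shown with a first-order MAML sketch on a scalar model. The toy user tasks, learning rates, and scalar model are assumptions for illustration, not the paper's LSTM or reinforcement learning designs.

```python
def inner_update(theta, data, lr=0.1):
    # One gradient step of mean squared error on a user's own samples.
    g = sum(2 * (theta - y) for y in data) / len(data)
    return theta - lr * g

def maml(meta_tasks, meta_lr=0.05, steps=300):
    """First-order MAML on a scalar model f(x) = theta: each task is a
    (support, query) pair; the meta-gradient is taken at the adapted
    parameters, ignoring second-order terms."""
    theta = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for support, query in meta_tasks:
            adapted = inner_update(theta, support)
            meta_grad += sum(2 * (adapted - y) for y in query) / len(query)
        theta -= meta_lr * meta_grad / len(meta_tasks)
    return theta
```

With two user tasks centered at 2 and 6, the learned initialization settles between them (at 4 for this symmetric toy case), so a single inner step moves the model toward either user's preference.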

SESSION: Oral Session IX: Multimedia Systems -- Data Systems Management and Indexing

InDiD: Instant Disorder Detection via a Principled Neural Network

  • Evgenia Romanenkova
  • Alexander Stepikin
  • Matvey Morozov
  • Alexey Zaytsev

For sequential data, a change point is a moment of abrupt regime switch in a data stream. Such changes appear in different scenarios, from simpler sensor data to more challenging video surveillance data, and we need to detect disorders as fast as possible. Classic approaches to change point detection (CPD) may underperform on semi-structured sequential data because they cannot process its structure without a proper representation. We propose a principled loss function that balances change detection delay against time to a false alarm. It approximates classic rigorous solutions but is differentiable and allows representation learning for deep models. We consider synthetic sequences, real-world sensor data, and videos with change points. We carefully labelled the available video data with change point moments and release it for the first time. Experiments suggest that complex data require meaningful representations tailored to the specificity of the CPD task, and our approach provides them, outperforming the considered baselines. For example, for explosion detection in video, the F1 score of our method is 0.53, compared to baseline scores of 0.31 and 0.35.
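
The trade-off that the loss encodes (detection delay versus time to a false alarm) can be written as a tiny differentiable surrogate over per-step change probabilities. This is a hedged sketch in the spirit of the paper, not the actual InDiD loss; `probs`, `tau`, and `alpha` are illustrative.

```python
def cpd_loss(probs, tau, alpha=1.0):
    """Toy differentiable CPD surrogate: probs[t] is the model's
    predicted change probability at step t, tau the true change point.
    Mass placed before tau is charged as a false alarm; mass after tau
    is charged in proportion to how late it fires (detection delay)."""
    false_alarm = sum(probs[t] for t in range(tau))
    delay = sum((t - tau) * probs[t] for t in range(tau, len(probs)))
    return alpha * false_alarm + delay
```

An alarm exactly at the true change point costs 0; an alarm one step late costs one unit of delay; an alarm before the change is charged as a false alarm weighted by alpha. Being a smooth function of the probabilities, it can backpropagate into a representation-learning model.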

Maze: A Cost-Efficient Video Deduplication System at Web-scale

  • An Qin
  • Mengbai Xiao
  • Ben Huang
  • Xiaodong Zhang

With the advancement and dominance of Internet video services, content-based video deduplication has become an essential and heavily relied-upon piece of infrastructure for them. However, the explosively growing video data on the Internet challenges the system's design and implementation for scalability in several ways. (1) Although quantization-based indexing techniques are effective for searching visual features at a large scale, a costly re-training over the complete dataset must be done periodically. (2) The high-dimensional vectors for visual features demand increasingly large SSD space, degrading I/O performance. (3) Videos crawled from the Internet are diverse, and visually similar videos are not necessarily duplicates, increasing deduplication complexity. (4) Most videos are edited, so duplicate content is more likely to be discovered as clips inside the videos, demanding processing techniques with close attention to detail.

To address the above-mentioned issues, we propose Maze, a full-fledged video deduplication system. Maze has an ANNS layer that indexes and searches the high-dimensional feature vectors. The architecture of the ANNS layer supports efficient reads and writes and eliminates the data migration caused by re-training. Maze adopts a CNN-based feature and the ORB feature as its visual features, optimized for the specific video deduplication task. The features are compact and fully reside in memory. Acoustic features are also incorporated in Maze so that visually similar videos with different audio tracks can be distinguished. A clip-based matching algorithm is developed to discover duplicate content at a fine granularity. Maze has been deployed as a production system for two years. It has indexed 1.3 billion videos and is indexing ~800 thousand videos per day. For the ANNS layer, the average read latency is 4 seconds and the average write latency is at most 4.84 seconds. Re-training over the complete dataset is no longer required no matter how much new data is added, eliminating costly data migration between nodes. Maze recognizes duplicate live streaming videos with both similar appearance and similar audio at a recall of 98%. Most importantly, Maze is also cost-effective: for example, the compact feature design saves 5,800 SSDs, and the computation resources devoted to running the whole system decrease to 250K standard cores per billion videos.
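
Clip-based matching at a fine granularity can be pictured as sliding a short window over two per-frame signature sequences and flagging reference offsets where most frame signatures agree. The signature type, window length, and agreement threshold below are illustrative assumptions, not Maze's production algorithm.

```python
def clip_match(query, reference, clip_len=3, thresh=2):
    """Report each offset in `reference` where some clip_len-frame
    window of `query` matches at least `thresh` frame signatures
    position-by-position -- a toy stand-in for clip-level duplicate
    discovery inside longer, edited videos."""
    hits = []
    for i in range(len(reference) - clip_len + 1):
        for j in range(len(query) - clip_len + 1):
            matches = sum(
                reference[i + k] == query[j + k] for k in range(clip_len))
            if matches >= thresh:
                hits.append(i)
                break
    return hits
```

For instance, a short query clip "xcdez" embedded with edits inside reference "abcdefg" is localized at the overlapping reference offsets even though no window matches exactly.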

SESSION: Poster Session IX: Multimedia Systems -- Data Systems Management and Indexing

HyP2 Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval

  • Chengyin Xu
  • Zenghao Chai
  • Zhengzhuo Xu
  • Chun Yuan
  • Yanbo Fan
  • Jue Wang

Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low-storage and efficient retrieval. In this paper, we carry out an in-depth investigation of metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where pair losses suffer from high computational overhead and convergence difficulty, while proxy losses are theoretically incapable of expressing profound label dependencies and exhibit conflicts in the constructed hypersphere space. To address these problems, we propose a novel metric learning framework with a Hybrid Proxy-Pair Loss (HyP$^2$ Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space via learnable proxies and excavating data-to-data correlations of irrelevant pairs, integrating the sufficient data correspondence of pair-based methods with the high efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks confirm that the proposed method outperforms the state-of-the-art, is robust across different hash-bit lengths, and achieves significant performance gains with faster, more stable convergence. Our code is available at

Online Deep Learning from Doubly-Streaming Data

  • Heng Lian
  • John Scovi Atwood
  • Bo-Jian Hou
  • Jian Wu
  • Yi He

This paper investigates a new online learning problem with doubly-streaming data, where the data streams are described by feature spaces that constantly evolve, with new features emerging and old features fading away. A plausible idea for dealing with such data streams is to establish a relationship between the old and new feature spaces, so that an online learner can leverage the knowledge learned from the old features to improve learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional multimedia data with complex feature interplay, which suffers from a tradeoff between onlineness, which favors shallow learners, and expressiveness, which requires deep models. Motivated by this, we propose a novel OLD3S paradigm, in which a shared latent subspace is discovered to summarize information from the old and new feature spaces, building an intermediate feature mapping relationship. A key trait of OLD3S is to treat the model capacity as learnable semantics, yielding optimal model depth and parameters jointly, in accordance with the complexity and non-linearity of the input data streams, in an online fashion. Both theoretical analysis and empirical studies substantiate the viability and effectiveness of our proposed approach. The code is available online at

Re-ordered Micro Image based High Efficient Residual Coding in Light Field Compression

  • Hyunmin Jung
  • Hyuk-Jae Lee
  • Chae Eun Rhee

Light field (LF), a new approach in three-dimensional image processing, has been actively used in various applications in recent years. An LF comprises a large amount of data, which inevitably raises LF compression (LFC) issues. Pseudo-sequence (PS)-based LFC converts an LF into a video sequence and compresses it with a video codec, whereas synthesis-based LFC (SYN-LFC) synthesizes the rest of the LF from a subset of views to reduce the number of bits. SYN-LFC is superior to PS-based LFC at low bitrates, but its competitiveness decreases at high bitrates due to inefficient compression of residuals. This paper maximizes the advantages of SYN-LFC by increasing the compression efficiency of the residuals. To exploit the characteristics of the residual in favor of compression, this paper compresses the residual in the form of a micro image (MI). Converting residuals to an MI gathers the similar residuals of each viewpoint together, which increases spatial coherence. However, conventional MI conversion does not reflect the geometric characteristics of the LF at all. To tackle this problem, this paper proposes the re-ordered micro image (RoMI), a novel MI conversion that takes advantage of the geometric characteristics of the LF, thereby maximizing spatial coherence and compression efficiency. To compress MI-type residuals, JPEG2000, an image-level codec, is used; it is highly suitable for RoMI, whose spatial coherence extends beyond the block level. In the experimental results, the proposed RoMI shows average improvements of 30.29% and 14.05% in compression efficiency compared to existing PS-based LFC and SYN-LFC methods, respectively.

Accelerating General-purpose Lossless Compression via Simple and Scalable Parameterization

  • Yu Mao
  • Yufei Cui
  • Tei-Wei Kuo
  • Chun Jason Xue

The storage of multimedia data can benefit from advancements in general-purpose lossless compression. The explosive growth of multimedia data volume in data centers demands a higher compression ratio and better run-time speed from compressors. However, recent deep-learning-based compressors with high compression ratios usually build complicated dependencies on history symbols, leading to long compression times. This paper investigates the behavior of history symbols and finds an approximate order of importance: recent symbols have a substantially larger influence on the probability estimate of the next unknown symbol. This observation guides the design of an interpretable structure for data compression, rather than learning the structure implicitly from data as Recurrent Neural Networks (RNNs) and attention do. Based on this observation, we disentangle the compression model into order learning and feature learning, which were fused into one large module in previous works. A parameterized ordered mask unit is established to learn the ordered importance of history symbols, and a fast Multi-Layer Perceptron (MLP) network is designed for efficient feature learning. The proposed compressor improves both compression performance and computational efficiency compared with transformer-based or RNN-based compressors. To further enhance computational efficiency, we propose a branch-MLP block to replace the original MLP layer; this block halves the parameters and FLOPs of the original MLP without sacrificing compression performance. Experiments on multimedia data demonstrate that our model improves the compression ratio by 10% on average across data domains while doubling compression speed compared with the state-of-the-art. The source code and appendix are released at
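
The paper's key observation (recent symbols matter most, in an approximately fixed order) can be mimicked with a fixed, exponentially decaying mask over the history. In the actual model the mask weights are learned parameters; the decay rate, smoothing constant, and vocabulary handling here are assumptions for illustration.

```python
def ordered_mask_predict(history, vocab, decay=0.5):
    """Next-symbol probabilities from an explicit importance order over
    the history: the most recent symbol gets weight 1, the one before
    it `decay`, then `decay`**2, and so on. A tiny additive constant
    keeps unseen symbols from getting zero probability."""
    scores = {s: 1e-9 for s in vocab}
    for age, sym in enumerate(reversed(history)):
        scores[sym] += decay ** age
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}
```

Fed to an arithmetic coder, such a distribution assigns shorter codes to the symbols the recent past makes likely; e.g. after history "aab" the most recent "b" outweighs the two older "a" occurrences.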

SESSION: Oral Session X: Understanding Multimedia Content -- Multimodal Fusion and Embeddings

Semantic Data Augmentation based Distance Metric Learning for Domain Generalization

  • Mengzhu Wang
  • Jianlong Yuan
  • Qi Qian
  • Zhibin Wang
  • Hao Li

Domain generalization (DG) aims to learn a model on one or more different but related source domains that generalizes to an unseen target domain. Existing DG methods try to promote the diversity of source domains to improve the model's generalization ability, but they may have to introduce auxiliary networks or incur striking computational costs. In contrast, this work applies implicit semantic augmentation in feature space to capture the diversity of source domains. Concretely, an additional distance metric learning (DML) loss is included to optimize the local geometry of the data distribution. Besides, the logits from the cross-entropy loss with infinite augmentations are adopted as input features for the DML loss in lieu of the deep features. We also provide a theoretical analysis showing that the logits approximate the distances defined on the original features well. Further, we provide an in-depth analysis of the mechanism and rationale behind our approach, which gives a better understanding of why leveraging logits in lieu of features can help domain generalization. The proposed DML loss with implicit augmentation is incorporated into a recent DG method, the Fourier Augmented Co-Teacher framework (FACT); our method can also be easily plugged into various other DG methods. Extensive experiments on three benchmarks (Digits-DG, PACS and Office-Home) demonstrate that the proposed method achieves state-of-the-art performance.

Mix-DANN and Dynamic-Modal-Distillation for Video Domain Adaptation

  • Yuehao Yin
  • Bin Zhu
  • Jingjing Chen
  • Lechao Cheng
  • Yu-Gang Jiang

Video domain adaptation is non-trivial because video inherently involves multi-dimensional and multi-modal information. Existing works mainly adopt adversarial learning and self-supervised tasks to align features. Nevertheless, the explicit interaction between source and target in the temporal dimension, as well as the adaptation between modalities, remains unexploited. In this paper, we propose Mix-Domain-Adversarial Neural Network and Dynamic-Modal-Distillation (MD-DMD), a novel multi-modal adversarial learning framework for unsupervised video domain adaptation. Our approach incorporates the temporal information between source and target domains, as well as the diversity of adaptability between modalities. On the one hand, for each modality, we mix frames from the source and target domains to form mixed samples, then let the adversarial discriminator predict the mix ratio of a mixed sample, further enhancing the model's ability to capture domain-invariant feature representations. On the other hand, we dynamically estimate the adaptability of the different modalities during training, then pick the most adaptable modality as a teacher to guide the other modalities by knowledge distillation. As a result, modalities are capable of learning transferable knowledge from each other, which leads to more effective adaptation. Experiments on two video domain adaptation benchmarks demonstrate the superiority of our proposed MD-DMD over state-of-the-art methods.

Search-oriented Micro-video Captioning

  • Liqiang Nie
  • Leigang Qu
  • Dai Meng
  • Min Zhang
  • Qi Tian
  • Alberto Del Bimbo

Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences describing the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes a given video by generating query-like sentences from the consumer's angle. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, trained over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model to generate different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We justify our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.

Dual Part Discovery Network for Zero-Shot Learning

  • Jiannan Ge
  • Hongtao Xie
  • Shaobo Min
  • Pandeng Li
  • Yongdong Zhang

Zero-Shot Learning (ZSL) aims to recognize unseen classes by transferring knowledge from seen classes. Recent methods focus on learning a common semantic space to align visual and attribute information. However, they tend to over-rely on the provided attributes and ignore the category-discriminative information that contributes to accurate unseen-class recognition, resulting in weak transferability. To this end, we propose a novel Dual Part Discovery Network (DPDN) that considers both attribute and category-discriminative information by discovering attribute-guided parts and category-guided parts simultaneously to improve knowledge transfer. Specifically, for attribute-guided part discovery, DPDN localizes the regions carrying specific attribute information and significantly bridges the gap between visual and semantic information under the guidance of the given attributes. For category-guided part discovery, local parts are explored to discover other important regions that carry latent crucial details ignored by the attributes, guided by adaptive category prototypes. To better mine transferable knowledge, we impose class-correlation constraints to regularize the category prototypes. Finally, the attribute- and category-guided parts complement each other and provide adequate discriminative, subtle information for more accurate unseen-class recognition. Extensive experimental results demonstrate that DPDN discovers discriminative parts and outperforms state-of-the-art methods on three standard benchmarks.

Non-Autoregressive Cross-Modal Coherence Modelling

  • Yi Bin
  • Wenhao Shi
  • Jipeng Zhang
  • Yujuan Ding
  • Yang Yang
  • Heng Tao Shen

Modelling the coherence of information is important for humans to perceive and comprehend the physical world. Existing works on coherence modelling mainly focus on a single modality, overlooking the effects of information integration and semantic consistency across modalities. To fill this research gap, this paper targets cross-modal coherence modelling, specifically the cross-modal ordering task. The task requires not only exploring the coherence information within a single modality, but also leveraging cross-modal information to model the semantic consistency between modalities. To this end, we propose a Non-Autoregressive Cross-modal Ordering Net (NACON) adopting a basic encoder-decoder architecture. Specifically, NACON is equipped with an order-invariant context encoder to model the unordered input set and a non-autoregressive decoder to generate ordered sequences in parallel. We devise a cross-modal positional attention module in NACON to take advantage of cross-modal order guidance. To alleviate the repetition problem of non-autoregressive models, we introduce an elegant exclusive loss to constrain the ordering exclusiveness between positions and elements. We conduct extensive experiments on two assembled datasets, SIND and TACoS-Ordering, to support our task. Experimental results show that the proposed NACON can effectively leverage cross-modal guidance and recover the correct order of the elements. The code is available at

CoHOZ: Contrastive Multimodal Prompt Tuning for Hierarchical Open-set Zero-shot Recognition

  • Ning Liao
  • Yifeng Liu
  • Li Xiaobo
  • Chenyi Lei
  • Guoxin Wang
  • Xian-Sheng Hua
  • Junchi Yan

Practical image recognition often encounters samples whose labels either are totally unknown or belong to new classes outside the training set. The first problem is open-set recognition (OSR), in which unknown classes are recognized as a single class carrying no further semantic information. The latter is zero-shot learning (ZSL), in which the new classes are usually predefined. The existing literature mostly addresses these two problems separately. In this paper, we aim to solve their combination: semantically recognizing, via zero-shot prediction, the unknown classes detected in OSR. We propose Contrastive multimodal prompt tuning for Hierarchical Open-set Zero-shot recognition (CoHOZ). Specifically, we first build a global and compatible hierarchical label tree with all downstream datasets aligned, which lays the foundation for the other modules. To detect unknown classes, we propose contrastive continuous prompt tuning, which introduces additional negative classes from the fine level of the built hierarchy for prompt learning. To generate candidate classes for zero-shot prediction on the unknown data using prompts, we use the built hierarchy to collect candidate classes from coarse to fine. In our experiments, when following the standard OSR protocol of regarding all unknown classes as a single class, CoHOZ achieves new state-of-the-art performance in both unknown detection and open-set recognition. Few-shot tuning with CoHOZ also shows competitive performance. In addition, the detailed semantic information of unknown classes is well explored, as verified in our experiments.

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

  • Zhi-Qi Cheng
  • Qi Dai
  • Siyao Li
  • Teruko Mitamura
  • Alexander Hauptmann

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like'' event understanding. Specifically, the GSR task not only detects the salient activity verb (e.g., buying), but also predicts all corresponding semantic roles (e.g., agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. This framework, however, constitutes a major obstacle to semantic understanding. First, pre-detecting the verb alone, without semantic roles, inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations between the verb and the roles. To this end, in this paper we propose a novel two-stage framework that focuses on utilizing such bidirectional relations between verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, where an intermediate representation for each corresponding semantic role is learned from images. In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: the verb is updated using nouns corresponding to the image, and the nouns are updated using verbs from support images. Extensive experimental results on the challenging SWiG benchmark show that our renovated framework outperforms other state-of-the-art methods under various metrics.

CALM: Common-Sense Knowledge Augmentation for Document Image Understanding

  • Qinyi Du
  • Qingqing Wang
  • Keqian Li
  • Jidong Tian
  • Liqiang Xiao
  • Yaohui Jin

The performance of document image understanding has been significantly fueled by encoding multi-modal information in recent years. However, existing works rely heavily on the superficial appearance of the observed data, resulting in counter-intuitive model behavior in many critical cases. To overcome this issue, this paper proposes CALM, a common-sense knowledge augmented model for document image understanding tasks. It first produces purified representations of document contents to extract key information and learn common-sense augmented representations of the inputs. Then, relevant common-sense knowledge is extracted from the external ConceptNet knowledge base, and a derived knowledge graph is built to jointly enhance the common-sense reasoning capability of CALM. To further highlight the importance of common-sense knowledge in document image understanding, we propose the first question-answering dataset, CS-DVQA, focused on common-sense reasoning for document images, in which questions are answered by taking both document contents and common-sense knowledge into consideration. Through extensive evaluation, the proposed CALM approach outperforms state-of-the-art models in three document image understanding tasks: key information extraction (from 85.37 to 86.52), document image classification (from 96.08 to 96.17), and document visual question answering (from 86.72 to 88.03).

Cross-Modal Retrieval with Heterogeneous Graph Embedding

  • Dapeng Chen
  • Min Wang
  • Haobin Chen
  • Lin Wu
  • Jing Qin
  • Wei Peng

Conventional methods address the cross-modal retrieval problem by projecting the multi-modal data into a shared representation space. Such a strategy inevitably loses modality-specific information, leading to decreased retrieval accuracy. In this paper, we propose heterogeneous graph embeddings to preserve richer cross-modal information. The embedding from one modality is compensated with the aggregated embeddings from the other modality. In particular, a self-denoising tree search is designed to reduce the "label noise" problem, making the heterogeneous neighborhood more semantically relevant. Dual-path aggregation tackles the "modality imbalance" problem, giving each sample comprehensive dual-modality information. The final heterogeneous graph embedding is obtained by feeding the aggregated dual-modality features to the cross-modal self-attention module. Experiments conducted on cross-modality person re-identification and image-text retrieval tasks validate the superiority and generality of the proposed method.

Simple Self-supervised Multiplex Graph Representation Learning

  • Yujie Mo
  • Yuhuan Chen
  • Liang Peng
  • Xiaoshuang Shi
  • Xiaofeng Zhu

Self-supervised multiplex graph representation learning (SMGRL) aims to capture the information in a multiplex graph and generate discriminative embeddings without labels. However, previous SMGRL methods still suffer from issues of efficiency and effectiveness due to processes such as data augmentation, negative sample encoding, and complex pretext tasks. In this paper, we propose a simple method to achieve efficient and effective SMGRL. Specifically, to achieve efficiency, the proposed method removes the data augmentation and negative sample encoding processes and designs a simple pretext task. Moreover, to achieve effectiveness, it designs an intra-graph decorrelation loss and an inter-graph decorrelation loss to capture the common information within individual graphs and across graphs, respectively. Extensive experimental results on the node classification task verify the efficiency and effectiveness of our method compared to 11 comparison methods on 4 public benchmark datasets.

Ordered Attention for Coherent Visual Storytelling

  • Tom Braude
  • Idan Schwartz
  • Alex Schwing
  • Ariel Shamir

We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each story sentence should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. Current approaches encode images independently, disregarding relations between images. Our approach learns to encode images with different interactions based on the story position (i.e., past image or future image). To this end, we develop a novel message-passing-like algorithm for ordered image attention (OIA) that collects interactions across all the images in the sequence. Finally, to generate the story's sentences, a second attention mechanism, Image-Sentence Attention (ISA), picks the important image attention vectors. The obtained results improve the METEOR score on the VIST dataset by 1%. Furthermore, a thorough human study confirms the improvements and demonstrates that order-based interactions significantly improve coherency (64.20% vs. 28.70%). Source code is available at

LVI-ExC: A Target-free LiDAR-Visual-Inertial Extrinsic Calibration Framework

  • Zhong Wang
  • Lin Zhang
  • Ying Shen
  • Yicong Zhou

Recently, multi-modal fusion of 3D LiDAR, camera, and IMU has shown great potential in automation-related fields. A prerequisite for successful fusion is that the geometric relationships among the sensors are accurately determined, which is known as the extrinsic calibration problem. To date, existing target-based approaches to this problem rely on sophisticated calibration objects (sites) and well-trained operators, which is time-consuming and inflexible in practical applications. In contrast, a few target-free methods can overcome these shortcomings, but they only address the calibration of two types of sensors. Although it is possible to obtain LiDAR-visual-inertial extrinsics by chaining calibrations, problems such as cumbersome operation, large cumulative errors, and weak geometric consistency remain. To this end, we propose LVI-ExC, an integrated LiDAR-Visual-Inertial Extrinsic Calibration framework, which takes natural multi-modal data as input and yields sensor-to-sensor extrinsics end-to-end without any auxiliary object (site) or manual assistance. To fuse multi-modal data, we formulate LiDAR-visual-inertial extrinsic calibration as a continuous-time simultaneous localization and mapping problem, in which the extrinsics, trajectories, time differences, and map points are jointly estimated by establishing sensor-to-sensor and sensor-to-trajectory constraints. Extensive experiments show that LVI-ExC produces precise results. With LVI-ExC's outputs, the LiDAR-visual reprojection results and the reconstructed environment map are highly consistent with the actual natural scenes, demonstrating LVI-ExC's outstanding performance. To ensure that our results are fully reproducible, all the relevant data and code have been released publicly at

MM-ALT: A Multimodal Automatic Lyric Transcription System

  • Xiangming Gu
  • Longshen Ou
  • Danielle Ong
  • Ye Wang

Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task, as instrumental accompaniment and musical constraints degrade both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data from an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness. The project page is at

Self-supervised Exclusive Learning for 3D Segmentation with Cross-Modal Unsupervised Domain Adaptation

  • Yachao Zhang
  • Miaoyu Li
  • Yuan Xie
  • Cuihua Li
  • Cong Wang
  • Zhizhong Zhang
  • Yanyun Qu

2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing on the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristics for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids prohibitive annotation. Specifically, two self-supervised tasks are designed, named "plane-to-spatial'' and "discrete-to-textured''. The former helps the 2D network branch improve its perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned, and the complementarity of the modalities is strengthened, resulting in a network robust to different domains. With the supervision of the self-supervised tasks, we introduce a mixed domain to enhance the perception of the target domain by mixing patches of the source and target domain samples. Besides, we propose domain-category adversarial learning with category-wise discriminators, constructing category prototypes for learning domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.

Cross-Compatible Embedding and Semantic Consistent Feature Construction for Sketch Re-identification

  • Yafei Zhang
  • Yongzeng Wang
  • Huafeng Li
  • Shuang Li

Sketch re-identification (Re-ID) refers to using sketches of pedestrians to retrieve their corresponding photos from surveillance videos. It can track pedestrians based on sketches drawn from eyewitness accounts, without querying pedestrian photos. Although the Sketch Re-ID concept has been proposed, the gap between sketch and photo still greatly hinders pedestrian identity matching. Based on the idea of transplantation without rejection, we propose a Cross-Compatible Embedding (CCE) approach to narrow this gap. A Semantic Consistent Feature Construction (SCFC) scheme is simultaneously presented to enhance feature discrimination. Under the guidance of identity consistency, the CCE performs cross-modal interchange at the local token level in the Transformer framework, enabling the model to extract modality-compatible features. The SCFC improves the representation ability of features by handling the inconsistency of information at the same location of a sketch and the corresponding pedestrian photo. The SCFC scheme divides the local tokens of pedestrian images of different modalities into groups and assigns specific semantic information to each group to construct a semantically consistent global feature representation. Experiments on the public Sketch Re-ID dataset confirm the effectiveness of the proposed method and its superiority over existing methods. Experiments on the sketch-based image retrieval datasets QMUL-Shoe-v2 and QMUL-Chair-v2 are conducted to assess the method's generalization. The results show that the proposed method outperforms the compared state-of-the-art works. The source code of our method is available at:

Difference Residual Graph Neural Networks

  • Liang Yang
  • Weihang Peng
  • Wenmiao Zhou
  • Bingxin Niu
  • Junhua Gu
  • Chuan Wang
  • Yuanfang Guo
  • Dongxiao He
  • Xiaochun Cao

Graph Neural Networks (GNNs) have been widely employed for multimodal fusion and embedding. To overcome the over-smoothing issue, residual connections, originally designed to alleviate the vanishing gradient problem in NNs, are adopted in GNNs to incorporate local node information. However, these simple residual connections are ineffective on networks with heterophily, since the roles of both convolutional operations and residual connections in GNNs differ significantly from those in classic NNs. Considering the specific smoothing characteristic of the graph convolutional operation, deep layers in GNNs are expected to focus on the data that cannot be properly handled in shallow layers. To this end, we propose novel and universal Difference Residual Connections (DRC), which feed the difference between the output and input of the previous layer as the input of the next layer. Essentially, Difference Residual Connections are equivalent to inserting layers with the opposite effect (e.g., sharpening) into the network to prevent the excessive effect (e.g., the over-smoothing issue) induced by too many layers with a similar role (e.g., smoothing) in GNNs. From the perspective of optimization, DRC is a gradient descent method minimizing an objective function with both smoothing and sharpening terms. The analytic solution to this objective function is determined by both graph topology and node attributes, which theoretically proves that DRC can prevent the over-smoothing issue. Extensive experiments demonstrate the superiority of DRC on real networks with both homophily and heterophily, and show that DRC can automatically determine the model depth and is adaptive to both shallow and deep models with two complementary components.
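The core idea of feeding the difference between a layer's output and input into the next layer can be sketched in a few lines. The toy layer below uses plain symmetric-normalized neighborhood averaging as the smoothing convolution; the actual DRC layers are learned, so everything here is an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def drc_forward(A_norm, X, num_layers):
    """Stack smoothing layers with Difference Residual Connections:
    the next layer receives (output - input) of the previous layer."""
    h_in = X
    outputs = []
    for _ in range(num_layers):
        h_out = A_norm @ h_in      # smoothing graph convolution
        outputs.append(h_out)
        h_in = h_out - h_in        # difference residual connection
    return outputs

# toy 3-node path graph with one-hot node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
outs = drc_forward(normalize_adj(A), np.eye(3), num_layers=3)
```

Because each deeper layer only sees what the previous layer changed, repeated smoothing no longer drives all node embeddings toward the same vector, which matches the sharpening interpretation in the abstract.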

SESSION: Poster Session X: Understanding Multimedia Content -- Multimodal Fusion and Embeddings

Normalization-based Feature Selection and Restitution for Pan-sharpening

  • Man Zhou
  • Jie Huang
  • Keyu Yan
  • Gang Yang
  • Aiping Liu
  • Chongyi Li
  • Feng Zhao

Pan-sharpening is essentially a panchromatic (PAN) image-guided super-resolution problem for low-spatial-resolution multi-spectral (MS) images. A common challenge in pan-sharpening is how to correctly select consistent features between the PAN and MS modalities and propagate them, while properly handling inconsistent ones. To solve this issue, we propose a Normalization-based Feature Selection and Restitution mechanism, which is capable of filtering out the inconsistent features and promoting the learning of consistent ones. Specifically, we first modulate the PAN feature to the MS style in feature space via the AdaIN operation. However, this operation inevitably removes some favorable features. We thus propose to distill the effective information from the removed part and restitute it back to the modulated part. For better distillation, we enforce a contrastive learning constraint that pulls the restituted feature close to the ground truth and pushes the removed part away from it. In this way, the consistent features of PAN images are correctly selected and the inconsistent ones are filtered out, relieving the over-transfer artifacts in the process of PAN-guided MS super-resolution. Extensive experiments validate the effectiveness of the proposed network and demonstrate its favorable performance against other state-of-the-art methods. The source code will be released at
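The AdaIN operation used above to modulate PAN features into the MS style simply re-normalizes per-channel statistics. A minimal NumPy sketch (channel-first feature maps; the shapes and values are illustrative only, not the paper's network):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization: give the content feature map
    (e.g. PAN) the per-channel mean/std of the style map (e.g. MS)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
pan_feat = rng.normal(0.0, 1.0, size=(4, 8, 8))  # (channels, H, W)
ms_feat = rng.normal(2.0, 3.0, size=(4, 8, 8))
modulated = adain(pan_feat, ms_feat)
```

After modulation the PAN features carry MS channel statistics; whatever structure this normalization strips away is what the restitution branch in the abstract aims to recover.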

Adaptively Learning Low-high Frequency Information Integration for Pan-sharpening

  • Man Zhou
  • Jie Huang
  • Chongyi Li
  • Hu Yu
  • Keyu Yan
  • Naishan Zheng
  • Feng Zhao

Pan-sharpening aims to generate a high-spatial-resolution multi-spectral (MS) image by fusing a high-spatial-resolution panchromatic (PAN) image with its corresponding low-spatial-resolution MS image. Despite remarkable progress, most existing pan-sharpening methods work only in the spatial domain and rarely explore potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework that adaptively learns low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: a mask prediction sub-network, a low-frequency learning sub-network, and a high-frequency learning sub-network. Specifically, the first measures the modality-aware frequency information difference between PAN and MS images and predicts the low-high frequency boundary in the form of a two-dimensional mask. Given the mask, the second adaptively picks out the corresponding low-frequency components of the different modalities and restores the expected low-frequency component by integrating spatial- and frequency-domain information, while the third combines the refined low-frequency component with the original high-frequency one for the latent high-frequency reconstruction. In this way, the low-high frequency information is adaptively learned, leading to pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate its favorable performance against other state-of-the-art methods. The source code will be released at

Complementary Graph Representation Learning for Functional Neuroimaging Identification

  • Rongyao Hu
  • Liang Peng
  • Jiangzhang Gan
  • Xiaoshuang Shi
  • Xiaofeng Zhu

The study of functional connectomics on resting-state functional magnetic resonance imaging (rs-fMRI) data has become a popular approach to early disease diagnosis. However, previous methods did not jointly consider the global patterns, the local patterns, and the temporal information of the blood-oxygen-level-dependent (BOLD) signals, restricting model effectiveness for early disease diagnosis. In this paper, we propose a new graph convolutional network (GCN) method to capture local and global patterns for dynamic functional connectivity analysis. Specifically, we first employ the sliding window method to partition the original BOLD signals into multiple segments, enabling dynamic functional connectivity analysis, and then design a multi-view node classification and a temporal graph classification to output two kinds of representations, which capture the temporally global patterns and the temporally local patterns, respectively. We further fuse these two kinds of representations by a weighted concatenation method whose effectiveness is also experimentally verified. Experimental results on real datasets demonstrate the effectiveness of our method compared to comparison methods on different classification tasks.
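The sliding-window segmentation that underlies dynamic functional connectivity is straightforward to sketch. The window and stride values below are hypothetical, chosen only for illustration; per-segment connectivity is estimated here with a plain correlation matrix, which is a common choice but not necessarily the paper's:

```python
import numpy as np

def sliding_windows(bold, win, stride):
    """Partition a (regions, timepoints) BOLD matrix into overlapping
    temporal segments for dynamic functional-connectivity analysis."""
    n_t = bold.shape[1]
    return [bold[:, s:s + win] for s in range(0, n_t - win + 1, stride)]

# toy signal: 2 brain regions, 10 timepoints
bold = np.arange(20, dtype=float).reshape(2, 10)
segments = sliding_windows(bold, win=4, stride=3)

# one functional-connectivity matrix per segment
conn = [np.corrcoef(seg) for seg in segments]
```

Each connectivity matrix can then serve as one temporal graph in the multi-view node classification and temporal graph classification branches described above.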

Dynamically Adjust Word Representations Using Unaligned Multimodal Information

  • Jiwei Guo
  • Jiajia Tang
  • Weichen Dai
  • Yu Ding
  • Wanzeng Kong

Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges in this area are that (a) multimodal data is unaligned in nature due to the different sampling rates of each modality, and (b) long-range dependencies exist between elements across modalities. These challenges increase the difficulty of conducting efficient multimodal fusion. In this work, we propose a novel end-to-end network named the Cross Hyper-modality Fusion Network (CHFN). The CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It captures the influence of non-verbal behavioral information at the scale of entire utterances and then integrates this influence into the verbal expression. We conducted experiments on the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between the language modality and non-verbal behavioral information and explore the underlying dynamics of multimodal language data.

Bipartite Graph-based Discriminative Feature Learning for Multi-View Clustering

  • Weiqing Yan
  • Jindong Xu
  • Jinglei Liu
  • Guanghui Yue
  • Chang Tang

Multi-view clustering is an important technique in machine learning research. Existing methods have improved clustering performance, but most of them learn the graph structure from all samples, which incurs high complexity. Bipartite graph-based multi-view clustering obtains the clustering result by establishing the relationship between the sample points and a small set of anchor points, improving the efficiency of clustering. However, most bipartite graph-based clustering methods focus only on topological graph structure learning from sample nodes and ignore the influence of node features. In this paper, we propose bipartite graph-based discriminative feature learning for multi-view clustering, which combines bipartite graph learning and discriminative feature learning in a unified framework. Specifically, bipartite graph learning is performed via multi-view subspace representation with manifold regularization terms. Meanwhile, the feature learning utilizes data pseudo-labels obtained from the fused bipartite graph to seek a projection direction that pulls data points with the same label closer and pushes data points with different labels away from each other. Finally, the proposed manifold regularization terms establish the relationship between the constructed bipartite graph and the new data representation. By leveraging the interactions between structure learning and discriminative feature learning, we are able to select more informative features and capture a more accurate data structure for clustering. Extensive experimental results on datasets of different scales demonstrate that our method achieves better or comparable clustering performance than state-of-the-art methods.

Dynamic Incomplete Multi-view Imputing and Clustering

  • Xingfeng Li
  • Quansen Sun
  • Zhenwen Ren
  • Yinghui Sun

Incomplete multi-view clustering (IMVC) is deemed a significant research topic in multimedia for handling data loss situations. Current late fusion incomplete multi-view clustering methods have attracted intensive attention owing to their effective and efficient use of a consensus partition for imputation and clustering. However, 1) their imputation quality and clustering performance depend heavily on a static prior partition, such as predefined zero filling, destroying the diversity of different views; and 2) the size of the base partitions is too small, losing advantageous details of the base kernels and decreasing clustering performance. To address these issues, we propose a novel IMVC method, named Dynamic Incomplete Multi-view Imputing and Clustering (DIMIC). Concretely, rather than relying on a fixed predefined partition matrix, the observed views dynamically generate a consensus proxy under the guidance of a shared cluster matrix for more effective imputation and clustering. Furthermore, a proper size of base partitions is employed to preserve sufficient kernel details, further enhancing the quality of the consensus proxy. We design a solver with linear computational and memory complexity; extensive experiments on multiple public datasets validate our effectiveness, superiority, and efficiency against recent advances.

Learning Smooth Representation for Multi-view Subspace Clustering

  • Shudong Huang
  • Yixi Liu
  • Yazhou Ren
  • Ivor W. Tsang
  • Zenglin Xu
  • Jiancheng Lv

Multi-view subspace clustering aims to exploit the consensus of data correlations among multiple views, and can essentially be treated as a graph-based approach. However, existing methods usually suffer from suboptimal solutions, as the raw data might not be separable into subspaces. In this paper, we propose to achieve a smooth representation for each view and thus facilitate the downstream clustering task. It is based on the assumption that a graph signal is smooth if nearby nodes on the graph have similar feature representations. Specifically, our model is able to retain the geometric features of the graph by applying a low-pass filter to extract smooth representations of the multiple views. Besides, our method performs smooth representation learning and multi-view clustering interactively in a unified framework; hence it is an end-to-end, single-stage learning problem. Substantial experiments on benchmark multi-view datasets validate the effectiveness of the proposed method compared to the state of the art in clustering performance.
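The low-pass filtering step can be illustrated with the common graph filter (I - L/2)^k, where L is the symmetrically normalized Laplacian. This particular filter is a standard choice in graph signal processing and is an assumption here, not necessarily the paper's exact formulation:

```python
import numpy as np

def smooth_representation(A, X, k=2):
    """Apply the low-pass graph filter (I - L/2)^k to node features X,
    where L is the symmetrically normalized Laplacian of adjacency A."""
    n = A.shape[0]
    A_hat = A + np.eye(n)  # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    L = np.eye(n) - d_inv_sqrt @ A_hat @ d_inv_sqrt
    H = np.eye(n) - 0.5 * L  # low-pass filter
    for _ in range(k):
        X = H @ X
    return X

# fully connected toy graph: filtering pulls neighboring features together
A = np.ones((3, 3)) - np.eye(3)
X = np.array([[0.], [3.], [6.]])
X_smooth = smooth_representation(A, X, k=2)
```

Because the filter attenuates high-frequency (rapidly varying) components of the graph signal, the smoothed features of connected nodes become more similar, which is exactly the smoothness assumption stated in the abstract.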

LFBCNet: Light Field Boundary-aware and Cascaded Interaction Network for Salient Object Detection

  • Mianzhao Wang
  • Fan Shi
  • Xu Cheng
  • Meng Zhao
  • Yao Zhang
  • Chen Jia
  • Weiwei Tian
  • Shengyong Chen

In light field imaging, the abundance of stereo spatial information aids in improving the performance of salient object detection. In some complex scenes, however, applying the 4D light field boundary structure to discriminate salient objects from background regions is still under-explored. In this paper, we propose a light field boundary-aware and cascaded interaction network based on the light field macro-EPI, named LFBCNet. Firstly, we propose a well-designed light field multi-epipolar-aware learning (LFML) module to learn rich salient boundary cues by perceiving continuous angle changes in the light field macro-EPI. Secondly, to fully excavate the correlation between salient objects and boundaries at different scales, we design multiple light field boundary interactive (LFBI) modules and cascade them to form a light field multi-scale cascade interaction decoder network. Each LFBI is assigned to predict exquisite salient objects and boundaries by interactively transmitting salient object and boundary features. Meanwhile, the salient boundary features gradually refine the salient object features during the multi-scale cascade encoding. Furthermore, a light field multi-scale-fusion prediction (LFMP) module is developed to automatically select and integrate multi-scale salient object features for the final saliency prediction. The proposed LFBCNet can accurately distinguish tiny differences between salient objects and background regions. Comprehensive experiments on large benchmark datasets show that the proposed method achieves competitive performance against 2D, 3D, and 4D salient object detection methods.

Multiple Kernel Clustering with Dual Noise Minimization

  • Junpu Zhang
  • Liang Li
  • Siwei Wang
  • Jiyuan Liu
  • Yue Liu
  • Xinwang Liu
  • En Zhu

Clustering is a representative unsupervised method widely applied in multi-modal and multi-view scenarios. Multiple kernel clustering (MKC) aims to group data by integrating complementary information from base kernels. As a representative approach, late fusion MKC first decomposes the kernels into orthogonal partition matrices and then learns a consensus one from them, achieving promising performance recently. However, these methods fail to consider the noise inside the partition matrix, preventing further improvement of clustering performance. We discover that this noise can be disassembled into two separable parts, i.e., N-noise and C-noise (null space noise and column space noise). In this paper, we rigorously define dual noise and propose a novel parameter-free MKC algorithm that minimizes it. To solve the resultant optimization problem, we design an efficient two-step iterative strategy. To the best of our knowledge, this is the first work to investigate dual noise within the partition in the kernel space. We observe that dual noise pollutes the block diagonal structures and incurs degeneration of clustering performance, and that C-noise is more destructive than N-noise. Owing to our efficient mechanism for minimizing dual noise, the proposed algorithm surpasses recent methods by large margins.

Webly Supervised Image Hashing with Lightweight Semantic Transfer Network

  • Hui Cui
  • Lei Zhu
  • Jingjing Li
  • Zheng Zhang
  • Weili Guan

Recent studies have verified the success of deep hashing for efficient image retrieval. However, most existing methods require abundant human-labeled data to optimize the large number of network parameters involved, which consequently restricts the scalability of deep image hashing. Alternatively, learning from freely available web images that inherently include rich semantics is a promising strategy. Nevertheless, the domain distribution gap prevents transferring the semantics of the source web images to the target images. Besides, most existing deep image hashing methods suffer from excessive training time to achieve satisfactory performance without explicit supervision; how to efficiently train the deep image hashing network is another important problem that needs to be seriously considered. In this paper, we propose Webly Supervised Image Hashing (WSIH) with a well-designed lightweight network. Our model enhances the semantics of unsupervised image hashing with weak supervision from freely available web images, while avoiding over-abundant parameters in the deep network architecture. In particular, we train a concept prototype learning network on the web images, learning well-trained network parameters and prototype codes that hold the discriminative semantics of the potential visual concepts in target images. Further, we meticulously design a lightweight siamese network architecture and a dual-level transfer mechanism to efficiently transfer the semantics learned from source web images to the target images. Experiments on two widely-tested image datasets show the superiority of the proposed method in both retrieval accuracy and training efficiency compared to state-of-the-art image hashing methods. The source code of our method is available at:

Rethinking Super-Resolution as Text-Guided Details Generation

  • Chenxi Ma
  • Bo Yan
  • Qing Lin
  • Weimin Tan
  • Siming Chen

Deep neural networks have greatly promoted the performance of single image super-resolution (SISR). Conventional methods still resort to restoring the single high-resolution (HR) solution based only on the input image modality. However, image-level information is insufficient to predict adequate details and photo-realistic visual quality under large upscaling factors (×8, ×16). In this paper, we propose a new perspective that regards SISR as a semantic image detail enhancement problem, generating semantically reasonable HR images that are faithful to the ground truth. To enhance the semantic accuracy and visual quality of the reconstructed image, we explore multi-modal fusion learning in SISR by proposing a Text-Guided Super-Resolution (TGSR) framework, which can effectively utilize information from both the text and image modalities. Different from existing methods, the proposed TGSR can generate HR image details that match the text descriptions through a coarse-to-fine process. Extensive experiments and ablation studies demonstrate the effectiveness of TGSR, which exploits the text reference to recover realistic images.

DEAL: An Unsupervised Domain Adaptive Framework for Graph-level Classification

  • Nan Yin
  • Li Shen
  • Baopu Li
  • Mengzhu Wang
  • Xiao Luo
  • Chong Chen
  • Zhigang Luo
  • Xian-Sheng Hua

Graph neural networks (GNNs) have achieved state-of-the-art results on graph classification tasks. They have been primarily studied in cases of supervised end-to-end training, which requires abundant task-specific labels. Unfortunately, annotating labels of graph data could be prohibitively expensive or even impossible in many applications. An effective solution is to incorporate labeled graphs from a different, but related source domain, to develop a graph classification model for the target domain. However, the problem of unsupervised domain adaptation for graph classification is challenging due to potential domain discrepancy in graph space as well as the label scarcity in the target domain. In this paper, we present a novel GNN framework named DEAL by incorporating both source graphs and target graphs, which is featured by two modules, i.e., adversarial perturbation and pseudo-label distilling. Specifically, to overcome domain discrepancy, we equip source graphs with target semantics by applying to them adaptive perturbations which are adversarially trained against a domain discriminator. Additionally, DEAL explores distinct feature spaces at different layers of the GNN encoder, which emphasize global and local semantics respectively. Then, we distill the consistent predictions from two spaces to generate reliable pseudo-labels for sufficiently utilizing unlabeled data, which further improves the performance of graph classification. Extensive experiments on a wide range of graph classification datasets reveal the effectiveness of our proposed DEAL.
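The pseudo-label distilling step above, which keeps only predictions on which the two feature spaces agree, can be sketched as follows. The agreement-plus-confidence rule and all names here are assumptions for illustration, not the exact DEAL procedure:

```python
import numpy as np

def distill_pseudo_labels(p_global, p_local, tau=0.8):
    # Keep a pseudo-label for a target graph only when the class predictions from the
    # global and local feature spaces agree and both are confident; mark others -1
    # so they are ignored by the classification loss.
    y1, y2 = p_global.argmax(1), p_local.argmax(1)
    conf = np.minimum(p_global.max(1), p_local.max(1))
    return np.where((y1 == y2) & (conf >= tau), y1, -1)
```

Such filtering trades coverage for reliability: fewer target graphs receive labels, but the retained labels are consistent across both semantic views.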

AVQA: A Dataset for Audio-Visual Question Answering on Videos

  • Pinci Yang
  • Xin Wang
  • Xuguang Duan
  • Hong Chen
  • Runze Hou
  • Cong Jin
  • Wenwu Zhu

Audio-visual question answering aims to answer questions regarding both audio and visual modalities in a given video, and has drawn increasing research interest in recent years. However, there have so far been no appropriate datasets for this challenging task on videos in real-life scenarios. Existing datasets are either designed with questions containing only visual clues, without taking any audio information into account, or consider audio only in restricted scenarios, such as panoramic videos and videos about music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos in real-life scenarios. We collect 57,015 videos from daily audio-visual activities and 57,335 specially-designed question-answer pairs relying on clues from both modalities, where the information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the audio, visual, and text modalities, and conduct ablation studies to analyze the role of different modalities on our dataset. Experimental results show that our proposed method significantly improves audio-visual question answering performance over various question types. Therefore, AVQA can provide an adequate testbed for developing models with a deeper understanding of multimodal information for audio-visual question answering in real-life scenarios. The dataset is available at

Prompting for Multi-Modal Tracking

  • Jinyu Yang
  • Zhe Li
  • Feng Zheng
  • Ales Leonardis
  • Jingkuan Song

Multi-modal tracking has gained attention due to its greater accuracy and robustness in complex scenarios compared to traditional RGB-based tracking. Its key lies in how to fuse multi-modal data and reduce the gap between modalities. However, multi-modal tracking still severely suffers from data deficiency, resulting in insufficient learning of fusion modules. Instead of building such a fusion module, in this paper we provide a new perspective on multi-modal tracking by attaching importance to multi-modal visual prompts. We design a novel multi-modal prompt tracker (ProTrack), which can transfer the multi-modal inputs to a single modality via the prompt paradigm. By fully exploiting the tracking ability of pre-trained RGB trackers learned at scale, our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data. Extensive experiments on 5 benchmark datasets demonstrate the effectiveness of the proposed ProTrack.
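The idea of folding an auxiliary modality into the RGB input so that a frozen RGB tracker can consume it unchanged can be sketched as a simple convex blend. The blending form and names are illustrative assumptions, not ProTrack's exact prompt construction:

```python
import numpy as np

def multimodal_prompt(rgb, aux, alpha=0.5):
    # Fold a single-channel auxiliary modality (e.g., depth or thermal) into the RGB
    # frame by convex combination, yielding one "prompt" image in the RGB domain.
    aux3 = np.repeat(aux[..., None], 3, axis=-1)  # broadcast aux to 3 channels
    return (1.0 - alpha) * rgb + alpha * aux3
```

With alpha=0 the prompt reduces to the plain RGB frame, so the pre-trained tracker's original input distribution is a special case of the prompted one.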

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar

  • Anjun Chen
  • Xiangyu Wang
  • Shaohao Zhu
  • Yanxu Li
  • Jiming Chen
  • Qi Ye

Millimeter wave (mmWave) radar is gaining popularity as it can work in adverse environments like smoke, rain, snow, poor lighting, etc. Prior work has explored the possibility of reconstructing 3D skeletons or meshes from the noisy and sparse mmWave radar signals. However, it is unclear how accurately we can reconstruct the 3D body from mmWave signals across scenes and how this compares with cameras, which are important aspects to consider when using mmWave radars alone or combining them with cameras. To answer these questions, we first design and build an automatic 3D body annotation system with multiple sensors to collect a large-scale dataset. The dataset consists of synchronized and calibrated mmWave radar point clouds and RGB(D) images in different scenes, together with skeleton/mesh annotations for the humans in the scenes. With this dataset, we train state-of-the-art methods with inputs from different sensors and test them in various scenarios. The results demonstrate that 1) despite the noise and sparsity of the generated point clouds, the mmWave radar achieves better reconstruction accuracy than the RGB camera but worse than the depth camera; and 2) reconstruction from the mmWave radar is only moderately affected by adverse weather conditions, while the RGB(D) camera is severely affected. Further analysis of the dataset and the results sheds light on improving reconstruction from mmWave radar and on combining signals from different sensors.

Eliminating Spatial Ambiguity for Weakly Supervised 3D Object Detection without Spatial Labels

  • Haizhuang Liu
  • Huimin Ma
  • Yilin Wang
  • Bochao Zou
  • Tianyu Hu
  • Rongquan Wang
  • Jiansheng Chen

Previous weakly-supervised methods of 3D object detection in driving scenes mainly rely on spatial labels, which provide location, dimension, or orientation information. The annotation of 3D spatial labels is time-consuming. There also exist methods that do not require spatial labels, but their detections may fall on object parts or on backgrounds rather than entire objects. In this paper, a novel cross-modal weakly-supervised 3D progressive refinement framework (WS3DPR) for 3D object detection that only needs image-level class annotations is introduced. The proposed framework consists of two stages: 1) classification refinement for potential object localization and 2) regression refinement for spatial pseudo-label reasoning. In the first stage, a region proposal network is trained with cross-modal class knowledge transferred from 2D images to 3D point clouds and with class information propagation. In the second stage, the locations, dimensions, and orientations of 3D bounding boxes are further refined with geometric reasoning based on 2D frustums and 3D regions. When only image-level class labels are available, proposals with different 3D locations become overlapped in 2D, leading to the misclassification of foreground objects. Therefore, a 2D-3D semantic consistency block is proposed to disentangle different 3D proposals after projection. The overall framework progressively learns features in a coarse-to-fine manner. Comprehensive experiments on the KITTI3D dataset demonstrate that our method achieves competitive performance compared with previous methods while requiring only a lightweight labeling process.

Dynamic Graph Reasoning for Multi-person 3D Pose Estimation

  • Zhongwei Qiu
  • Qiansheng Yang
  • Jian Wang
  • Dongmei Fu

Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowd scenes. To solve these problems, most existing methods model body context cues by enhancing feature representations with graph neural networks or by adding structural constraints. However, these methods are not robust because of their single-root formulation, which decodes 3D poses from a root node with a pre-defined graph. In this paper, we propose GR-M3D, which models multi-person 3D pose estimation with dynamic Graph Reasoning. The decoding graph in GR-M3D is predicted rather than pre-defined. In particular, GR-M3D first generates several data maps and enhances them with a scale- and depth-aware refinement module (SDAR). Then multiple root keypoints and dense decoding paths for each person are estimated from these data maps. Based on them, dynamic decoding graphs are built by assigning path weights to the decoding paths, while the path weights are inferred from the enhanced data maps. This process is named dynamic graph reasoning (DGR). Finally, the 3D poses are decoded according to the dynamic decoding graph of each detected person. GR-M3D can implicitly adjust the structure of the decoding graph by adapting soft path weights to the input data, which makes the decoding graphs adaptive to different input persons and more capable of handling occlusion and depth ambiguity than previous methods. We empirically show that the proposed bottom-up approach even outperforms top-down methods and achieves state-of-the-art results on three 3D pose datasets.

DiT: Self-supervised Pre-training for Document Image Transformer

  • Junlong Li
  • Yiheng Xu
  • Tengchao Lv
  • Lei Cui
  • Cha Zhang
  • Furu Wei

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterpart exists due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results illustrate that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55) and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at

Learning to Estimate External Forces of Human Motion in Video

  • Nathan Louis
  • Jason J. Corso
  • Tylan N. Templin
  • Travis D. Eliason
  • Daniel P. Nicolella

Analyzing sports performance or preventing injuries requires capturing ground reaction forces (GRFs) exerted by the human body during certain movements. Standard practice uses physical markers paired with force plates in a controlled environment, but this is marred by high costs, lengthy implementation time, and variance across repeat experiments; hence, we propose inferring GRFs from video. While recent work has used LSTMs to estimate GRFs from 2D viewpoints, these can be limited in their modeling and representation capacity. First, we propose using a transformer architecture to tackle the GRF-from-video task, being the first to do so. Then we introduce a new loss to minimize high-impact peaks in regressed curves. We also show that pre-training and multi-task learning on 2D-to-3D human pose estimation improves generalization to unseen motions, and that pre-training on this different task provides good initial weights for fine-tuning on smaller (rarer) GRF datasets. We evaluate on LAAS Parkour and a newly collected ForcePose dataset, showing up to a 19% decrease in error compared to prior approaches.
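A loss that penalizes errors near high-impact peaks more heavily, in the spirit of the peak-oriented loss mentioned above, might look like the following magnitude-weighted MSE. This is an assumed form for illustration, not the paper's exact loss:

```python
import numpy as np

def peak_weighted_loss(pred, target, beta=2.0):
    # MSE where each time step is re-weighted by the normalized magnitude of the
    # target GRF, so errors at high-impact peaks of the curve cost more.
    w = 1.0 + beta * np.abs(target) / (np.abs(target).max() + 1e-8)
    return np.mean(w * (pred - target) ** 2)
```

Compared with plain MSE, the same absolute error is more expensive at a force peak than in a quiet region of the curve, nudging the regressor to track impacts accurately.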

Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

  • Meihuizi Jia
  • Xin Shen
  • Lei Shen
  • Jinhui Pang
  • Lejian Liao
  • Yang Song
  • Meng Chen
  • Xiaodong He

Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017, and yields competitive results on Twitter2015.

Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks

  • Md Fahim Faysal Khan
  • Anusha Devulapally
  • Siddharth Advani
  • Vijaykrishnan Narayanan

Accurately measuring the absolute depth of every pixel captured by an imaging sensor is of critical importance in real-time applications such as autonomous navigation, augmented reality and robotics. In order to predict dense depth, a general approach is to fuse sensor inputs from different modalities such as LiDAR, camera and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth data but are quite sparse, both spatially and temporally. To augment missing depth information, generally RGB guidance is leveraged due to its high resolution information. Due to the reliance on multiple sensor modalities, design for robustness and adaptation is essential. In this work, we propose a transformer-like self-attention based generative adversarial network to estimate dense depth using RGB and sparse depth data. We introduce a novel training recipe for making the model robust so that it works even when one of the input modalities is not available. The multi-head self-attention mechanism can dynamically attend to most salient parts of the RGB image or corresponding sparse depth data producing the most competitive results. Our proposed network also requires less memory for training and inference compared to other existing heavily residual connection based convolutional neural networks, making it more suitable for resource-constrained edge applications. The source code is available at:
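One plausible reading of the robustness training recipe above is modality dropout: occasionally blanking a randomly chosen input so the fusion network learns to operate when a sensor is missing. The sketch below is an assumption about the recipe's shape, not the authors' code:

```python
import numpy as np

def modality_dropout(rgb, sparse_depth, p_drop=0.2, rng=None):
    # With probability p_drop, blank out one randomly chosen input modality during
    # training so the fusion model learns to produce depth from the remaining one.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            rgb = np.zeros_like(rgb)
        else:
            sparse_depth = np.zeros_like(sparse_depth)
    return rgb, sparse_depth
```

At inference time no inputs are dropped; the augmentation only ensures the learned attention can redistribute toward whichever modality is actually present.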

Caption-Aware Medical VQA via Semantic Focusing and Progressive Cross-Modality Comprehension

  • Fuze Cong
  • Shibiao Xu
  • Li Guo
  • Yinbing Tian

Medical Visual Question Answering, as a domain-specific task, requires substantial prior knowledge of medicine. However, deep learning techniques encounter severe problems of limited supervision due to the scarcity of well-annotated large-scale medical VQA datasets. To alleviate this data limitation, image captioning can be introduced to learn summary information about the image, which is beneficial to question answering. To this end, we propose a caption-aware VQA method that can read summary information of image content and clinical diagnoses from abundant medical images and answer the medical question with richer multi-modality features. The proposed method consists of two novel components emphasizing semantic locations and semantic content, respectively. First, to extract and leverage the semantic locations implied in image captioning, similarity analysis is designed to summarize the attention maps generated from image captioning by their relevance and to guide the visual model to focus on the semantic-rich regions. Second, to incorporate the semantic content in the generated captions, we propose a Progressive Compact Bilinear Interactions structure to achieve cross-modality comprehension over the image, question, and caption features by performing bilinear attention in a gradual manner. Qualitative and quantitative experiments on various medical datasets exhibit the superiority of the proposed approach compared to state-of-the-art methods.

Complementarity-Enhanced and Redundancy-Minimized Collaboration Network for Multi-agent Perception

  • Guiyang Luo
  • Hui Zhang
  • Quan Yuan
  • Jinglin Li

Multi-agent collaborative perception depends on sharing sensory information to improve perception accuracy and robustness, as well as to extend coverage. The cooperative shared information between agents should achieve an equilibrium between redundancy and complementarity, thus creating a concise and composite representation. To this end, this paper presents a complementarity-enhanced and redundancy-minimized collaboration network (CRCNet), for efficiently guiding and supervising the fusion among shared features. Our key novelties lie in two aspects. First, each fused feature is forced to bring about a marginal gain by exploiting a contrastive loss, which can supervise our model to select complementary features. Second, mutual information is applied to measure the dependence between fused feature pairs and the upper bound of mutual information is minimized to encourage independence, thus guiding our model to select irredundant features. Furthermore, the above modules are incorporated into a feature fusion network CRCNet. Our quantitative and qualitative experiments in collaborative object detection show that CRCNet performs better than the state-of-the-art methods.

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

  • Qian Yang
  • Yunxin Li
  • Baotian Hu
  • Lin Ma
  • Yuxin Ding
  • Min Zhang

Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence explaining the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image yet ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to visual-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (CSI), a relation inferrer, and a Lexical Constraint-aware Generator (LeCG). Specifically, CSI exploits the sentence structure inherent in language and the various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG utilizes lexical constraints to expressly incorporate the words or chunks focused on by the relation inferrer into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the results indicate that CALeC significantly outperforms competitor models in inference accuracy and in the quality of the generated explanations.

Two-Stream Transformer for Multi-Label Image Classification

  • Xuelin Zhu
  • Jiuxin Cao
  • Jiawei Ge
  • Weijia Liu
  • Bo Liu

Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects from a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with a global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from spatial stream into semantic stream and update label semantics via a residual connection. In this way, the semantic gap between two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, including Pascal VOC 2007, Microsoft COCO and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.
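A single cross-modal attention step of the kind described, with label embeddings querying patch features and a residual update of the label semantics, can be sketched as follows. Shapes and names are hypothetical, not the TSFormer code:

```python
import numpy as np

def cross_modal_attention(labels, patches):
    # One cross-modal attention step: label embeddings (L x d) act as queries over
    # patch features (N x d); the attended visual context is added back to the
    # label semantics via a residual connection.
    scores = labels @ patches.T / np.sqrt(labels.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return labels + attn @ patches
```

Stacking such steps layer by layer is what lets the semantic stream gradually absorb visual evidence, the multi-shot alternative to a one-shot attention alignment.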

SoftSkip: Empowering Multi-Modal Dynamic Pruning for Single-Stage Referring Comprehension

  • Dulanga Weerakoon
  • Vigneshwaran Subbaraju
  • Tuan Tran
  • Archan Misra

Supporting real-time referring expression comprehension (REC) on pervasive devices is an important capability for human-AI collaborative tasks. Model pruning techniques, applied to DNN models, can enable real-time execution even on resource-constrained devices. However, existing pruning strategies are designed principally for uni-modal applications and suffer a significant loss of accuracy when applied to REC tasks that require fusion of textual and visual inputs. We thus present a multi-modal pruning model, LGMDP, which uses language as a pivot to dynamically and judiciously select the relevant computational blocks that need to be executed. LGMDP also introduces a new SoftSkip mechanism, whereby 'skipped' visual scales are not completely eliminated but approximated with minimal additional computation. Experimental evaluation, using 3 benchmark REC datasets and an embedded device implementation, shows that LGMDP can achieve 33% latency savings with an accuracy loss of 0.5%-2%.
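The SoftSkip idea, never executing a gated-off heavy branch but substituting a cheap approximation rather than dropping the scale entirely, can be caricatured in a few lines. The threshold gate and the callables are illustrative assumptions, not LGMDP's actual mechanism:

```python
def soft_skip(feat, gate, heavy_branch, cheap_approx, thresh=0.5):
    # When the language-derived gate falls below the threshold, the heavy branch is
    # never executed; its output is stood in for by a much cheaper approximation,
    # so the "skipped" scale still contributes a coarse signal downstream.
    if gate < thresh:
        return cheap_approx(feat)
    return heavy_branch(feat)
```

The latency saving comes from the conditional itself: the heavy branch's FLOPs are spent only on scales the language pivot deems relevant.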

Unbiased Directed Object Attention Graph for Object Navigation

  • Ronghao Dang
  • Zhuofan Shi
  • Liuyi Wang
  • Zongtao He
  • Chengju Liu
  • Qijun Chen

Object navigation tasks require agents to locate specific objects in unknown environments based on visual information. Previously, graph convolutions were used to implicitly explore the relationships between objects. However, due to differences in visibility among objects, it is easy to generate biases in object attention. Thus, in this paper, we propose a directed object attention (DOA) graph to guide the agent in explicitly learning the attention relationships between objects, thereby reducing the object attention bias. In particular, we use the DOA graph to perform unbiased adaptive object attention (UAOA) on the object features and unbiased adaptive image attention (UAIA) on the raw images, respectively. To distinguish features in different branches, a concise adaptive branch energy distribution (ABED) method is proposed. We assess our methods on the AI2-Thor dataset. Compared with the state-of-the-art (SOTA) method, our method reports 7.4%, 8.1% and 17.6% increase in success rate (SR), success weighted by path length (SPL) and success weighted by action efficiency (SAE), respectively.

FastPR: One-stage Semantic Person Retrieval via Self-supervised Learning

  • Meng Sun
  • Ju Ren
  • Xin Wang
  • Wenwu Zhu
  • Yaoxue Zhang

Semantic person retrieval aims to locate a specific person in an image given a semantic description, which has shown great significance in surveillance and security applications. Prior works commonly adopt a two-stage method that first extracts persons with a pretrained detector and then finds the target that best matches the description. However, existing works suffer from high computational complexity and a low recall rate caused by error accumulation in the two-stage inference. To solve these problems, we propose FastPR, a one-stage semantic person retrieval method via self-supervised learning, which optimizes person localization and semantic retrieval simultaneously. Specifically, we propose a dynamic visual-semantic alignment mechanism that utilizes grid-based attention to fuse the cross-modal features and employs a label prediction proxy task to constrain the attention process. To tackle the challenges that real-world surveillance images may suffer from low resolution and occlusion, and that the target persons may be within a crowd, we further propose a dual-granularity person localization module: an upsampling reconstruction proxy task enhances the local features of the target person in the fused features, followed by a tailored offset prediction proxy task that makes the localization network capable of accurately identifying and distinguishing the target person in a crowd. Experimental results demonstrate that FastPR achieves the best retrieval accuracy compared to state-of-the-art baseline methods, with over 15 times reduction in inference time.

Towards Counterfactual Image Manipulation via CLIP

  • Yingchen Yu
  • Fangneng Zhan
  • Rongliang Wu
  • Jiahui Zhang
  • Shijian Lu
  • Miaomiao Cui
  • Xuansong Xie
  • Xian-Sheng Hua
  • Chunyan Miao

Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes, such as the age and gender of facial images. An intriguing yet challenging problem arises: can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive Language-Image Pretraining (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts. Different from in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of the semantic knowledge encapsulated in CLIP, as well as more delicate handling of editing directions to avoid getting stuck in local minima or undesired editing. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps CLIP embeddings (of target text) to the latent space and fuses them with latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing while driven by target texts with various counterfactual concepts.
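CLIP-space direction guidance of this kind is often implemented as a directional loss that aligns the edit direction in image-embedding space with the source-to-target direction in text-embedding space. The sketch below shows that generic form as an assumption, not the paper's exact contrastive loss:

```python
import numpy as np

def directional_loss(img_src, img_edit, txt_src, txt_tgt):
    # The move in CLIP image-embedding space caused by the edit should be parallel
    # to the move from the source text embedding to the target text embedding;
    # the loss is 1 - cosine similarity between the two direction vectors.
    d_img = img_edit - img_src
    d_txt = txt_tgt - txt_src
    cos = d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt) + 1e-8)
    return 1.0 - cos
```

Because only the direction is constrained, not the absolute embedding, this objective avoids collapsing all edits onto the target text embedding itself.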

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

  • Jiaqing Fan
  • Tiankang Su
  • Kaihua Zhang
  • Qingshan Liu

Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which needs an effective feature propagation paradigm for both appearance and motion features that can fully interchange information across frames. However, existing solutions mainly focus on forward feature propagation from the preceding frame to the current one, either using the former segmentation mask or motion propagation in a frame-by-frame manner. This ignores the bi-directional temporal feature interactions (including backward propagation from future frames to the current one) across all frames that can help enhance the spatio-temporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) to fully integrate forward and backward propagation across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate features across all frames in both forward and backward directions. This fully captures the multi-level spatio-temporal contextual information across all frames, producing an effective feature representation with a strong discriminative capability to distinguish objects from noisy backgrounds. Following it, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features, effectively capturing the spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearance-motion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axes, making the segmentation model selectively focus on the most relevant neighbors. Extensive evaluations on four challenging mainstream benchmarks, including DAVIS16, FBMS, DAVSOD, and MCL, demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.

Weakly Supervised Video Salient Object Detection via Point Supervision

  • Shuyong Gao
  • Haozhe Xing
  • Wei Zhang
  • Yan Wang
  • Qianyu Guo
  • Wenqiang Zhang

Fully supervised video salient object detection models have achieved excellent performance, yet obtaining pixel-by-pixel annotated datasets is laborious. Several works attempt to use scribble annotations to mitigate this problem, but point supervision, a more labor-saving annotation method (arguably the most labor-saving among manual annotation methods for dense prediction), has not been explored. In this paper, we propose a strong baseline model based on point supervision. To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives, respectively. Specifically, we propose a hybrid token attention module, which mixes optical flow and image information from orthogonal directions, adaptively highlighting critical optical flow information (channel dimension) and critical token information (spatial dimension). To exploit long-term cues, we develop the Long-term Cross-Frame Attention module (LCFA), which assists the current frame in inferring salient objects based on multi-frame tokens. Furthermore, we label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets. Experiments on six benchmark datasets show that our method outperforms previous state-of-the-art weakly supervised methods and is even comparable with some fully supervised approaches. Our source code and datasets are available at:

Look Less Think More: Rethinking Compositional Action Recognition

  • Rui Yan
  • Peng Huang
  • Xiangbo Shu
  • Junhao Zhang
  • Yonghua Pan
  • Jinhui Tang

Compositional action recognition, which aims to identify unseen combinations of actions and objects, has recently attracted wide attention. Conventional methods bring in additional cues (e.g., dynamic motions of objects) to alleviate the inductive bias between the visual appearance of objects and the human action-level labels. Besides, compared with non-compositional settings, previous methods only pursue higher performance in compositional settings, which cannot demonstrate their generalization ability. To this end, we first rethink the problem and design a more generalized metric (namely Drop Ratio) and a more practical setting to evaluate the compositional generalization of existing action recognition algorithms. Beyond that, we propose a simple yet effective framework, Look Less Think More (LLTM), to reduce the strong association between visual objects and action-level labels (Look Less), and then discover the commonsense relationships between object categories and human actions (Think More). We test the rationality of the proposed Drop Ratio and practical setting by comparing several popular action recognition methods on SSV2. Besides, the proposed LLTM achieves state-of-the-art performance on SSV2 under different settings.

Continual Multi-view Clustering

  • Xinhang Wan
  • Jiyuan Liu
  • Weixuan Liang
  • Xinwang Liu
  • Yi Wen
  • En Zhu

With the increase of multimedia applications, data are often collected from multiple sensors or modalities, encouraging the rapid development of multi-view (also called multi-modal) clustering techniques. As a representative, late fusion multi-view clustering has attracted extensive attention due to its low computational complexity yet promising performance. However, most such algorithms deal with the clustering problem in which all data views are available in advance, and overlook scenarios where data observations of new views are accumulated over time. To solve this issue, we propose a continual approach on the basis of the late fusion multi-view clustering framework. Specifically, it only needs to maintain a consensus partition matrix and update knowledge with the incoming partition of a new data view, rather than keeping all of them. This prevents previously learned knowledge from being recomputed over and over again, saving a large amount of computation resources, time, and labor. Furthermore, we design an alternating and convergent strategy to solve the resultant optimization problem. The proposed algorithm shows excellent clustering performance and time/space efficiency in the experiments.
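The consensus-maintenance idea can be sketched as follows (a hypothetical numpy illustration that aligns the new view's partition to the consensus via an orthogonal Procrustes rotation before blending; the paper's actual update rule is not reproduced here):

```python
import numpy as np

def update_consensus(H, P_new, weight=0.5):
    """Toy sketch of a late-fusion continual update: align the new
    view's partition matrix to the maintained consensus with an
    orthogonal Procrustes rotation, then blend the two. Only the
    consensus is kept; earlier views are never revisited."""
    # best rotation R maximizing trace((P_new @ R).T @ H)
    U, _, Vt = np.linalg.svd(P_new.T @ H)
    R = U @ Vt
    return (1 - weight) * H + weight * (P_new @ R)

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 2))          # consensus: 6 samples, 2 clusters
P = H @ np.array([[0., -1.], [1., 0.]])  # new view: a rotated copy of H
H_updated = update_consensus(H, P)
```

Since the new view here is just a rotation of the consensus, alignment recovers it exactly and the consensus is unchanged.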

Efficient Anchor Learning-based Multi-view Clustering -- A Late Fusion Method

  • Tiejian Zhang
  • Xinwang Liu
  • En Zhu
  • Sihang Zhou
  • Zhibin Dong

Anchor-enhanced multi-view late fusion clustering has attracted numerous researchers' attention for its high clustering accuracy and promising efficiency. However, in existing methods, the anchor points are usually generated by sampling or linearly combining the samples within the datasets, which can result in enormous time consumption and limited representation capability. To solve this problem, we learn the view-specific anchor points directly through optimization. Specifically, we first reconstruct the partition matrix of each view by multiplying a view-specific anchor matrix with a consensus reconstruction matrix. Then, by maximizing the weighted alignment between the base partition matrix and its estimated version in each view, we learn the optimal anchor points for each view. In particular, unlike previous late fusion algorithms, which define anchor points as linear combinations of existing samples, we define anchor points as a series of orthogonal vectors that are directly learned through optimization, which expands the learning space of the anchor points. Moreover, based on the above design, the resultant algorithm has only linear complexity and no hyper-parameters. Experiments on 12 benchmark kernel datasets and 5 large-scale datasets illustrate that the proposed Efficient Anchor Learning-based Multi-view Clustering (AL-MVC) algorithm achieves state-of-the-art results in both clustering performance and efficiency.
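A minimal numpy sketch of reconstructing a view's partition matrix from orthogonal anchors (all shapes, and the QR-based orthogonalisation standing in for the paper's optimisation step, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical base partition matrix for one view
B = rng.standard_normal((8, 3))      # 8 samples, 3-dim partition

# anchors as orthogonal vectors; here we simply orthogonalise a
# random initialisation via QR as a stand-in for directly learning
# them through optimisation
A, _ = np.linalg.qr(rng.standard_normal((8, 3)))

# consensus reconstruction matrix: B ~ A @ C; because A has
# orthonormal columns, the least-squares solution is A.T @ B
C = A.T @ B
B_hat = A @ C                         # reconstructed partition matrix
```

Defining anchors as orthogonal vectors (rather than combinations of existing samples) is what expands the learning space in the paper's formulation; the reconstruction above shows only the basic algebra.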

Cross-modal Knowledge Graph Contrastive Learning for Machine Learning Method Recommendation

  • Xianshuai Cao
  • Yuliang Shi
  • Jihu Wang
  • Han Yu
  • Xinjun Wang
  • Zhongmin Yan

The explosive growth of machine learning (ML) methods is overloading users with choices for learning tasks. Method recommendation aims to alleviate this problem by selecting the most appropriate ML methods for given learning tasks. Recent research shows that the descriptive and structural information of knowledge graphs (KGs) can significantly enhance the performance of ML method recommendation. However, existing studies have not fully explored the descriptive information in KGs, nor have they effectively exploited the descriptive and structural information to provide the necessary supervision. To address these limitations, we distinguish descriptive attributes from the traditional relationships in KGs, treating the rest as structural connections, to expand the scope of KG descriptive information. Based on this insight, we propose the Cross-modal Knowledge Graph Contrastive learning (CKGC) approach, which regards information from descriptive attributes and structural connections as two modalities, learning informative node representations by maximizing the agreement between the descriptive view and the structural view. Through extensive experiments, we demonstrate that CKGC significantly outperforms state-of-the-art baselines, achieving around 2% more accurate click-through rate (CTR) prediction, over 30% more accurate top-10 recommendation, and over 50% more accurate top-20 recommendation compared to the best performing existing approach.
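The view-agreement objective can be illustrated with a generic InfoNCE-style contrastive loss between the two views of the same nodes (a hypothetical numpy sketch; CKGC's actual encoders and objective are not reproduced here):

```python
import numpy as np

def info_nce(desc, struct, tau=0.1):
    """Toy InfoNCE-style agreement loss between the descriptive and
    structural views of the same nodes: matching rows are positives,
    all other pairs act as negatives."""
    a = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    b = struct / np.linalg.norm(struct, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # maximise diagonal agreement

rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 4))
aligned = info_nce(nodes, nodes + 0.01 * rng.standard_normal((5, 4)))
random_ = info_nce(nodes, rng.standard_normal((5, 4)))
```

When the two views encode the same nodes consistently, the loss falls well below the uniform-assignment baseline of log(N).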

Multigranular Visual-Semantic Embedding for Cloth-Changing Person Re-identification

  • Zan Gao
  • Hongwei Wei
  • Weili Guan
  • Weizhi Nie
  • Meng Liu
  • Meng Wang

To date, only a few works have focused on the cloth-changing person re-identification (ReID) task; because it is very difficult to extract generalized and robust features to represent people wearing different clothes, their performance still needs improvement. Moreover, visual-semantic information is also often ignored. To solve these issues, in this work, a novel multigranular visual-semantic embedding algorithm (MVSE) is proposed for cloth-changing person ReID, where visual semantic information and human attributes are embedded into the network, and the generalized features of human appearance can be well learned to effectively solve the problem of clothing changes. Specifically, to fully represent a person with clothing changes, a multigranular feature representation scheme (MGR) is employed to adaptively extract multilevel and multigranular feature information, and then a cloth desensitization network (CDN) is designed to improve feature robustness for persons in different clothes, where different high-level human attributes are fully utilized. Moreover, to further address pose changes and occlusion under different camera perspectives, a partially semantically aligned network (PSA) is proposed to obtain the visual-semantic information used to align the human attributes. Most importantly, these three modules are jointly explored in a unified framework. Extensive experimental results on four cloth-changing person ReID datasets demonstrate that the MVSE algorithm can extract highly robust feature representations of cloth-changing persons and can outperform state-of-the-art cloth-changing person ReID approaches.

Adaptive Structural Similarity Preserving for Unsupervised Cross Modal Hashing

  • Liang Li
  • Baihua Zheng
  • Weiwei Sun

Cross-modal hashing is an important approach for multimodal data management and application. Existing unsupervised cross-modal hashing algorithms mainly rely on data features from pre-trained models to mine their similarity relationships. However, their optimization objectives are based on a static metric between the original uni-modal features, without further exploring data correlations during training. In addition, most of them mainly focus on association mining and alignment among pairwise instances in continuous space but ignore the latent structural correlations contained in the semantic hashing space. In this paper, we propose an unsupervised hash learning framework, ASSPH, to solve the above problems. Firstly, we propose an adaptive learning scheme, with limited data and training batches, to enrich the semantic correlations of unlabeled instances during the training process while ensuring smooth convergence. Secondly, we present an asymmetric structural semantic representation learning scheme. We introduce structural semantic metrics based on graph adjacency relations and align the inter- and intra-modal semantics in the hash space with an asymmetric binary optimization process. Finally, we conduct extensive experiments to validate the enhancements of our work in comparison with existing works.

CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation

  • Hao Sun
  • Hongyi Wang
  • Jiaqing Liu
  • Yen-Wei Chen
  • Lanfen Lin

Multimodal sentiment analysis and depression estimation are two important research topics that aim to predict human mental states using multimodal data. Previous research has focused on developing effective fusion strategies for exchanging and integrating mind-related information from different modalities. Some MLP-based techniques have recently achieved considerable success in a variety of computer vision tasks. Inspired by this, we explore multimodal approaches from a feature-mixing perspective in this study. To this end, we introduce CubeMLP, a multimodal feature processing framework based entirely on MLPs. CubeMLP consists of three independent MLP units, each of which has two affine transformations. CubeMLP accepts all relevant modality features as input and mixes them across three axes. After processing with CubeMLP, the mixed multimodal features are flattened for task predictions. Our experiments are conducted on the sentiment analysis datasets CMU-MOSI and CMU-MOSEI and the depression estimation dataset AVEC2019. The results show that CubeMLP can achieve state-of-the-art performance with a much lower computing cost.
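The three-axis mixing can be sketched in a few lines of numpy (a toy stand-in: real CubeMLP units include biases, activations, and trained weights, none of which are modeled here):

```python
import numpy as np

def mix_axis(x, axis, rng):
    """Apply a random linear transform along one axis of the
    (sequence, modality, channel) feature cube; a stand-in for one
    of CubeMLP's three MLP units (biases/activations omitted)."""
    n = x.shape[axis]
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    return np.apply_along_axis(lambda v: W @ v, axis, x)

rng = np.random.default_rng(0)
cube = rng.standard_normal((10, 3, 8))   # (time steps, modalities, channels)
out = cube
for ax in range(3):                       # mix across all three axes in turn
    out = mix_axis(out, ax, rng)
flat = out.reshape(-1)                    # flattened for task prediction
```

The point of the axis-wise design is that every unit mixes information along exactly one dimension, so the full cube is fused with far fewer parameters than a dense layer over all features.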

Generalized Global Ranking-Aware Neural Architecture Ranker for Efficient Image Classifier Search

  • Bicheng Guo
  • Tao Chen
  • Shibo He
  • Haoyu Liu
  • Lilin Xu
  • Peng Ye
  • Jiming Chen

Neural Architecture Search (NAS) is a powerful tool for automating the design of effective image processing DNNs. Ranking has been advocated for designing efficient performance predictors for NAS. Previous contrastive methods solve the ranking problem by comparing pairs of architectures and predicting their relative performance. However, they only focus on the ranking between the two involved architectures and neglect the overall quality distribution of the search space, which may cause generalization issues. A predictor, namely the Neural Architecture Ranker (NAR), which concentrates on the global quality tier of a specific architecture, is proposed to tackle such problems caused by the local perspective. The NAR explores the quality tiers of the search space globally and classifies each architecture into the tier it belongs to according to its global ranking. Thus, the predictor gains knowledge of the performance distribution of the search space, which helps it generalize its ranking ability to new datasets more easily. Meanwhile, the global quality distribution facilitates the search phase by directly sampling candidates according to the statistics of quality tiers, which is free of training a search algorithm, e.g., Reinforcement Learning (RL) or an Evolutionary Algorithm (EA); this simplifies the NAS pipeline and saves computational overhead. The proposed NAR achieves better performance than state-of-the-art methods on two widely used datasets for NAS research. On the vast search space of NAS-Bench-101, the NAR easily finds the architecture with top 0.01 performance only by sampling. It also generalizes well to different image datasets of NAS-Bench-201, i.e., CIFAR-10, CIFAR-100, and ImageNet-16-120, by identifying the optimal architectures for each of them.

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

  • Jinxiang Liu
  • Chen Ju
  • Weidi Xie
  • Ya Zhang

We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables the learning of useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) the composition of data augmentations plays a critical role, i.e., explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e., the detected sound source should follow the same transformation applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performance, even competitive with the supervised approach in audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalize to further applications. The project page is
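The transformation-equivariance criterion can be illustrated with a toy localiser (a hypothetical numpy sketch; the real model is a learned audio-visual network, and the flip is just one example transformation):

```python
import numpy as np

def localize(frame):
    """Hypothetical sound localiser: a simple normalisation of the
    frame standing in for the model's localisation heatmap head."""
    return frame / frame.sum()

def equivariance_gap(frame):
    """Transformation equivariance: the heatmap of a flipped frame
    should equal the flipped heatmap of the original frame."""
    flipped_pred = localize(np.fliplr(frame))
    pred_flipped = np.fliplr(localize(frame))
    return np.abs(flipped_pred - pred_flipped).mean()

rng = np.random.default_rng(0)
frame = rng.random((4, 4)) + 1.0
gap = equivariance_gap(frame)        # zero for a perfectly equivariant model
```

In training, this gap (computed for the learned model rather than the toy one above) is what the geometric-consistency objective drives towards zero.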

Unsupervised Video Hashing with Multi-granularity Contextualization and Multi-structure Preservation

  • Yanbin Hao
  • Jingru Duan
  • Hao Zhang
  • Bin Zhu
  • Pengyuan Zhou
  • Xiangnan He

Unsupervised video hashing typically aims to learn a compact binary vector to represent complex video content without using manual annotations. Existing unsupervised hashing methods generally suffer from incomplete exploration of various perspective dependencies (e.g., long-range and short-range) and data structures that exist in visual contents, resulting in less discriminative hash codes. In this paper, we propose a Multi-granularity Contextualized and Multi-Structure preserved Hashing (MCMSH) method, exploring multiple axial contexts for discriminative video representation generation and various structural information for unsupervised learning simultaneously. Specifically, we delicately design three self-gating modules to separately model three granularities of dependencies (i.e., long/middle/short-range dependencies) and densely integrate them into MLP-Mixer for feature contextualization, leading to a novel model MC-MLP. To facilitate unsupervised learning, we investigate three kinds of data structures, including clusters, local neighborhood similarity structure, and inter/intra-class variations, and design a multi-objective task to train MC-MLP. These data structures show high complementarities in hash code learning. We conduct extensive experiments using three video retrieval benchmark datasets, demonstrating that our MCMSH not only boosts the performance of the backbone MLP-Mixer significantly but also outperforms the competing methods notably. Code is available at:

DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gestures Synthesis

  • Haiyang Liu
  • Naoya Iwamoto
  • Zihao Zhu
  • Zhengqing Li
  • You Zhou
  • Elif Bozkurt
  • Bo Zheng

Current co-speech gesture synthesis methods struggle with generating diverse motions and typically collapse to a single or a few frequent motion sequences, as they are trained on the original data distribution with customized models and strategies. We tackle this problem by temporally clustering motion sequences into content and rhythm segments and then training on a content-balanced data distribution. In particular, by clustering motion sequences, we observed that for each rhythm pattern, some motions appear frequently, while others appear rarely. This imbalance results in difficulty generating infrequent motions, and it cannot be easily solved by resampling, due to the inherent many-to-many mapping between content and rhythm. Therefore, we present DisCo, which disentangles motion into implicit content and rhythm features via contrastive loss in order to adopt different data balance strategies. Besides, to model the inherent mapping between content and rhythm features, we design a diversity-and-inclusion network (DIN), which first generates content feature candidates and then selects one candidate by learned voting. Experiments on two public datasets, Trinity and S2G-Ellen, justify that DisCo generates more realistic and diverse motions than state-of-the-art methods. Code and data are available at

Adaptively-weighted Integral Space for Fast Multiview Clustering

  • Man-Sheng Chen
  • Tuo Liu
  • Chang-Dong Wang
  • Dong Huang
  • Jian-Huang Lai

Multiview clustering has been extensively studied to take advantage of multi-source information to improve clustering performance. In general, most existing works typically compute an $n\times n$ affinity graph by some similarity/distance metric (e.g., the Euclidean distance) or learned representations, and explore the pairwise correlations across views. But unfortunately, a quadratic or even cubic complexity is often needed, bringing about difficulty in clustering large-scale datasets. Some efforts have been made recently to capture data distribution in multiple views by selecting view-wise anchor representations with k-means, or by direct matrix factorization on the original observations. Despite the significant success, few of them have considered the view-insufficiency issue, implicitly holding the assumption that each individual view is sufficient to recover the cluster structure. Moreover, the latent integral space as well as the shared cluster structure from multiple insufficient views cannot be simultaneously discovered. In view of this, we propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) method with nearly linear complexity. Specifically, view generation models are designed to reconstruct the view observations from the latent integral space with diverse adaptive contributions. Meanwhile, a centroid representation with orthogonality constraint and a cluster partition are seamlessly constructed to approximate the latent integral space. An alternate minimizing algorithm is developed to solve the optimization problem, which is proved to have linear time complexity w.r.t. the sample size. Extensive experiments conducted on several real-world datasets confirm the superiority of the proposed AIMC method compared with state-of-the-art methods.

Towards All Weather and Unobstructed Multi-Spectral Image Stitching: Algorithm and Benchmark

  • Zhiying Jiang
  • Zengxi Zhang
  • Xin Fan
  • Risheng Liu

Image stitching is a fundamental task that requires multiple images from different viewpoints to generate a wide field-of-view (FOV) scene. Previous methods are developed on RGB images. However, severe weather and harsh conditions, such as rain, fog, low light, strong light, etc., may introduce evident interference on visible images, leading to distortion and misalignment of the stitched results. To remedy the deficient imaging of optical sensors, we investigate the complementarity across infrared and visible images to improve the perception of scenes in terms of visual information and viewing ranges. Instead of a cascaded fusion-stitching process, where the inaccuracy accumulation caused by image fusion hinders stitching performance, especially through content loss and ghosting effects, we develop a learnable feature adaptive network to investigate a stitch-oriented feature representation and perform the information complementation at the feature level. By introducing a pyramidal structure along with global fast correlation regression, the quadrature attention based correspondence becomes more responsible for feature alignment, and the estimation of sparse offsets can be realized in a coarse-to-fine manner. Furthermore, we propose the first infrared and visible image based multi-spectral image stitching dataset, covering a more comprehensive range of scenarios and diverse viewing baselines. Extensive experiments on real-world data demonstrate that our method reconstructs wide FOV images with more credible structure and complementary information than state-of-the-art methods.

A Parameter-free Multi-view Information Bottleneck Clustering Method by Cross-view Weighting

  • Shizhe Hu
  • Ruilin Geng
  • Zhaoxu Cheng
  • Chaoyang Zhang
  • Guoliang Zou
  • Zhengzheng Lou
  • Yangdong Ye

With the fast-growing multi-modal/media data in the Big Data era, multi-view clustering (MVC) has attracted much attention lately. Most MVC methods focus on integrating and utilizing the complementary information among views via a linear sum of learned view weights and have shown great success in some fields. However, they fail to quantify how much complementary information across views is actually utilized to benefit the final clustering. Additionally, most of them contain at least one regularization parameter set without prior knowledge, which puts pressure on parameter tuning and thus makes them impractical. In this paper, we propose a novel parameter-free multi-view information bottleneck (PMIB) clustering method to automatically identify and exploit useful complementary information among views, thus reducing the negative impact from harmful views. Specifically, we first discover the informative view by measuring, with mutual information, the relevant information preserved between the original data and the compact clusters. Then, a new cross-view weight learning scheme is designed to learn the degree of complementarity between the informative view and the remaining views. Finally, the quantitative correlations among views are fully exploited to improve the clustering performance without needing any additional parameters or prior knowledge. Experimental results on different kinds of multi-view datasets show the effectiveness of the proposed method.

HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

  • Mengze Li
  • Tianbao Wang
  • Haoyu Zhang
  • Shengyu Zhang
  • Zhou Zhao
  • Wenqiao Zhang
  • Jiaxu Miao
  • Shiliang Pu
  • Fei Wu

Video Object Grounding (VOG) is the problem of associating spatial object regions in a video to a descriptive natural language query. This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby localizing the specific objects accurately. In this paper, we tackle this task with a novel framework called HiErarchical spatio-tempoRal reasOning (HERO) with contrastive action correspondence. We study the VOG task from two aspects that prior works overlooked: (1) Contrastive Action Correspondence-aware Retrieval. Noticing that the fine-grained video semantics (e.g., multiple actions) are not totally aligned with the annotated language query (e.g., a single action), we first introduce weakly-supervised contrastive learning that classifies video frames as action-consistent or action-independent based on the video-caption action semantic correspondence. Such a design can build the fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement. While transformer-based VOG models show their potential in sequential modality (i.e., video and caption) modeling, existing evidence also indicates that the transformer suffers from insensitivity to spatio-temporal locality. Motivated by this, we carefully design hierarchical reasoning layers to decouple fully connected multi-head attention and remove the redundant interfering correlations. Furthermore, our proposed pyramid and shifted alignment mechanisms are effective in improving the cross-modal information utilization of neighborhood spatial regions and temporal frames. We conducted extensive experiments to show that HERO outperforms existing techniques, achieving significant improvements on two benchmark datasets.

MAVT-FG: Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition

  • Xiaoyu Zhou
  • Xiaotong Song
  • Hao Wu
  • Jingran Zhang
  • Xing Xu

Weakly-supervised fine-grained recognition aims to detect potential differences between subcategories at a more detailed scale without using any manual annotations. While most recent works focus on classical image-based fine-grained recognition that recognizes subcategories at image-level, video-based fine-grained recognition is much more challenging and specifically needed. In this paper, we propose a Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition (MAVT-FG) model which incorporates audio-visual modalities. Specifically, MAVT-FG consists of Audio-Visual Dual-Encoder for feature extraction, Cross-Decoder for Audio-Visual Fusion (DAVF) to exploit inherent cues and correspondences between two modalities, and Search-and-Select Fine-grained Branch (SSFG) to capture the most discriminative regions. Furthermore, we construct a new benchmark: Fine-grained Birds of Audio-Visual (FGB-AV) for audio-visual weakly-supervised fine-grained recognition at video-level. Experimental results show that our method achieves superior performance and outperforms other state-of-the-art methods.

Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization

  • Haichao Shi
  • Xiao-Yu Zhang
  • Changsheng Li
  • Lixing Gong
  • Yong Li
  • Yongjun Bao

Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion of RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization in hard-to-discriminate cases such as action-context interference and in-action stationary periods. As an action typically comprises multiple stages, an intuitive solution is to model the relations between finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relations between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive a more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-art.
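The triplet adjacency construction can be sketched as a combination of three similarity matrices (a hypothetical numpy illustration; the equal weighting and the temporal-proximity proxy for contextual similarity are placeholders, not the paper's definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
seg = rng.standard_normal((5, 8))        # 5 pseudo action segments, 8-dim features
score = rng.random(5)                    # per-segment actionness scores

# (1) mutual importance: outer product of actionness scores
A_imp = np.outer(score, score)
# (2) feature similarity: cosine similarity between segment features
f = seg / np.linalg.norm(seg, axis=1, keepdims=True)
A_feat = f @ f.T
# (3) high-level contextual similarity: here a toy proxy based on
# temporal proximity of segment indices
idx = np.arange(5)
A_ctx = np.exp(-np.abs(idx[:, None] - idx[None, :]))

# triplet adjacency combines the three components (equal weights here)
A = (A_imp + A_feat + A_ctx) / 3.0
```

Graph convolution over such an adjacency lets each pseudo action segment aggregate evidence from segments that are important, visually similar, or contextually related.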

Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation

  • Miaoyu Li
  • Yachao Zhang
  • Yuan Xie
  • Zuodong Gao
  • Cuihua Li
  • Zhizhong Zhang
  • Yanyun Qu

With the emergence of multi-modal datasets where LiDAR and camera are synchronized and calibrated, cross-modal Unsupervised Domain Adaptation (UDA) has attracted increasing attention because it reduces the laborious annotation of target domain samples. To alleviate the distribution gap between source and target domains, existing methods conduct feature alignment using adversarial learning. However, adversarial learning is well known to be highly sensitive to hyperparameters and difficult to train. In this paper, we propose a novel model (Dual-Cross) that integrates Cross-Domain Knowledge Distillation (CDKD) and Cross-Modal Knowledge Distillation (CMKD) to mitigate domain shift. Specifically, we design multi-modal style transfer to convert source images and point clouds to the target style. With these synthetic samples as input, we introduce a target-aware teacher network to learn knowledge of the target domain. We then perform dual-cross knowledge distillation while the student is learning on the source domain. CDKD constrains teacher and student predictions under the same modality to be consistent. It can transfer target-aware knowledge from the teacher to the student, making the student more adaptive to the target domain. CMKD generates a hybrid-modal prediction from the teacher predictions and constrains it to be consistent with both the 2D and 3D student predictions. It promotes the information interaction between the two modalities so that they complement each other. Evaluation results on various domain adaptation settings show that Dual-Cross significantly outperforms both uni-modal and cross-modal state-of-the-art methods.
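The same-modality consistency term can be illustrated with a standard KL-divergence distillation loss (a hypothetical numpy sketch of generic teacher-student distillation, not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Toy distillation term: KL divergence pushing the student's
    per-point class distribution towards the target-aware teacher's,
    within the same modality."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((10, 4))   # 10 points, 4 semantic classes
matched = distill_loss(teacher, teacher.copy())
mismatched = distill_loss(teacher, rng.standard_normal((10, 4)))
```

The loss is zero exactly when teacher and student agree, and positive otherwise, so minimising it transfers the teacher's target-aware predictions to the student without adversarial training.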

AVA-AVD: Audio-visual Speaker Diarization in the Wild

  • Eric Zhongcong Xu
  • Zeyang Song
  • Satoshi Tsutsui
  • Chao Feng
  • Mang Ye
  • Mike Zheng Shou

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set produces significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing these challenges, we design the Audio-Visual Relation Network (AVR-Net), which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust to varying ratios of off-screen speakers. Our data and code have been made publicly available.

Image-Signal Correlation Network for Textile Fiber Identification

  • Bo Peng
  • Liren He
  • Yining Qiu
  • Wu Dong
  • Mingmin Chi

Identifying fiber compositions is an important aspect of the textile industry. In recent decades, near-infrared (NIR) spectroscopy has shown its potential in the automatic detection of fiber components. However, for plant fibers such as cotton and linen, the chemical compositions are the same and thus the absorption spectra are very similar, leading to the problem of "different materials with the same spectrum, and the same material with different spectra"; it is difficult for a single mode of NIR signals to capture effective features to distinguish these fibers. To address this problem, textile experts measure the cross-sectional or longitudinal characteristics of fibers under a microscope to determine fiber contents, in a destructive way. In this paper, we construct the first NIR signal-microscope image textile fiber composition dataset (NIRITFC). Based on the NIRITFC dataset, we propose an image-signal correlation network (ISiC-Net) and design image-signal correlation perception and image-signal correlation attention modules, respectively, to effectively integrate the visual features (esp. local texture details of fibers) with the finer absorption spectrum information of the NIR signal to capture deep abstract features of the bimodal data for nondestructive textile fiber identification. To better learn the spectral characteristics of the fiber components, endmember vectors of the corresponding fibers are generated by embedding encoding, and a reconstruction loss is designed to guide the model to reconstruct the NIR signals of the corresponding fiber components via a nonlinear mapping. The quantitative and qualitative results are significantly improved compared to both single-modal and bimodal approaches, indicating the great potential of combining microscopic images and NIR signals for textile fiber composition identification.

Relation-enhanced Negative Sampling for Multimodal Knowledge Graph Completion

  • Derong Xu
  • Tong Xu
  • Shiwei Wu
  • Jingbo Zhou
  • Enhong Chen

Knowledge Graph Completion (KGC), which aims to infer the missing parts of Knowledge Graphs (KGs), has long been treated as a crucial task supporting downstream applications of KGs, especially for multimodal KGs (MKGs), which suffer from incomplete relations due to the insufficient accumulation of multimodal corpora. Though some research attention has been paid to the completion task for MKGs, there is still a lack of negative sampling strategies specially tailored to MKGs. Meanwhile, though effective negative sampling strategies have been widely regarded as crucial for KGC to alleviate the vanishing gradient problem, we realize that there is a unique challenge for negative sampling in MKGs: how to model the effect of KG relations while learning the complementary semantics among multiple modalities as extra context. In this case, traditional negative sampling techniques, which only consider structural knowledge, may fail on the multimodal KGC task. To that end, in this paper, we propose a MultiModal Relation-enhanced Negative Sampling (MMRNS) framework for the multimodal KGC task. Especially, we design a novel knowledge-guided cross-modal attention (KCA) mechanism, which provides bi-directional attention for visual and textual features by integrating relation embeddings. An effective contrastive semantic sampler is then devised by consolidating the KCA mechanism with contrastive learning. In this way, a more similar representation of semantic features between positive samples, as well as a more diverse representation between negative samples under different relations, can be learned. Afterwards, a masked Gumbel-softmax optimization mechanism is utilized to solve the non-differentiability of the sampling process, providing effective parameter optimization compared with traditional sampling strategies.
Extensive experiments on three multimodal KGs demonstrate that our MMRNS framework significantly outperforms state-of-the-art baseline methods, which validates the effectiveness of relation guidance in the multimodal KGC task.
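
The Gumbel-softmax relaxation referenced above can be illustrated with a minimal sketch of the plain, unmasked form (MMRNS's masked variant and its integration with the sampler are not reproduced here):

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling from a categorical
    distribution: add Gumbel(0, 1) noise to each logit, then apply a
    temperature-controlled softmax. As tau -> 0 the output approaches
    a one-hot sample; larger tau gives a smoother distribution."""
    rng = rng or random.Random(0)
    # Gumbel(0, 1) noise via inverse transform sampling.
    g = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + n) / tau for l, n in zip(logits, g)]
    # Numerically stable softmax.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
```

Because the noise enters additively and the softmax is smooth, gradients can flow through the relaxed "sample", which is what makes the sampling step trainable end to end.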

Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution

  • Wuxuan Shi
  • Mang Ye
  • Bo Du

Color-guided depth super-resolution (DSR) is an encouraging paradigm that enhances a low-resolution (LR) depth map guided by an extra high-resolution (HR) RGB image of the same scene. Existing methods usually use interpolation to upscale the depth maps before feeding them into the network and transfer the high-frequency information extracted from HR RGB images to guide the reconstruction of depth maps. However, the extracted high-frequency information usually contains textures that are not present in depth maps owing to the cross-modality gap, and the noise would be further aggravated by interpolation due to the resolution gap between the RGB and depth images. To tackle these challenges, we propose a novel Symmetric Uncertainty-aware Feature Transmission (SUFT) method for color-guided DSR. (1) For the resolution gap, SUFT builds an iterative up-and-down sampling pipeline, which makes depth features and RGB features spatially consistent while suppressing noise amplification and blurring by replacing the common interpolated pre-upsampling. (2) For the cross-modality gap, we propose a novel Symmetric Uncertainty scheme to remove the parts of RGB information harmful to the recovery of HR depth maps. Extensive experiments on benchmark datasets and challenging real-world settings suggest that our method achieves superior performance compared to state-of-the-art methods. Our code and models are available at

DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition

  • Jiawei Fan
  • Yu Zhao
  • Xie Yu
  • Lihua Ma
  • Junqi Liu
  • Fangqiu Yi
  • Boxun Li

An optimal representation should contain the maximum task-relevant information and the minimum task-irrelevant information, as revealed by the Information Bottleneck Principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context. However, these approaches still suffer from low generalization. In this paper, we propose a moderate optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck Principle, for an effective and generalized video representation without sacrificing any model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which benefits the extraction of sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to temporal modeling as well as data augmentation, and it achieves general improvement on both model-based and data-based approaches; (2) DTR is effective on 7 different datasets, especially motion-centric datasets, i.e., SSv1/SSv2, on which DTR obtains 6%/3.8% absolute gains in top-1 accuracy.

Self-Supervised Graph Neural Network for Multi-Source Domain Adaptation

  • Jin Yuan
  • Feng Hou
  • Yangzhou Du
  • Zhongchao Shi
  • Xin Geng
  • Jianping Fan
  • Yong Rui

Domain adaptation (DA) tackles scenarios in which the test data do not fully follow the distribution of the training data, and multi-source domain adaptation (MSDA) is very attractive for real-world applications. By learning from large-scale unlabeled samples, self-supervised learning has become a new trend in deep learning. It is worth noting that self-supervised learning and multi-source domain adaptation share a similar goal: both aim to leverage unlabeled data to learn more expressive representations. Unfortunately, traditional multi-task self-supervised learning faces two challenges: (1) the pretext task may not strongly relate to the downstream task, making it difficult for useful knowledge to be shared from the pretext task to the target task; (2) when the same feature extractor is shared between the pretext and downstream tasks and only the prediction heads differ, it is ineffective at enabling inter-task information exchange and knowledge sharing. To address these issues, we propose a novel Self-Supervised Graph Neural Network (SSG), where a graph neural network is used as a bridge to enable more effective inter-task information exchange and knowledge sharing. A more expressive representation is learned by adopting a mask-token strategy to mask some domain information. Our extensive experiments demonstrate that the proposed SSG method achieves state-of-the-art results on four multi-source domain adaptation datasets, showing its effectiveness from different aspects.

ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph

  • Ho Yin Au
  • Jie Chen
  • Junkun Jiang
  • Yike Guo

Generating dance that temporally and aesthetically matches the music is a challenging problem, as the following factors need to be considered. First, the aesthetic styles and messages conveyed by the motion and music should be consistent. Second, the beats of the generated motion should be locally aligned to the musical features. Finally, basic choreomusical rules should be observed, and the motion generated should be diverse. To address these challenges, we propose ChoreoGraph, which choreographs high-quality dance motion for a given piece of music over a Dynamic Graph. A data-driven learning strategy is proposed to evaluate the aesthetic style and rhythmic connections between music and motion in a progressively learned cross-modality embedding space. The motion sequences are beat-aligned based on the music segments and then incorporated as nodes of a Dynamic Motion Graph. Compatibility factors such as style and tempo consistency, motion context connection, action completeness, and transition smoothness are comprehensively evaluated to determine node transitions in the graph. We demonstrate that our repertoire-based framework can generate motions with aesthetic consistency that are robustly extensible in diversity. Both quantitative and qualitative experimental results show that our proposed model outperforms other baseline models.

Pixelwise Adaptive Discretization with Uncertainty Sampling for Depth Completion

  • Rui Peng
  • Tao Zhang
  • Bing Li
  • Yitong Wang

Image-guided depth completion is an extensively studied multi-modal task that takes sparse measurements and RGB images as input to recover dense depth maps. While the common practice is to regress the depth value over an unbounded range, some recent methods achieve breakthrough performance by discretizing the regression range into a number of discrete depth values, namely Depth Hypotheses, and casting the scalar regression as distribution estimation. However, existing methods employ handcrafted or image-level adaptive discretization strategies, whose generated depth hypotheses are shared across pixels; this cannot adapt to all pixels and is inefficient. In this paper, we are the first to consider the difference between pixels and propose Pixelwise Adaptive Discretization to generate tailored depth hypotheses for each pixel. Meanwhile, we introduce Uncertainty Sampling to generate compact depth hypotheses for easy pixels and loose ones for hard pixels. This per-pixel divide-and-conquer strategy allows the discrete depth hypotheses to be concentrated around the ground truth of each pixel as much as possible, which is the core of discretization methods. Extensive experiments on the outdoor KITTI and indoor NYU Depth V2 datasets show that our model, called PADNet, surpasses the previous state-of-the-art methods even with limited parameters and computational cost.
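
The core idea of pairing per-pixel hypothesis centers with uncertainty-scaled spacing can be sketched as follows (the helper below is a hypothetical illustration of the principle, not the paper's actual sampling rule):

```python
def depth_hypotheses(center, sigma, k=5):
    """Generate k per-pixel depth hypotheses spread symmetrically around
    a predicted center depth, with spacing scaled by the pixel's
    uncertainty sigma: confident (easy) pixels get compact hypotheses,
    uncertain (hard) pixels get loose ones."""
    half = (k - 1) / 2
    return [center + (i - half) * sigma for i in range(k)]

easy = depth_hypotheses(10.0, 0.1)  # compact set for a confident pixel
hard = depth_hypotheses(10.0, 1.0)  # loose set for an uncertain pixel
```

The spread of the easy pixel's hypotheses is ten times tighter than the hard pixel's, concentrating the discrete candidates where the ground truth is likely to lie.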

Robust Diversified Graph Contrastive Network for Incomplete Multi-view Clustering

  • Zhe Xue
  • Junping Du
  • Hai Zhu
  • Zhongchao Guan
  • Yunfei Long
  • Yu Zang
  • Meiyu Liang

Incomplete multi-view clustering is a challenging task which aims to partition unlabeled incomplete multi-view data into several clusters. Existing incomplete multi-view clustering methods neglect the diversified correlations inherent in the data and fail to handle the noise contained in different views. To address these issues, we propose a Robust Diversified Graph Contrastive Network (RDGC) for incomplete multi-view clustering, which integrates multi-view representation learning and diversified graph contrastive regularization into a unified framework. A multi-view unified and specific encoding network is developed to fuse different views into a unified representation, which can flexibly estimate the importance of views for incomplete multi-view data. A robust diversified graph contrastive regularization is proposed to capture the diversified data correlations, improving the discriminating power of the learned representation and reducing the information loss caused by the view-missing problem. Moreover, our method can effectively resist the influence of noise and unreliable views by leveraging a robust contrastive learning loss. Extensive experiments conducted on four multi-view clustering datasets demonstrate the superiority of our method over state-of-the-art methods.

Calibrating Class Weights with Multi-Modal Information for Partial Video Domain Adaptation

  • Xiyu Wang
  • Yuecong Xu
  • Jianfei Yang
  • Kezhi Mao

Assuming the source label space subsumes the target one, Partial Video Domain Adaptation (PVDA) is a more general and practical scenario for cross-domain video classification problems. The key challenge of PVDA is to mitigate the negative transfer caused by the source-only outlier classes. To tackle this challenge, a crucial step is to aggregate target predictions to assign class weights, up-weighting target classes and down-weighting outlier classes. However, incorrect predictions of class weights can mislead the network and lead to negative transfer. Previous works improve class weight accuracy by utilizing temporal features and attention mechanisms, but these methods may fall short in generating accurate class weights when domain shifts are significant, as in most real-world scenarios. To deal with these challenges, we first propose the Multi-modality partial Adversarial Network (MAN), which utilizes multi-scale and multi-modal information to enhance PVDA performance. Based on MAN, we then propose the Multi-modality Cluster-calibrated partial Adversarial Network (MCAN). It utilizes a novel class weight calibration method to alleviate the negative transfer caused by incorrect class weights. Specifically, the calibration method identifies and weighs correct and incorrect predictions using distributional information implied by unsupervised clustering. Extensive experiments are conducted on prevailing PVDA benchmarks, and the proposed MCAN achieves significant improvements compared to state-of-the-art PVDA methods.

Cyclical Fusion: Accurate 3D Reconstruction via Cyclical Monotonicity

  • Duo Chen
  • Zixin Tang
  • Yiguang Liu

Dense correspondence estimation is crucial to RGB-D reconstruction systems. However, projective correspondences are highly unreliable due to sensor depth and pose uncertainties. To tackle this challenge, we introduce a geometry-driven fusion framework, Cyclical Fusion. It pushes correspondence finding forward into 3D space instead of searching for candidates on the 2.5D projective map, and it establishes precise correspondences from coarse to fine. 1) First, the local surface (represented by a voxel) is characterized by a Gaussian distribution. The Karcher-Frechet barycenter is adapted to conduct a robust approximation of the covariance. The metric between distributions is then calculated via the L2-Wasserstein distance, and the corresponding voxel can be discovered through a nearest distribution-to-distribution model. 2) Our method utilizes an effective correspondence verification scheme derived from cyclical monotonicity, related to Rockafellar's theorem. The concept of cyclical monotonicity reveals the geometrical nature of correspondences, and this substantial constraint prevents the correspondences from twisting during the fusion process. Accordingly, precise point-to-point correspondences can be discovered. 3) The advection between correspondences is used to form a smooth manifold under regularization terms. Finally, Cyclical Fusion is integrated into a prototype reconstruction system utilizing multiple streams: depth, pose, RGB, and infrared. Experimental results on different benchmarks and real-world scanning verify the superior performance of the proposed method. Cyclical Fusion accomplishes the most authentic reconstruction where the original projective correspondence-based scheme failed (see Fig. 1). Our new techniques make the reconstruction applicable to multimedia content creation and many other applications.
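
For Gaussians, the L2-Wasserstein distance mentioned above has a closed form. A minimal sketch for the diagonal-covariance case (the full-covariance voxel distributions in the paper require matrix square roots, which this toy omits):

```python
import math

def w2_squared_diag(mu1, var1, mu2, var2):
    """Squared L2-Wasserstein distance between two Gaussians with
    diagonal covariances: the squared distance between means plus the
    sum of squared differences of the per-dimension standard deviations."""
    loc = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    scale = sum((math.sqrt(u) - math.sqrt(v)) ** 2
                for u, v in zip(var1, var2))
    return loc + scale
```

A nearest-neighbor search under this distance then yields the coarse distribution-to-distribution correspondence between voxels.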

Keypoint-Guided Modality-Invariant Discriminative Learning for Visible-Infrared Person Re-identification

  • Tengfei Liang
  • Yi Jin
  • Wu Liu
  • Songhe Feng
  • Tao Wang
  • Yidong Li

The visible-infrared person re-identification (VI-ReID) task aims to retrieve images of pedestrians across cameras with different modalities. In this task, the major challenges arise from two aspects: intra-class variations among images of the same identity, and cross-modality discrepancies between visible and infrared images. Existing methods mainly focus on the latter, attempting to alleviate the impact of the modality discrepancy while ignoring the former issue of identity variations, and thus achieve limited discrimination. To address both aspects, we propose a Keypoint-guided Modality-invariant Discriminative Learning (KMDL) method, which can simultaneously adapt to intra-ID variations and bridge the cross-modality gap. By introducing human keypoints, our method further explores the image space, feature space, and loss constraints to solve the above issues. Specifically, considering the modality discrepancy in the original images, we first design a Hue Jitter Augmentation (HJA) strategy, introducing hue disturbance to alleviate color dependence at the input stage. To obtain discriminative fine-grained representations for retrieval, we design a Global-Keypoint Graph Module (GKGM) in the feature space, which can directly extract keypoint-aligned features and mine relationships within global and keypoint embeddings. Based on these semantic local embeddings, we further propose a Keypoint-Aware Center (KAC) loss that effectively adjusts the feature distribution under the supervision of ID and keypoint labels to learn discriminative representations for matching. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate the effectiveness of our KMDL method.
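
The hue-disturbance idea behind HJA can be sketched with a per-pixel hue shift (an illustrative stand-in; the paper's exact augmentation parameters and randomization are not specified here):

```python
import colorsys

def shift_hue(rgb, shift):
    """Shift the hue of an RGB pixel (channels in [0, 1]) by `shift`
    (a fraction of the hue circle), keeping saturation and value fixed.
    Randomizing `shift` per image gives a simple hue-jitter augmentation:
    perturbing hue discourages the network from over-relying on color
    cues that infrared images cannot provide."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb((h + shift) % 1.0, s, v)

# Pure red shifted a third of the way around the hue circle becomes green.
green = shift_hue((1.0, 0.0, 0.0), 1 / 3)
```

Because only hue is perturbed, the structural content of the image (what the identity-relevant features depend on) is left intact.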

Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction

  • Gang Yang
  • Li Zhang
  • Man Zhou
  • Aiping Liu
  • Xun Chen
  • Zhiwei Xiong
  • Feng Wu

Magnetic resonance imaging (MRI) with high resolution (HR) provides more detailed information for accurate diagnosis and quantitative image analysis. Despite significant advances, most existing super-resolution (SR) reconstruction networks for medical images have two flaws: 1) They are designed as black boxes, thus lacking sufficient interpretability and limiting their practical applications. Interpretable neural network models are of significant interest since they enhance the trustworthiness required in clinical practice when dealing with medical images. 2) Most existing SR reconstruction approaches only use a single contrast or a simple multi-contrast fusion mechanism, neglecting the complex relationships between different contrasts that are critical for SR improvement. To address these issues, in this paper, a novel Model-Guided interpretable Deep Unfolding Network (MGDUN) for medical image SR reconstruction is proposed. The model-guided image SR reconstruction approach solves manually designed objective functions to reconstruct HR MRI. We show how to unfold an iterative MGDUN algorithm into a novel model-guided deep unfolding network by taking the MRI observation matrix and an explicit multi-contrast relationship matrix into account during end-to-end optimization. Extensive experiments on the multi-contrast IXI dataset and the BraTS 2019 dataset demonstrate the superiority of our proposed model.

Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

  • Fei Zhao
  • Chunhui Li
  • Zhen Wu
  • Shangyu Xing
  • Xinyu Dai

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a (text, image) pair. However, dominant works independently model the internal matching relations in a pair of image and text, ignoring the external matching relations between different (text, image) pairs inside the dataset, though such relations are crucial for alleviating image noise in the MNER task. In this paper, we primarily explore two kinds of external matching relations between different (text, image) pairs, i.e., inter-modal relations and intra-modal relations. On this basis, we propose a Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. Specifically, we first construct an inter-modal relation graph and an intra-modal relation graph to gather, from the dataset, the image information most relevant to the current text and image, respectively. Multimodal interaction and fusion are then leveraged to predict the NER label sequences. Extensive experimental results show that our model consistently outperforms state-of-the-art works on two public datasets. Our code and datasets are available at

Multi-directional Knowledge Transfer for Few-Shot Learning

  • Shuo Wang
  • Xinyu Zhang
  • Yanbin Hao
  • Chengbing Wang
  • Xiangnan He

Knowledge transfer-based few-shot learning (FSL) aims at improving the recognition of a novel object under limited training samples by transferring relevant potential knowledge from other data. Most related methods use such knowledge to refine the representation of a novel sample or to enrich the supervision of a classifier during the transfer procedure. However, it is easy to introduce new noise during the transfer calculations since: (1) the unbalanced quantity of samples between the known (base) and novel categories biases the capture of the novel objects' content, and (2) the semantic gaps between different modalities weaken the knowledge interaction during training.

To reduce the influence of these issues in knowledge transfer-based FSL, this paper proposes a multi-directional knowledge transfer (MDKT) method. Specifically, (1) we use two independent unidirectional knowledge self-transfer strategies to calibrate the distributions of the novel categories from the base categories in the visual and textual spaces, yielding transferable knowledge of the base categories to describe a novel category. (2) To reduce the influence of semantic gaps, we first use a bidirectional knowledge connection to exchange knowledge between the visual and textual spaces. We then adopt an online fusion strategy to enhance the expression of textual knowledge and improve the prediction accuracy on the novel categories by combining knowledge from different modalities. Empirical studies on three FSL benchmark datasets demonstrate the effectiveness of MDKT, which improves recognition accuracy on novel categories under limited samples, especially on 1-shot and 2-shot training tasks.

DetFusion: A Detection-driven Infrared and Visible Image Fusion Network

  • Yiming Sun
  • Bing Cao
  • Pengfei Zhu
  • Qinghua Hu

Infrared and visible image fusion aims to utilize the complementary information between the two modalities to synthesize a new image containing richer information. Most existing works have focused on how to better fuse the pixel-level details from both modalities in terms of contrast and texture, yet ignoring the fact that the significance of image fusion is to better serve downstream tasks. For object detection tasks, object-related information in images is often more valuable than focusing on the pixel-level details of images alone. To fill this gap, we propose a detection-driven infrared and visible image fusion network, termed DetFusion, which utilizes object-related information learned in the object detection networks to guide multimodal image fusion. We cascade the image fusion network with the detection networks of both modalities and use the detection loss of the fused images to provide guidance on task-related information for the optimization of the image fusion network. Considering that the object locations provide a priori information for image fusion, we propose an object-aware content loss that motivates the fusion model to better learn the pixel-level information in infrared and visible images. Moreover, we design a shared attention module to motivate the fusion network to learn object-specific information from the object detection networks. Extensive experiments show that our DetFusion outperforms state-of-the-art methods in maintaining pixel intensity distribution and preserving texture details. More notably, the performance comparison with state-of-the-art image fusion methods in task-driven evaluation also demonstrates the superiority of the proposed method. Our code will be available:
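
A toy version of an object-aware content loss, where reconstruction error is up-weighted inside detected object regions, might look like the following (the max-selection target, the flat pixel lists, and the weight `w_obj` are illustrative assumptions, not the paper's formulation):

```python
def object_aware_l1(fused, inf, vis, mask, w_obj=2.0):
    """Toy object-aware content loss on flat pixel lists: the fused
    image is pulled toward the brighter of the two source pixels, and
    the per-pixel error is up-weighted inside object regions (mask == 1)
    so the fusion model prioritizes detection-relevant areas."""
    total = 0.0
    for f, a, b, m in zip(fused, inf, vis, mask):
        target = max(a, b)              # keep the stronger source response
        weight = w_obj if m else 1.0    # emphasize detected object pixels
        total += weight * abs(f - target)
    return total / len(fused)

inf = [0.8, 0.2]
vis = [0.3, 0.6]
mask = [1, 0]  # first pixel lies inside a detected object box
```

The same reconstruction error therefore costs more inside an object region than in the background, which is the mechanism by which detection priors steer the fusion network.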

Sketch Transformer: Asymmetrical Disentanglement Learning from Dynamic Synthesis

  • Cuiqun Chen
  • Mang Ye
  • Meibin Qi
  • Bo Du

Sketch-photo recognition is a cross-modal matching problem whose query sets are sketch images drawn by artists or amateurs. Due to the significant difference between the two modalities, it is challenging to extract discriminative modality-shared feature representations. Existing works focus on exploring modality-invariant features to discover a shared embedding space. However, they discard modality-specific cues, resulting in information loss and diminished discriminative power of the features. This paper proposes a novel asymmetrical disentanglement and dynamic synthesis learning method in a transformer framework (SketchTrans) to handle the modality discrepancy by combining modality-shared information with modality-specific information. Specifically, an asymmetrical disentanglement scheme is introduced to decompose photo features into sketch-relevant and sketch-irrelevant cues while preserving the original sketch structure. Using the sketch-irrelevant cues, we further translate the sketch modality component into a photo representation through knowledge transfer, obtaining cross-modality representations with information symmetry. Moreover, we propose a dynamically updatable auxiliary sketch (A-sketch) modality generated from the photo modality to guide the asymmetrical disentanglement in a single framework. Under a multi-modality joint learning framework, this auxiliary modality increases the diversity of training samples and narrows the cross-modality gap. We conduct extensive experiments on three fine-grained sketch-based retrieval datasets, i.e., PKU-Sketch, QMUL-ChairV2, and QMUL-ShoeV2, outperforming the state of the art under various metrics.

Rethinking the Metric in Few-shot Learning: From an Adaptive Multi-Distance Perspective

  • Jinxiang Lai
  • Siqian Yang
  • Guannan Jiang
  • Xi Wang
  • Yuxi Li
  • Zihui Jia
  • Xiaochen Chen
  • Jun Liu
  • Bin-Bin Gao
  • Wei Zhang
  • Yuan Xie
  • Chengjie Wang

The few-shot learning problem focuses on recognizing unseen classes given a few labeled images. In recent efforts, more attention has been paid to fine-grained feature embedding, ignoring the relationships among different distance metrics. In this paper, for the first time, we investigate the contributions of different distance metrics and propose an adaptive fusion scheme, bringing significant improvements in few-shot classification. We start from a naive baseline of confidence summation and demonstrate the necessity of exploiting the complementary property of different distance metrics. Having identified the competition problem among them, we build upon the baseline and propose an Adaptive Metrics Module (AMM) to decouple metric fusion into metric-prediction fusion and metric-loss fusion. The former encourages mutual complementarity, while the latter alleviates metric competition via multi-task collaborative learning. Based on AMM, we design a few-shot classification framework, AMTNet, including the AMM and a Global Adaptive Loss (GAL), to jointly optimize the few-shot task and an auxiliary self-supervised task, making the embedding features more robust. In experiments, the proposed AMM achieves 2% higher performance than the naive metric fusion module, and our AMTNet outperforms the state of the art on multiple benchmark datasets.
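
A naive metric-fusion baseline of the kind AMM improves upon can be sketched as a fixed weighted sum of two metrics' scores against class prototypes (the choice of cosine plus negated Euclidean distance and the fixed weight `w` are illustrative assumptions; AMM learns the fusion adaptively):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def neg_euclid(a, b):
    """Negated Euclidean distance, so that larger means more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fused_scores(query, protos, w=0.5):
    """Per-class score as a fixed weighted sum of two distance metrics
    to each class prototype -- the confidence-summation style baseline."""
    return [w * cosine(query, p) + (1 - w) * neg_euclid(query, p)
            for p in protos]

protos = [[1.0, 0.0], [0.0, 1.0]]
scores = fused_scores([0.9, 0.1], protos)
```

Because the two metrics rank borderline queries differently, a fixed `w` inevitably favors one of them; this is the competition problem that motivates learning the fusion instead.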

Cross-Modality Domain Adaptation for Freespace Detection: A Simple yet Effective Baseline

  • Yuanbin Wang
  • Leyan Zhu
  • Shaofei Huang
  • Tianrui Hui
  • Xiaojie Li
  • Fei Wang
  • Si Liu

As one of the fundamental functions of an autonomous driving system, freespace detection aims at classifying each pixel of the image captured by the camera as drivable or non-drivable. Current works on freespace detection rely heavily on large amounts of densely labeled training data for accuracy and robustness, which are time-consuming and laborious to collect and annotate. To the best of our knowledge, ours is the first work to explore unsupervised domain adaptation for freespace detection to alleviate the data limitation problem with synthetic data. We develop a cross-modality domain adaptation framework which exploits both RGB images and surface normal maps generated from depth images. A Collaborative Cross Guidance (CCG) module is proposed to leverage the context information of one modality to guide the other in a cross manner, thus realizing inter-modality intra-domain complementarity. To better bridge the domain gap between the source domain (synthetic data) and the target domain (real-world data), we also propose a Selective Feature Alignment (SFA) module which only aligns features of the consistent foreground area between the two domains, thus realizing inter-domain intra-modality adaptation. Extensive experiments are conducted by adapting three different synthetic datasets to one real-world dataset for freespace detection. Our method performs close to fully supervised freespace detection methods (93.08% vs. 97.50% F1 score) and outperforms other general unsupervised domain adaptation methods for semantic segmentation by large margins, which shows the promising potential of domain adaptation for freespace detection.

Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection

  • Jin Xie
  • Rao Muhammad Anwer
  • Hisham Cholakkal
  • Jing Nie
  • Jiale Cao
  • Jorma Laaksonen
  • Fahad Shahbaz Khan

Multispectral pedestrian detection, which enables continuous (day and night) localization of pedestrians, has numerous applications. Existing approaches typically aggregate multispectral features with a simple element-wise operation. However, such a local feature aggregation scheme ignores the rich non-local contextual information. Further, we argue that a tight local correspondence across modalities is desired for multi-modal feature aggregation. To address these issues, we introduce a multispectral pedestrian detection framework built around a novel dynamic cross-modal network (DCMNet), which strives to adaptively utilize the local and non-local complementary information between multi-modal features. The proposed DCMNet consists of a local and a non-local feature aggregation module. The local module employs dynamically learned convolutions to capture locally relevant information across modalities. The non-local module, on the other hand, captures non-local cross-modal information by first projecting features from both modalities into a latent space and then obtaining dynamic latent feature nodes for feature aggregation. Comprehensive experiments are performed on two challenging benchmarks, KAIST and LLVIP. The experiments reveal the benefits of the proposed DCMNet, leading to consistently improved detection performance across diverse detection paradigms and backbones. When using the same backbone, our proposed detector achieves absolute gains of 1.74% and 1.90% over the baseline Cascade R-CNN on the KAIST and LLVIP datasets, respectively.

Two-Stage Multi-Scale Resolution-Adaptive Network for Low-Resolution Face Recognition

  • Haihan Wang
  • Shangfei Wang
  • Lin Fang

Low-resolution face recognition is challenging due to uncertain input resolutions and the lack of distinguishing details in low-resolution (LR) facial images. Resolution-invariant representations must be learned for optimal performance. Existing methods for this task mainly minimize the distance between the representations of LR and corresponding high-resolution (HR) image pairs in a common subspace. However, these works only introduce various distance metrics at the final layer and between HR-LR image pairs; they do not fully utilize the intermediate layers or multi-resolution supervision, yielding only modest performance. In this paper, we propose a novel two-stage multi-scale resolution-adaptive network to learn more robust resolution-invariant representations. In the first stage, structural patterns and semantic patterns are distilled from HR images to provide sufficient supervision for LR images. A curriculum learning strategy facilitates the training of HR and LR image matching, smoothly decreasing the resolution of the LR images. In the second stage, a multi-resolution contrastive loss is introduced on LR images to enforce intra-class clustering and inter-class separation of the LR representations. By introducing multi-scale supervision and multi-resolution LR representation clustering, our network can produce robust representations despite uncertain input sizes. Experimental results on eight benchmark datasets demonstrate the effectiveness of the proposed method. Code will be released at

When True Becomes False: Few-Shot Link Prediction beyond Binary Relations through Mining False Positive Entities

  • Xuan Zhang
  • Xun Liang
  • Xiangping Zheng
  • Bo Wu
  • Yuhui Guo

Recently, the link prediction task on Hyper-relational Knowledge Graphs (HKGs), which aims to predict new facts beyond binary relations, has become a research hotspot. Although previous models have made considerable achievements, three challenges remain: i) previous models neglect the existence of False Positive Entities (FPEs), entities that are true in binary triples yet become false when encountering the query statements of HKGs; ii) due to sparse interactions, the models cannot cope with long-tail hyper-relations, which are ubiquitous in the real world; iii) the models are generally transductive and have difficulty adapting to new hyper-relations. To tackle these issues, we first propose the task of few-shot link prediction on HKGs and devise hyper-relation-aware attention networks with a contrastive loss, which effectively encode all entities including FPEs and increase the distance between true entities and FPEs through contrastive learning. With few-shot references available, the proposed model then learns the representations of their long-tail hyper-relations and predicts new links by calculating the likelihood between queries and references. Furthermore, our model is inductive and scales to any new hyper-relation effortlessly. Since this is the first attempt at few-shot link prediction for HKGs, we also modify existing few-shot learning approaches on binary relational data to work with HKGs as baselines. Experimental results on three real-world datasets show the superiority of our model over various state-of-the-art baselines.

Understanding Political Polarization via Jointly Modeling Users, Connections and Multimodal Contents on Heterogeneous Graphs

  • Hanjia Lyu
  • Jiebo Luo

Understanding political polarization on social platforms is important, as public opinions may become increasingly extreme when circulated in homogeneous communities, potentially causing damage in the real world. Automatically detecting the political ideology of social media users can help better understand political polarization. However, it is challenging due to the scarcity of ideology labels, the complexity of multimodal contents, and the cost of a time-consuming data collection process. Most previous frameworks either focus on unimodal content or do not scale up well. In this study, we adopt a heterogeneous graph neural network to jointly model user characteristics, multimodal post contents, and user-item relations in a bipartite graph to learn a comprehensive and effective user embedding without requiring ideology labels. We apply our framework to online discussions about economy and public health topics. The learned embeddings are then used to detect political ideology and understand political polarization. Our framework outperforms the unimodal, early/late fusion, and homogeneous GNN baselines by a margin of at least 9% absolute gain in the area under the receiver operating characteristic curve on two social media datasets. More importantly, our work does not require a time-consuming data collection process, which allows faster detection and in turn allows policymakers to conduct analysis and design policies in time to respond to crises. We also show that our framework learns meaningful user embeddings and can help better understand political polarization. Notable differences in user descriptions, topics, images, and levels of retweet/quote activities are observed. Our framework for decoding user-content interaction shows wide applicability in understanding political polarization. Furthermore, it can be extended to user-item bipartite information networks for other applications such as content and product recommendation.

SESSION: Oral Session XI: Understanding Multimedia Content -- Vision and Language

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

  • Yupan Huang
  • Tengchao Lv
  • Lei Cui
  • Yutong Lu
  • Furu Wei

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at

Reducing the Vision and Language Bias for Temporal Sentence Grounding

  • Daizong Liu
  • Xiaoye Qu
  • Wei Hu

Temporal sentence grounding (TSG) is an important yet challenging task in multimedia information retrieval. Although previous TSG methods have achieved decent performance, they tend to capture the selection biases of frequently appearing video-query pairs in the dataset rather than exhibit robust multimodal reasoning abilities, especially for rarely appearing pairs. In this paper, we study this issue of selection biases and accordingly propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both the vision and language modalities to enhance the model's generalization ability. Specifically, we propose to alleviate the issue from two perspectives: 1) Feature distillation. We build a multi-modal debiasing branch to first capture the vision and language biases, and then apply a bias identification module to explicitly recognize the true negative biases and remove them from the benign multi-modal representations. 2) Contrastive sample generation. We construct two types of negative samples to force the model to accurately learn the aligned multi-modal semantics and perform complete semantic reasoning. We apply the proposed model to both commonly and rarely appearing TSG cases, and demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets (ActivityNet Caption, TACoS, and Charades-STA).

Face Forgery Detection via Symmetric Transformer

  • Luchuan Song
  • Xiaodan Li
  • Zheng Fang
  • Zhenchao Jin
  • YueFeng Chen
  • Chenliang Xu

Deep learning-based face forgery detection is a novel yet challenging task. Although impressive results have been achieved, existing methods still have limitations. For example, previous methods struggle to maintain consistent predictions across consecutive frames, even when all of those frames are actually forged. We propose a symmetric transformer for channel and spatial feature extraction, motivated by the observation that the channel and spatial features of a robust forgery detector should be consistent in the temporal domain. The symmetric transformer adopts newly-designed attention-based strategies that treat channel variance and spatial gradients as the vital features, which greatly improves the robustness of deepfake video detection. Moreover, this symmetric structure acts on temporal and spatial features respectively, ensuring the robustness of detection from two different aspects. Our symmetric transformer is an end-to-end optimized network. Experiments are conducted under various settings; the proposed method achieves significant improvements in prediction robustness and performs better than state-of-the-art methods on different datasets.

End-to-End Compound Table Understanding with Multi-Modal Modeling

  • Zaisheng Li
  • Yi Li
  • Qiao Liang
  • Pengfei Li
  • Zhanzhan Cheng
  • Yi Niu
  • Shiliang Pu
  • Xi Li

Tables are a widely used data form in webpages, spreadsheets, and PDFs for organizing and presenting structured data. Although studies on table structure recognition have been successfully used to convert image-based tables into digital structured formats, solving many real problems still relies on further understanding of the table, such as cell relationship extraction. The current datasets related to table understanding are all based on the digital format. To boost research development, we release a new benchmark named ComFinTab with rich annotations that support both table recognition and understanding tasks. Unlike previous datasets containing only basic tables, ComFinTab contains a large ratio of compound tables, which are much more challenging and require methods that use multiple information sources. Based on the dataset, we also propose a uniform, concise task form with an evaluation metric to better evaluate a model's performance on the table understanding task in compound tables. Finally, a framework named CTUNet is proposed to integrate the visual, semantic, and position features with a graph attention network, which can solve the table recognition task and the challenging table understanding task as a whole. Experimental results compared with previous advanced table understanding methods demonstrate the effectiveness of our proposed model. Code and dataset are available at

Modality Eigen-Encodings Are Keys to Open Modality Informative Containers

  • Yiyuan Zhang
  • Yuqi Ji

Vision-language fusion relies heavily on precise cross-modal information synergy. Nevertheless, modality divergence makes mutual description with the other modality extremely difficult. Despite various attempts to tap into the semantic unity of vision and language, most existing approaches utilize modality-specific features via high-dimensional tensors as the smallest unit of information, limiting the interactivity of fine-grained multi-modal fusion. Furthermore, in previous works, cross-modal interaction is commonly depicted by the similarity between semantically insufficient global features. In contrast, we propose a novel scheme for multi-modal fusion named Vision Language Interaction (VLI). To represent more fine-grained and flexible modality information, we consider high-dimensional features as containers of modality-specific information, while the homogeneous semantic information between heterogeneous modalities is the key stored in the containers. We first construct information containers via multi-scale alignment and then utilize modality eigen-encodings to extract the homogeneous semantics at the vector level. Finally, we iteratively embed the eigen-encodings of one modality into the eigen-encodings of the other modality to perform cross-modal semantic interaction. After this embedding interaction, vision and language information can break the existing representation bottleneck at a level of representation granularity never achieved in previous work. Extensive experimental results on vision-language tasks validate the effectiveness of VLI. On the three benchmarks of Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Visual Question Answering (VQA), VLI significantly outperforms existing state-of-the-art methods.

Visual Knowledge Graph for Human Action Reasoning in Videos

  • Yue Ma
  • Yali Wang
  • Yue Wu
  • Ziyu Lyu
  • Siran Chen
  • Xiu Li
  • Yu Qiao

Action recognition has traditionally been treated as a high-level video classification problem. However, such a treatment lacks the detailed and semantic understanding of body movement, which is the critical knowledge needed to explain and infer complex human actions. To fill this gap, we propose to summarize a novel visual knowledge graph from over 15M detailed human annotations, describing an action as the distinct composition of body parts, part movements, and interactive objects in videos. Based on it, we design a generic multi-modal Action Knowledge Understanding (AKU) framework, which can progressively infer human actions from body part movements in videos, with the assistance of visual-driven semantic knowledge mining. Finally, we validate AKU on the recent Kinetics-TPS benchmark, which contains body part parsing annotations for detailed understanding of human action in videos. The results show that our AKU significantly boosts various video backbones with explainable action knowledge in both supervised and few-shot settings, and outperforms the recent knowledge-based action recognition framework; e.g., our AKU achieves 83.9% accuracy on Kinetics-TPS while PaStaNet achieves 63.8% accuracy under the same backbone. The codes and models will be released at

Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog

  • Feilong Chen
  • Duzhen Zhang
  • Xiuyi Chen
  • Jing Shi
  • Shuang Xu
  • Bo XU

Visual dialog requires models to give reasonable answers according to a series of coherent questions and related visual concepts in images. However, most current work either focuses on attention-based fusion or pre-training on large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). Firstly, AlignVD utilizes a visual and a dialog encoder to represent images and dialogs. Then, it explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA). Specifically, UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct alignment. Finally, based on the aligned visual and textual representations, AlignVD gives a reasonable answer to the question via the cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets have demonstrated the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.

You Can even Annotate Text with Voice: Transcription-only-Supervised Text Spotting

  • Jingqun Tang
  • Su Qiao
  • Benlei Cui
  • Yuhang Ma
  • Sheng Zhang
  • Dimitrios Kanoulas

End-to-end scene text spotting has recently gained great attention in the research community. The majority of existing methods rely heavily on the location annotations of text instances (e.g., word-level boxes, word-level masks, and char-level boxes). We demonstrate that scene text spotting can be accomplished solely via text transcription, significantly reducing the need for costly location annotations. We propose a query-based paradigm to learn implicit location features via the interaction of text queries and image embeddings. These features are then made explicit during the text recognition stage via an attention activation map. Due to the difficulty of training the weakly-supervised model from scratch, we address the issue of model convergence via a circular curriculum learning strategy. Additionally, we propose a coarse-to-fine cross-attention localization mechanism for more precisely locating text instances. Notably, we provide a solution for text spotting via audio annotation, which further reduces the time required for annotation. Moreover, it establishes a link between audio, text, and image modalities in scene text spotting. Using only transcription annotations as supervision on both real and synthetic data, we achieve competitive results on several popular scene text benchmarks. The proposed method offers a reasonable trade-off between model accuracy and annotation time, allowing simplification of large-scale text spotting applications.

Inferential Visual Question Generation

  • Chao Bi
  • Shuhui Wang
  • Zhe Xue
  • Shengbo Chen
  • Qingming Huang

The task of Visual Question Generation (VQG) aims to generate natural language questions for images. Many methods regard it as a reverse Visual Question Answering (VQA) task: they train a data-driven generator on VQA datasets, which makes it hard to obtain questions that can challenge machines and humans. Other methods rely heavily on elaborate but expensive manual preprocessing. To overcome these limitations, we propose a method to generate inferential questions from an image with noisy captions. Our method first introduces a core scene graph generation module, which aligns text features and salient visual features to the initial scene graph. It constructs a special core scene graph by expanding linkage outward from the high-confidence nodes hop by hop. Next, a question generation module uses the core scene graph as a basis to instantiate the function templates, resulting in questions with varying inferential paths. Experiments show that the visual questions generated by our method are controllable in both content and difficulty, and demonstrate clear inferential properties. In addition, since the salient regions, captions, and function templates can be replaced by human-customized ones, our method has strong scalability and potential for more interactive applications. Finally, we use our method to automatically build a new dataset, InVQA, containing about 120k images and 480k question-answer pairs, to facilitate the development of more versatile VQA models.

A Baseline for Detecting Out-of-Distribution Examples in Image Captioning

  • Gal Shalev
  • Gabi Shalev
  • Joseph Keshet

Image captioning research achieved breakthroughs in recent years by developing neural models that can generate diverse and high-quality descriptions for images drawn from the same distribution as the training images. However, when facing out-of-distribution (OOD) images, such as corrupted images or images containing unknown objects, the models fail to generate relevant captions.

In this paper, we consider the problem of OOD detection in image captioning. We formulate the problem and suggest an evaluation setup for assessing the model's performance on the task. Then, we analyze and show the effectiveness of the caption's likelihood score at detecting and rejecting OOD images, which implies that the relatedness between the input image and the generated caption is encapsulated within the score.
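The likelihood-based rejection rule described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical captioning model that exposes the per-token probabilities it assigned to its own generated caption; the paper's exact scoring and thresholding may differ:

```python
import numpy as np

def caption_likelihood_score(token_probs):
    """Length-normalized log-likelihood of a generated caption.

    token_probs: per-token probabilities the captioning model assigned
    to its own generated tokens (hypothetical interface).
    """
    log_probs = np.log(np.asarray(token_probs, dtype=float))
    return log_probs.mean()  # higher score = more in-distribution

def is_ood(token_probs, threshold=-2.0):
    # Flag the input image as OOD when the caption's likelihood score
    # falls below a threshold tuned on a validation set.
    return caption_likelihood_score(token_probs) < threshold
```

Confident captions (high per-token probabilities) yield scores near zero and are accepted, while hesitant captions on corrupted or unfamiliar images yield strongly negative scores and are rejected.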

Proxy Probing Decoder for Weakly Supervised Object Localization: A Baseline Investigation

  • Jingyuan Xu
  • Hongtao Xie
  • Chuanbin Liu
  • Yongdong Zhang

Weakly supervised object localization (WSOL) aims to localize objects with only image category labels. Existing methods generally fine-tune the models with manually selected training epochs and subjective loss functions to mitigate the partial activation problem of classification-based models. However, such a fine-tuning scheme causes the model to degrade, e.g., it harms the classification performance and generalization capabilities of the pre-trained model. In this paper, we propose a novel method named Proxy Probing Decoder (PPD) to meet these challenges, which utilizes the segmentation property of the self-attention map in a self-supervised vision transformer and avoids model fine-tuning via a novel proxy probing decoder. Specifically, we utilize the self-supervised vision transformer to capture long-range dependencies and avoid partial activation. Then we simply adopt a proxy consisting of a series of decoding layers to transform the feature representations into a heatmap of the objects' foreground and conduct localization. The backbone parameters are frozen during training while the proxy is used to decode the features and localize the object. In this way, the vision transformer model can maintain its feature representation capabilities and only the proxy is required for adapting to the task. Without bells and whistles, our framework achieves 55.0% Top-1 Loc on the ILSVRC2012 dataset and 78.8% Top-1 Loc on the CUB-200-2011 dataset, which surpasses the state of the art by a large margin and provides a simple baseline. Codes and models will be available on Github.

Target-Driven Structured Transformer Planner for Vision-Language Navigation

  • Yusheng Zhao
  • Jinyu Chen
  • Chen Gao
  • Wenguan Wang
  • Lirong Yang
  • Haibing Ren
  • Huaxia Xia
  • Si Liu

Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves the success rate of the previous best methods by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

  • Xingchen Li
  • Long Chen
  • Wenbo Ma
  • Yi Yang
  • Jun Xiao

Recently, increasing efforts have been focused on Weakly Supervised Scene Graph Generation (WSSGG). Mainstream solutions for WSSGG typically follow the same pipeline: they first align text entities in the weak image-level supervisions (e.g., unlocalized relation triplets or captions) with image regions, and then train SGG models in a fully-supervised manner with the aligned instance-level "pseudo" labels. However, we argue that most existing WSSGG works only focus on object-consistency, which means the grounded regions should have the same object category label as the text entities, while neglecting another basic requirement for an ideal alignment: interaction-consistency, which means the grounded region pairs should have the same interactions (i.e., visual relations) as the text entity pairs. Hence, in this paper, we propose to enhance a simple grounding module with both object-aware and interaction-aware knowledge to acquire more reliable pseudo labels. To better leverage these two types of knowledge, we regard them as two teachers and fuse their generated targets to guide the training process of our grounding module. Specifically, we design two different strategies to adaptively assign weights to the teachers by assessing their reliability on each training sample. Extensive experiments demonstrate that our method consistently improves WSSGG performance under various kinds of weak supervision.

Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition

  • Mingkun Yang
  • Minghui Liao
  • Pu Lu
  • Jing Wang
  • Shenggao Zhu
  • Hualin Luo
  • Qi Tian
  • Xiang Bai

Existing text recognition methods usually need large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between synthetic data and real data, which limits the performance of text recognition models. Recent self-supervised text recognition methods attempt to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of text images. Inspired by the observation that humans learn to recognize texts through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, which imitates the reading behavior of humans. Meanwhile, masked image modeling is introduced to text recognition for the first time to learn the context generation of text images, which is similar to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by 5.3% on average across 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with notable performance gains.
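The contrastive ("reading") branch follows the standard contrastive-learning recipe. Below is a minimal NumPy sketch of the common InfoNCE objective, assuming two batches of L2-normalized embeddings from differently augmented views of the same text images; the paper's actual loss, augmentations, and architecture may differ:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    # anchors, positives: (N, D) L2-normalized embeddings of two
    # augmented views of the same N text images (hypothetical names).
    # Row i of `positives` is the positive for row i of `anchors`;
    # every other row serves as an in-batch negative.
    logits = anchors @ positives.T / temperature      # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob.diagonal().mean()                # matched pairs on the diagonal
```

When each anchor is closest to its own positive, the diagonal dominates each softmax row and the loss approaches zero; confusable pairs push the loss up.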

Hierarchical Walking Transformer for Object Re-Identification

  • Xudong Tian
  • Jun Liu
  • Zhizhong Zhang
  • Chengjie Wang
  • Yanyun Qu
  • Yuan Xie
  • Lizhuang Ma

Recently, the transformer, based purely on the attention mechanism, has been applied to a wide range of tasks and achieved impressive performance. Although extensive efforts have been made, there are still drawbacks to the transformer architecture that hinder its further application: (i) the quadratic complexity brought by the attention mechanism; (ii) the lack of incorporated inductive bias.

In this paper, we present a new hierarchical walking attention, which provides a scalable, flexible, and interpretable sparsification strategy to reduce the complexity from quadratic to linear, and meanwhile evidently boost the performance. Specifically, we learn a hierarchical structure by splitting an image with different receptive fields. We associate each high-level region with a supernode, and inject supervision with prior knowledge into this node. The supernode then acts as an indicator to decide whether this area should be skipped, thereby avoiding massive numbers of unnecessary dot-product terms in attention. Two sparsification phases are finally introduced, allowing the transformer to achieve strictly linear complexity. Extensive experiments are conducted to demonstrate the superior performance and efficiency against state-of-the-art methods. Significantly, our method sharply reduces the inference time and the total number of tokens by 28% and 94%, respectively, and brings a 2.6% Rank-1 improvement on MSMT17.
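The region-skipping idea can be illustrated with a toy single-query attention that only computes dot-products for tokens in regions the supernode indicator keeps. The function name and the keep/skip interface here are hypothetical, and the paper's actual two-phase sparsification is far more elaborate:

```python
import numpy as np

def walking_attention(q, k, v, region_ids, keep_regions):
    # q: (D,) query; k, v: (N, D) keys/values; region_ids: (N,) region
    # index of each token; keep_regions: indices of regions whose
    # supernode decided the area should NOT be skipped (hypothetical
    # interface). Dot-products are computed only for kept tokens, so
    # skipped regions contribute no attention terms at all.
    keep = np.isin(region_ids, list(keep_regions))
    scores = q @ k[keep].T / np.sqrt(q.shape[0])  # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v[keep]
```

If the indicators keep only a constant number of regions per query, the per-query cost no longer grows with the total token count, which is the source of the claimed complexity reduction.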

Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation

  • Siying Wu
  • Xueyang Fu
  • Feng Wu
  • Zheng-Jun Zha

Vision-and-Language Navigation requires an agent to navigate to a target location by progressively grounding and following the relevant instruction, conditioned on its memory and current observation. Existing works utilize cross-modal transformers to pass messages between the visual and textual modalities. However, they remain limited in mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress achieved by large-scale pre-training methods, in this paper we propose CSAP, a new method of Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. It is designed to learn the alignment from trajectory-instruction pairs through two novel tasks: trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Specifically, the trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information to reconstruct the masked fragment. The contrastive semantic-alignment modeling is designed to align the visual representation with corresponding phrase embeddings. Experimental results on the benchmark dataset demonstrate that a transformer-based navigation agent pre-trained with our proposed CSAP outperforms existing methods on both SR and SPL scores.

RONF: Reliable Outlier Synthesis under Noisy Feature Space for Out-of-Distribution Detection

  • Rundong He
  • Zhongyi Han
  • Xiankai Lu
  • Yilong Yin

Out-of-distribution (OOD) detection is fundamental to guaranteeing the reliability of multimedia applications during deployment in the open world. However, due to the lack of supervision signals from OOD data, current models easily output overconfident predictions on OOD data during the inference phase. Several previous methods rely on large-scale auxiliary OOD datasets for model regularization. However, obtaining suitable and clean large-scale auxiliary OOD datasets is usually challenging. In this paper, we present <u>R</u>eliable <u>O</u>utlier synthesis under <u>N</u>oisy <u>F</u>eature space (RONF), which synthesizes reliable virtual outliers in a noisy feature space to provide supervision signals for model regularization. Specifically, RONF first introduces a novel virtual outlier synthesis strategy, <u>B</u>oundary <u>F</u>eature <u>M</u>ixup (BFM), which mixes up samples from the low-likelihood region of the class-conditional distribution in the feature space. However, the feature space is noisy due to spurious features, which causes unreliable outlier synthesis. To mitigate this problem, RONF then introduces <u>O</u>ptimal <u>P</u>arameter <u>L</u>earning (OPL) to obtain desirable features and remove spurious features. In addition, RONF proposes a provable and effective scoring function called <u>E</u>nergy with <u>E</u>nergy <u>D</u>iscrepancy (EED) for the uncertainty measurement of OOD data. Extensive studies on several representative datasets of multimedia applications show that RONF outperforms the state of the art remarkably.
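The free-energy score that energy-based OOD detection builds on is standard, so it can be sketched compactly; note that EED's discrepancy term itself is omitted here, and this is not the paper's exact formulation:

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    # Free energy of classifier logits: E(x) = -T * logsumexp(f(x) / T).
    # Lower (more negative) energy indicates in-distribution data;
    # a threshold on this score separates ID from OOD inputs.
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max()  # shift for numerical stability
    return -temperature * (m + np.log(np.exp(z - m).sum()))
```

A confidently classified input (one dominant logit) has much lower energy than an input whose logits are flat, which is why thresholding the energy flags the latter as OOD.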

ConceptBeam: Concept Driven Target Speech Extraction

  • Yasunori Ohishi
  • Marc Delcroix
  • Tsubasa Ochiai
  • Shoko Araki
  • Daiki Takeuchi
  • Daisuke Niizumi
  • Akisato Kimura
  • Noboru Harada
  • Kunio Kashino

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.

Query-driven Generative Network for Document Information Extraction in the Wild

  • Haoyu Cao
  • Xin Li
  • Jiefeng Ma
  • Deqiang Jiang
  • Antai Guo
  • Yiqing Hu
  • Hao Liu
  • Yinsong Liu
  • Bo Ren

This paper focuses on solving the Document Information Extraction (DIE) in the wild problem, which has rarely been explored before. In contrast to existing studies mainly tailored to documents with known templates, predefined layouts and keys, and ideal OCR-error-free input, we aim to build a more practical DIE paradigm for real-world scenarios where input document images may contain unknown layouts and keys along with problematic OCR results. To achieve this goal, we propose a novel architecture, termed Query-driven Generative Network (QGN), which is equipped with two consecutive modules, i.e., the Layout Context-aware Module (LCM) and the Structured Generation Module (SGM). Given a document image with unseen layouts and keys, the LCM yields value prefix candidates that serve as query prompts for the SGM to generate the final key-value pairs even under OCR noise. To further investigate the potential of our method, we create a new large-scale dataset, named LArge-scale STructured Documents (LastDoc4000), containing 4,000 documents with 1,511 layouts and 3,500 different keys. In experiments, we demonstrate that our QGN consistently achieves the best F1-score on the new LastDoc4000 dataset, with up to 30.32% absolute improvement. A more comprehensive experimental analysis and experiments on other public benchmarks also verify the effectiveness and robustness of our proposed method for the wild DIE task.

SPTS: Single-Point Text Spotting

  • Dezhi Peng
  • Xinyu Wang
  • Yuliang Liu
  • Jiaxin Zhang
  • Mingxin Huang
  • Songxuan Lai
  • Jing Li
  • Shenggao Zhu
  • Dahua Lin
  • Chunhua Shen
  • Xiang Bai
  • Lianwen Jin

Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that scene text spotting models can be trained with an extremely low-cost annotation of a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, and achieves state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that it can be annotated much more easily, or even generated automatically, than a bounding box, which requires precise positions. We believe that such a pioneering attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible. The code is available at
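The abstract casts detection and recognition as predicting one sequence of discrete tokens. A minimal sketch of how such a target sequence might be built from a (point, transcription) annotation (the bin count, charset, and token layout below are illustrative assumptions, not the paper's exact scheme):

```python
# Serialize single-point text annotations into a discrete token sequence
# that an auto-regressive Transformer could be trained to predict.
# N_BINS, CHARSET, and the token layout are hypothetical choices.

N_BINS = 1000                      # quantization bins for x / y coordinates
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_OFFSET = N_BINS               # character tokens follow coordinate tokens
EOS = CHAR_OFFSET + len(CHARSET)   # end-of-sequence token

def quantize(v, size):
    """Map a pixel coordinate to a discrete bin in [0, N_BINS)."""
    return min(int(v / size * N_BINS), N_BINS - 1)

def instance_to_tokens(point, text, img_w, img_h):
    """Serialize one (point, transcription) annotation into tokens."""
    x, y = point
    tokens = [quantize(x, img_w), quantize(y, img_h)]
    tokens += [CHAR_OFFSET + CHARSET.index(c) for c in text.lower()]
    return tokens

def image_to_sequence(instances, img_w, img_h):
    """Concatenate all instances and terminate with EOS."""
    seq = []
    for point, text in instances:
        seq += instance_to_tokens(point, text, img_w, img_h)
    return seq + [EOS]
```

Because only two coordinate tokens per instance are needed (instead of four or more for a box), the annotation cost drops sharply, which is the point the abstract emphasizes.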

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

  • Yiyang Ma
  • Huan Yang
  • Bei Liu
  • Jianlong Fu
  • Jiaying Liu

AI illustrator aims to automatically design visually appealing images for books to provoke rich thoughts and emotions. To achieve this goal, we propose a framework for translating raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the complexity of the semantics of raw descriptions, which may be hard to visualize (e.g., "gloomy" or "Asian"); such descriptions usually pose challenges for existing methods. To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) that leverages two powerful pre-trained models, CLIP and StyleGAN. Our framework consists of two components: a projection module from text embeddings to image embeddings based on prompts, and an adapted image generation module built on StyleGAN which takes image embeddings as inputs and is trained with combined semantic consistency losses. To bridge the gap between realistic images and illustration designs, we further adopt a stylization model as post-processing in our framework for better visual effects. Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training. Furthermore, we have built a benchmark that consists of 200 descriptions from literature books and online resources. We conduct a user study to demonstrate our superiority over competing text-to-image translation methods on descriptions with complicated semantics.

Purifier: Plug-and-play Backdoor Mitigation for Pre-trained Models Via Anomaly Activation Suppression

  • Xiaoyu Zhang
  • Yulin Jin
  • Tao Wang
  • Jian Lou
  • Xiaofeng Chen

Pre-trained models have been widely adopted in deep learning development, benefiting the fine-tuning of downstream user-specific tasks with enormous computation savings. However, backdoor attacks pose a severe security threat to the subsequent models built upon compromised pre-trained models, which calls for effective countermeasures to mitigate the backdoor threat before deploying the victim models to safety-critical applications. This paper proposes Purifier: a novel backdoor mitigation framework for pre-trained models via suppressing anomaly activation. Purifier is motivated by the observation that, for backdoor triggers, anomaly activation patterns exist across different perspectives (e.g., channel-wise, cube-wise, and feature-wise), featuring different degrees of granularity. More importantly, choosing to suppress at the right granularity is vital to robustness and accuracy. To this end, Purifier is capable of defending against diverse types of backdoor triggers without any prior knowledge of the backdoor attacks, while also being convenient and flexible during deployment, i.e., plug-and-play. Extensive experimental results show that, against a series of state-of-the-art mainstream attacks, Purifier performs better than the state-of-the-art methods in terms of both defense effectiveness and model inference accuracy on clean examples. Our code and Appendix can be found in

C3CMR: Cross-Modality Cross-Instance Contrastive Learning for Cross-Media Retrieval

  • Junsheng Wang
  • Tiantian Gong
  • Zhixiong Zeng
  • Changchang Sun
  • Yan Yan

Cross-modal retrieval is an essential area of representation learning, which aims to retrieve instances with the same semantics from different modalities. In practice, a key challenge for cross-modal retrieval is to narrow the heterogeneity gap between different modalities and obtain modality-invariant and discriminative features. Typically, existing approaches mainly learn inter-modal invariance and focus on how to combine pair-level and class-level losses, which cannot effectively and adequately learn discriminative features. To address these issues, in this paper we propose a novel Cross-Modality Cross-Instance Contrastive Learning for Cross-Media Retrieval (C3CMR) method. Specifically, to fully exploit intra-modal similarities, we introduce intra-modal contrastive learning to enhance the discriminative power of the unimodal features. Besides, we design a supervised inter-modal contrastive learning scheme to take full advantage of label semantic associations. In this way, cross-semantic associations and inter-modal invariance can be further learned. Moreover, to remedy the locally suboptimal semantic similarity caused by mining only pairwise and triplet-wise sample relationships, we propose cross-instance contrastive learning to mine the similarities among multiple instances. Comprehensive experimental results on four widely-used benchmark datasets demonstrate the superiority of our proposed method over several state-of-the-art cross-modal retrieval methods.
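The supervised inter-modal contrastive scheme mentioned above can be illustrated with a small numpy sketch (our own generic formulation, not the paper's exact loss): for each image anchor, every text embedding sharing its class label counts as a positive, so label semantics, and not only paired instances, shape the shared space.

```python
import numpy as np

def sup_inter_modal_contrastive(img, txt, labels, tau=0.1):
    """Supervised inter-modal contrastive loss, one value per image anchor."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T / tau                     # cosine similarity / temperature
    # row-wise log-softmax over all text candidates
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :])  # label-sharing mask = positives
    # mean log-probability of the positives for each anchor (negated)
    return -(logp * pos).sum(axis=1) / pos.sum(axis=1)
```

Correctly matched image-text batches should yield a lower loss than mismatched ones, which is the signal that pulls same-label pairs together across modalities.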

Progressive Attribute Embedding for Accurate Cross-modality Person Re-ID

  • Aihua Zheng
  • Peng Pan
  • Hongchao Li
  • Chenglong Li
  • Bin Luo
  • Chang Tan
  • Ruoran Jia

Attributes are important information for bridging the appearance gap across modalities, but they have not been well explored in cross-modality person ReID. This paper proposes a progressive attribute embedding module (PAE) to effectively fuse fine-grained semantic attribute information with global structural visual information. Through a novel cascaded scheme, we use attribute information to learn the relationship between person images in different modalities, which significantly relieves the modality heterogeneity. Meanwhile, by embedding attribute information to guide the generation of more discriminative image features, it simultaneously reduces the inter-class similarity and the intra-class discrepancy. In addition, we propose an attribute-based auxiliary learning strategy (AAL) to supervise the network to learn modality-invariant and identity-specific local features via joint attribute and identity classification losses. The PAE and AAL are jointly optimized in an end-to-end framework, namely the progressive attribute embedding network (PAENet). One can plug PAE and AAL into current mainstream models, as we do in five cross-modality person ReID frameworks to further boost their performance. Extensive experiments on public datasets demonstrate the effectiveness of the proposed method against the state-of-the-art cross-modality person ReID methods.

Class Discriminative Adversarial Learning for Unsupervised Domain Adaptation

  • Lihua Zhou
  • Mao Ye
  • Xiatian Zhu
  • Shuaifeng Li
  • Yiguang Liu

As a state-of-the-art family of Unsupervised Domain Adaptation (UDA), bi-classifier adversarial learning methods are formulated in an adversarial (minimax) learning framework with a single feature extractor and two classifiers. Model training alternates between two steps: (I) constraining the learning of the two classifiers to maximize the prediction discrepancy of unlabeled target domain data, and (II) constraining the learning of the feature extractor to minimize this discrepancy. Despite being an elegant formulation, this approach has a fundamental limitation: Maximizing and minimizing the classifier discrepancy is not class discriminative for the target domain, finally leading to a suboptimal adapted model. To solve this problem, we propose a novel Class Discriminative Adversarial Learning (CDAL) method characterized by discovering class discrimination knowledge and leveraging this knowledge to discriminatively regulate the classifier discrepancy constraints on-the-fly. This is realized by introducing an evaluation criterion for judging each classifier's capability and each target domain sample's feature reorientation via objective loss reformulation. Extensive experiments on three standard benchmarks show that our CDAL method yields new state-of-the-art performance. Our code is made available at
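The classifier discrepancy that steps (I) and (II) alternately maximize and minimize has a standard class-agnostic form: the L1 distance between the two classifiers' softmax outputs on a target sample. A numpy sketch of that baseline quantity (shown as background; CDAL's contribution is to regulate this discrepancy class-discriminatively, which is not reproduced here):

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classifier_discrepancy(logits1, logits2):
    """Mean L1 distance between the two classifiers' predictions.
    Step I maximizes this w.r.t. the classifiers on target data;
    step II minimizes it w.r.t. the feature extractor."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return np.abs(p1 - p2).sum(axis=-1).mean()
```

Because this quantity ignores which classes disagree, it is not class discriminative for the target domain, which is exactly the limitation the abstract identifies.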

Background Layout Generation and Object Knowledge Transfer for Text-to-Image Generation

  • Zhuowei Chen
  • Zhendong Mao
  • Shancheng Fang
  • Bo Hu

Text-to-Image generation (T2I) aims to generate realistic and semantically consistent images according to natural language descriptions. Built upon recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images shows two major limitations: 1) the background (e.g., fence, lake) of a generated image depicting a complicated, real-world scene tends to be unrealistic; 2) the object (e.g., elephant, zebra) in the generated image often presents a highly distorted shape or missing key parts. To address these limitations, we propose a two-stage T2I approach, where the first stage redesigns the text-to-layout process to incorporate the background layout alongside the existing object layout, and the second stage transfers object knowledge from an existing class-to-image model to the layout-to-image process to improve object fidelity. Specifically, a transformer-based architecture is introduced as the layout generator to learn the mapping from text to the layout of objects and background, and a Text-attended Layout-aware feature Normalization (TL-Norm) is proposed to adaptively transfer the object knowledge to image generation. Benefiting from the background layout and transferred object knowledge, the proposed approach significantly surpasses previous state-of-the-art methods on the image quality metric and achieves superior image-text alignment performance.

Towards Further Comprehension on Referring Expression with Rationale

  • Rengang Li
  • Baoyu Fan
  • Xiaochuan Li
  • Runze Zhang
  • Zhenhua Guo
  • Kun Zhao
  • Yaqian Zhao
  • Weifeng Gong
  • Endong Wang

Referring Expression Comprehension (REC) is an important research branch of visual grounding, where the goal is to localize a relevant object in an image given a textual expression that exactly describes a specific object. However, existing REC tasks aim at text content filtering and image object locating, and are evaluated based on the precision of the detection boxes. This may allow models to skip the learning of multimodal comprehension entirely and still achieve good performance. In this paper, we work on enabling an artificial agent to understand referring expressions further and propose a more comprehensive task, called Further Comprehension on Referring Expression (FREC). In this task, we mainly focus on three sub-tasks: 1) correcting the erroneous text expression based on visual information; 2) generating the rationale for the input expression; 3) localizing the proper object based on the corrected expression. Accordingly, we build a new dataset named Further-RefCOCOs, based on the RefCOCO, RefCOCO+ and RefCOCOg benchmark datasets, for this new task and make it publicly available. We then design a novel end-to-end pipeline to tackle these sub-tasks simultaneously. The experimental results demonstrate the validity of the proposed pipeline. We believe this work will motivate more researchers to explore this direction and promote the development of visual grounding.

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

  • Mengqi Huang
  • Zhendong Mao
  • Penghui Wang
  • Quan Wang
  • Yongdong Zhang

Text-to-image generation aims at generating realistic images which are semantically consistent with the given text. Previous works mainly adopt a multi-stage architecture that stacks generator-discriminator pairs to engage in multiple rounds of adversarial training, where the text semantics used to provide generation guidance remain static across all stages. This work argues that the text features at each stage should be adaptively re-composed conditioned on the status of the historical stage (i.e., the historical stage's text and image features) to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We thereby propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) to re-compose each stage's text features under a novel single adversarial multi-stage architecture. Specifically, we design (1) a Dynamic Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, and then dynamically selects the words required to be re-composed at each stage and re-composes them by dynamically enhancing or suppressing the semantics of subspaces at different granularities; and (2) a Single Adversarial Multi-stage Architecture (SAMA), which extends the previous structure by eliminating the complicated multiple adversarial training requirements, therefore allowing more stages of text-image interaction, and finally facilitates the DSE module. We conduct comprehensive experiments and show that DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks, i.e., CUB-200 and MSCOCO, respectively.

Synthesizing Counterfactual Samples for Effective Image-Text Matching

  • Hao Wei
  • Shuhui Wang
  • Xinzhe Han
  • Zhe Xue
  • Bin Ma
  • Xiaoming Wei
  • Xiaolin Wei

Image-text matching is a fundamental research topic bridging vision and language. Recent works use hard negative mining to capture the multiple correspondences between the visual and textual domains. Unfortunately, truly informative negative samples are quite sparse in the training data and hard to obtain from a randomly sampled mini-batch. Motivated by causal inference, we aim to overcome this shortcoming by carefully analyzing the analogy between hard negative mining and causal effect optimization. Further, we propose the Counterfactual Matching (CFM) framework for more effective image-text correspondence mining. CFM contains three major components, i.e., Gradient-Guided Feature Selection for automatic causal factor identification, Self-Exploration for causal factor completeness, and Self-Adjustment for counterfactual sample synthesis. Compared with traditional hard negative mining, our method largely alleviates over-fitting and effectively captures the fine-grained correlations between the image and text modalities. We evaluate our CFM in combination with three state-of-the-art image-text matching architectures. Quantitative and qualitative experiments conducted on two publicly available datasets demonstrate its strong generality and effectiveness. Code is available at:
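The in-batch hard negative mining that the abstract contrasts itself with can be sketched as the standard max-of-hinges triplet loss (shown here as background; CFM instead synthesizes counterfactual negatives rather than relying on whatever a random batch happens to contain):

```python
import numpy as np

def hardest_negative_triplet_loss(sim, margin=0.2):
    """sim[i, j]: similarity of image i with text j; the diagonal holds
    the matched pairs. For each anchor, only the hardest in-batch
    negative contributes to the hinge loss."""
    n = sim.shape[0]
    pos = np.diag(sim)                  # matched image-text scores
    mask = ~np.eye(n, dtype=bool)       # exclude the positive pair
    neg = np.where(mask, sim, -np.inf)
    hard_i2t = neg.max(axis=1)          # hardest text for each image
    hard_t2i = neg.max(axis=0)          # hardest image for each text
    loss = np.maximum(0, margin + hard_i2t - pos) + \
           np.maximum(0, margin + hard_t2i - pos)
    return loss.mean()
```

When the batch contains no negative within the margin of a positive pair, the loss is zero and that batch contributes no gradient, which illustrates why sparse informative negatives are a problem for this scheme.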

Fine-tuning with Multi-modal Entity Prompts for News Image Captioning

  • Jingjing Zhang
  • Shancheng Fang
  • Zhendong Mao
  • Zhiwei Zhang
  • Yongdong Zhang

News Image Captioning aims to generate descriptions for images embedded in news articles, which involve plentiful real-world concepts, especially named entities. However, existing methods are limited by entity-level templates. Not only is it labor-intensive to craft such templates, but they are also error-prone due to their local entity-awareness, which constrains the prediction output only at each language model decoding step and thus corrupts entity relationships. To overcome this problem, we investigate a concise and flexible paradigm that achieves global entity-awareness by introducing a prompting mechanism with fine-tuned pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode the article and image simultaneously. Moreover, leveraging the BART architecture, we can easily adopt an end-to-end fashion. Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and the inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation so that it focuses on entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods on general caption metrics but also achieves significant performance in the precision and recall of various named entities.

Rethinking the Reference-based Distinctive Image Captioning

  • Yangjun Mao
  • Long Chen
  • Zhihong Jiang
  • Dong Zhang
  • Zhimeng Zhang
  • Jian Shao
  • Jun Xiao

Distinctive Image Captioning (DIC) --- generating distinctive captions that describe the unique details of a target image --- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC), so that the generated captions can tell apart the target and reference images. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains the objects "towel'' and "toilet'' while all reference images are without them, then the simple caption "A bathroom with a towel and a toilet'' is distinctive enough to tell apart the target and reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.

A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

  • Alex Falcon
  • Giuseppe Serra
  • Oswald Lanz

Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase performance on unseen test examples by creating new training samples with semantics-preserving techniques, such as color space or geometric transformations on images. Yet, these techniques are usually applied on raw data, leading to more resource-demanding solutions that also require the raw data to be shareable, which may not always be the case, e.g. due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieve considerable improvements over a baseline method and improved state-of-the-art performance, and perform multiple ablation studies. We release code and pretrained models on GitHub at\_VideoRetrieval.

SESSION: Poster Session XI: Understanding Multimedia Content -- Vision and Language

MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning

  • Zejun Li
  • Zhihao Fan
  • Huaixiao Tou
  • Jingjing Chen
  • Zhongyu Wei
  • Xuanjing Huang

Previous vision-language pre-training models mainly construct multi-modal inputs from tokens and objects (pixels) and then perform cross-modality interaction between them. We argue that an input of only tokens and object features limits high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and able to facilitate representation learning synergistically. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease the learning from multi-modal multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, while the second enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language model tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two more tasks in the second stage to explicitly encourage multi-level alignments across modalities. Our model achieves state-of-the-art results on several vision and language tasks.

Combining Vision and Language Representations for Patch-based Identification of Lexico-Semantic Relations

  • Prince Jha
  • Gaël Dias
  • Alexis Lechervy
  • Jose G. Moreno
  • Anubhav Jangra
  • Sebastião Pais
  • Sriparna Saha

Although a wide range of applications have been proposed in the field of multimodal natural language processing, very few works have tackled multimodal relational lexical semantics. In this paper, we propose the first attempt to identify lexico-semantic relations with visual clues, which embody linguistic phenomena such as synonymy, co-hyponymy or hypernymy. While traditional methods take advantage of the paradigmatic approach and/or the distributional hypothesis, we hypothesize that visual information can supplement the textual information, relying on the apperceptum subcomponent of the semiotic textology linguistic theory. For that purpose, we automatically extend two gold-standard datasets with visual information, and develop different fusion techniques to combine textual and visual modalities following the patch-based strategy. Experimental results over the multimodal datasets show that the visual information can supplement the missing semantics of textual encodings with reliable performance improvements.

Multi-Attention Network for Compressed Video Referring Object Segmentation

  • Weidong Chen
  • Dexiang Hong
  • Yuankai Qi
  • Zhenjun Han
  • Shuhui Wang
  • Laiyun Qing
  • Qingming Huang
  • Guorong Li

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows inference down. This may hamper applications in real-world, resource-limited scenarios such as autonomous cars and drones. To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at:

Cross-modal Co-occurrence Attributes Alignments for Person Search by Language

  • Kai Niu
  • Linjiang Huang
  • Yan Huang
  • Peng Wang
  • Liang Wang
  • Yanning Zhang

Person search by language refers to retrieving the interested pedestrian images based on a free-form natural language description, which has important applications in smart video surveillance. Although great efforts have been made to align images with sentences, the challenge of reporting bias, i.e., attributes being only partially matched across modalities, still incurs large noise and seriously hampers accurate retrieval. To address this challenge, we propose a novel cross-modal matching method named Cross-modal Co-occurrence Attributes Alignments (C2A2), which can better deal with noise and obtain significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries relying on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather the pixels of an image and the words of a sentence under the guidance of the learned attribute dictionaries, to adaptively constitute more discriminative co-occurrence attributes in both modalities. The re-gathered co-occurrence attributes are carefully captured by imposing explicit cross-modal one-to-one alignments which consider relations across modalities, better alleviating the noise from non-corresponding attributes. The whole C2A2 method can be trained end-to-end without any pre-processing, i.e., requiring negligible additional computation overhead. It significantly outperforms the existing solutions, and finally achieves new state-of-the-art retrieval performance on two large-scale benchmarks, the CUHK-PEDES and RSTPReid datasets.

RefCrowd: Grounding the Target in Crowd with Referring Expressions

  • Heqian Qiu
  • Hongliang Li
  • Taijin Zhao
  • Lanxiao Wang
  • Qingbo Wu
  • Fanman Meng

Crowd understanding has aroused widespread interest in the vision domain due to its important practical significance. Unfortunately, there has been no effort to explore crowd understanding in the multi-modal domain that bridges natural language and computer vision. Referring expression comprehension (REF) is such a representative multi-modal task. Current REF studies focus more on grounding the target object from multiple distinctive categories in general scenarios, which is difficult to apply to complex real-world crowd understanding. To fill this gap, we propose a new challenging dataset, called RefCrowd, which aims at finding the target person in a crowd with referring expressions. It not only requires sufficiently mining natural language information, but also requires carefully focusing on the subtle differences between the target and a crowd of persons with similar appearance, so as to realize fine-grained mapping from language to vision. Furthermore, we propose a Fine-grained Multi-modal Attribute Contrastive Network (FMAC) to deal with REF in crowd understanding. It first decomposes the intricate visual and language features into attribute-aware multi-modal features, and then captures discriminative yet robust fine-grained attribute features to effectively distinguish the subtle differences between similar persons. The proposed method outperforms existing state-of-the-art (SoTA) methods on our RefCrowd dataset and existing REF datasets. In addition, we implement an end-to-end REF toolbox for deeper research in the multi-modal domain. Our dataset and code are available at:

Unified Normalization for Accelerating and Stabilizing Transformers

  • Qiming Yang
  • Kai Zhang
  • Chaoxiang Lan
  • Zhi Yang
  • Zheyang Li
  • Wenming Tan
  • Jun Xiao
  • Shiliang Pu

Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost robustness. However, LN requires on-the-fly statistics calculation at inference, as well as division and square root operations, leading to inefficiency on hardware. What is more, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations while achieving performance on par with LN. UN strives to boost performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training; its effectiveness is theoretically proved and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN through extensive experiments on language and vision tasks. Besides, we evaluate the efficiency of our method on GPU: Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at
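Why can a fixed-statistics normalization be "fused with other linear operations" while LN cannot? With mu and sigma frozen at inference (as in BN-style schemes, unlike LN's per-token statistics), gamma*(W@h + b - mu)/sigma + beta collapses into a single affine map W'@h + b', removing the division and square root from the inference path. A numerical check of that identity (generic illustration, not UN's actual calibration strategy):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W, b = rng.normal(size=(d_out, d_in)), rng.normal(size=d_out)
mu, sigma = rng.normal(size=d_out), rng.uniform(0.5, 2.0, size=d_out)
gamma, beta = rng.normal(size=d_out), rng.normal(size=d_out)

def linear_then_norm(h):
    """Linear layer followed by normalization with frozen statistics."""
    x = W @ h + b
    return gamma * (x - mu) / sigma + beta

# Fused weights: one matmul at inference, no division or square root.
W_fused = (gamma / sigma)[:, None] * W
b_fused = gamma * (b - mu) / sigma + beta

def fused(h):
    return W_fused @ h + b_fused
```

LN cannot be folded this way because its mu and sigma are recomputed from each token's activations at inference time, which is the hardware cost the abstract highlights.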

Enhancing Semi-Supervised Learning with Cross-Modal Knowledge

  • Hui Zhu
  • Yongchun Lu
  • Hongbin Wang
  • Xunyi Zhou
  • Qin Ma
  • Yanhong Liu
  • Ning Jiang
  • Xin Wei
  • Linchengxi Zeng
  • Xiaofang Zhao

Semi-supervised learning (SSL), which leverages a small amount of labeled data that relies on expert knowledge and a large amount of easily accessible unlabeled data, has made rapid progress recently. However, in pre-existing SSL approaches the information comes from a single modality and the corresponding labels are one-hot, which can easily lead to deficient supervision, omission of information and unsatisfactory results, especially when more categories and fewer labeled samples are involved. In this paper, we propose a novel method to further enhance SSL by introducing semantic modal knowledge, which contains the word embeddings of class labels and the semantic hierarchy among classes. The former helps retain more potential information and almost quantitatively reflects the similarities and differences between categories. The latter encourages the model to construct the classification boundary from simple to complex, and thus improves the generalization ability of the model. Comprehensive experiments and ablation studies are conducted on commonly-used datasets to demonstrate the effectiveness of our method.

Dynamic Spatio-Temporal Modular Network for Video Question Answering

  • Zi Qian
  • Xin Wang
  • Xuguang Duan
  • Hong Chen
  • Wenwu Zhu

Video Question Answering (VideoQA) aims to understand given videos and questions comprehensively by generating correct answers. However, existing methods usually rely on end-to-end black-box deep neural networks to infer the answers, which differs significantly from human logical reasoning and thus lacks explainability. Besides, the performance of existing methods tends to drop when answering compositional questions involving realistic scenarios. To tackle these challenges, we propose a Dynamic Spatio-Temporal Modular Network (DSTN) model, which utilizes a spatio-temporal modular network to simulate the compositional reasoning procedure of human beings. Concretely, we divide the task of answering a given question into a set of sub-tasks focusing on certain key concepts in questions and videos, such as objects, actions, and temporal orders. Each sub-task can be solved with a separately designed module, e.g., a spatial attention module, temporal attention module, logic module, or answer module. We then dynamically assemble modules assigned to different sub-tasks to generate a tree-structured spatio-temporal modular neural network for human-like reasoning before producing the final answer to the question. We carry out extensive experiments on the AGQA dataset to demonstrate that our proposed DSTN model significantly outperforms several baseline methods in various settings. Moreover, we evaluate intermediate results and visualize each reasoning step to verify the rationality of the different modules and the explainability of the proposed DSTN model.

Micro-video Tagging via Jointly Modeling Social Influence and Tag Relation

  • Xiao Wang
  • Tian Gan
  • Yinwei Wei
  • Jianlong Wu
  • Dai Meng
  • Liqiang Nie

The last decade has witnessed the proliferation of micro-videos on various user-generated content platforms. According to our statistics, around 85.7% of micro-videos lack annotation. In this paper, we focus on annotating micro-videos with tags. Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relations. Meanwhile, existing tag-relation construction methods suffer from either deficient performance or low tag coverage. To jointly model social influence and tag relations, we formulate micro-video tagging as a link prediction problem in a constructed heterogeneous network. Specifically, the tag relation (represented by a tag ontology) is constructed in a semi-supervised manner. Then, we combine the tag relation, video-tag annotations, and user follow relation to build the network. Afterward, better video and tag representations are derived through Behavior Spread modeling and visual and linguistic knowledge aggregation. Finally, the semantic similarity between each micro-video and all candidate tags is calculated in this video-tag network. Extensive experiments on industrial datasets of three verticals verify the superiority of our model compared with several state-of-the-art baselines.

MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

  • Qiang Zhou
  • Chaohui Yu
  • Hao Luo
  • Zhibin Wang
  • Hao Li

Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), which requires the target model to recover the masked part of the input image. Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially compared to those from contrastive-learning pre-training. This inspires us to ask whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to utilize different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses.

Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo only needs 100 epochs of pre-training to achieve 82.53% top-1 finetuning accuracy on Imagenet-1K, which outperforms the state-of-the-art self-supervised learning counterparts.

LS-GAN: Iterative Language-based Image Manipulation via Long and Short Term Consistency Reasoning

  • Gaoxiang Cong
  • Liang Li
  • Zhenhuan Liu
  • Yunbin Tu
  • Weijun Qin
  • Shenyuan Zhang
  • Chengang Yan
  • Wenyu Wang
  • Bin Jiang

Iterative language-based image manipulation aims to edit images step by step according to the user's linguistic instructions. Existing methods mostly focus on aligning the attributes and appearance of newly added visual elements with the current instruction. However, they fail to maintain consistency between instructions and images as the number of iterative rounds increases. To address this issue, we propose a novel Long and Short term consistency reasoning Generative Adversarial Network (LS-GAN), which enhances awareness of previous objects given the current instruction and better maintains consistency with the user's intent over continuous iterations. Specifically, we first design a Context-aware Phrase Encoder (CPE) to learn the user's intention by extracting different phrase-level information from the instruction. Further, we introduce a Long and Short term Consistency Reasoning (LSCR) mechanism: long-term reasoning improves the model's semantic understanding and positional reasoning, while short-term reasoning ensures the ability to construct visual scenes based on linguistic instructions. Extensive experimental results show that LS-GAN improves generation quality in terms of both object identity and position, and achieves state-of-the-art performance on two public datasets.

Multimodal Hate Speech Detection via Cross-Domain Knowledge Transfer

  • Chuanpeng Yang
  • Fuqing Zhu
  • Guihua Liu
  • Jizhong Han
  • Songlin Hu

Nowadays, the diffusion of hate speech combining text and images on social networks has overtaken text-only diffusion, raising a pressing need for multimodal hate speech detection. Current research on this task mainly focuses on the construction of multimodal models without considering the influence of the unbalanced and widely distributed samples of the various attack categories in hate speech. In this situation, introducing enhanced knowledge is necessary for understanding the attack category of hate speech comprehensively. Given the high correlation between the hate speech detection and sarcasm detection tasks, this paper makes an initial attempt at common knowledge transfer between the two, where hate speech detection and sarcasm detection are defined as the primary and auxiliary tasks, respectively. A scalable cross-domain knowledge transfer (CDKT) framework is proposed, in which a mainstream vision-language transformer can be employed flexibly as the backbone. Three modules are included, bridging the semantic, definition and domain gaps between the primary and auxiliary tasks simultaneously. Specifically, the semantic adaptation module formulates the irrelevant parts between image and text in the primary and auxiliary tasks, and disentangles them from the text representation to align the visual and word tokens. The definition adaptation module assigns different weights to the training samples of the auxiliary task by measuring the correlation between samples of the auxiliary and primary tasks. The domain adaptation module minimizes the feature-distribution gap between samples of the two tasks. Extensive experiments show that the proposed CDKT provides a stable improvement over baselines and produces competitive performance compared with existing multimodal hate speech detection methods.
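The domain adaptation module minimizes a feature-distribution gap between the two tasks; the abstract does not name the discrepancy measure, but a common choice for this kind of gap is Maximum Mean Discrepancy (MMD). A minimal linear-kernel sketch, offered only as an illustrative stand-in:

```python
import numpy as np

def mmd_linear(X, Y):
    """Linear-kernel MMD^2: squared distance between the mean feature
    embeddings of two sample batches (e.g., primary vs. auxiliary task)."""
    delta = X.mean(axis=0) - Y.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(0)
same = mmd_linear(rng.normal(size=(500, 16)), rng.normal(size=(500, 16)))
shifted = mmd_linear(rng.normal(size=(500, 16)), rng.normal(loc=1.0, size=(500, 16)))
assert same < shifted  # matched distributions yield a much smaller gap
```

Minimizing such a term over learned features pulls the two tasks' feature distributions together.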

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

  • Zhiyuan Ma
  • Jianjun Li
  • Guohui Li
  • Kaiyan Huang

With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention and remarkable progress has been made. The success of VLP largely benefits from the information complementation and enhancement between different modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL) to promote image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart, which ignores the natural asymmetry between modalities and requires a large-scale image-text corpus to achieve arduous progress. To mitigate this predicament, we propose CMAL, a <u>C</u>ross-<u>M</u>odal <u>A</u>ssociative <u>L</u>earning framework with anchor points detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer to perform anchor point masking and swapped feature filling for constructing a hybrid cross-modal associative prompt. Afterwards, we exploit a unified semantic encoder to learn their cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn potential associative mappings between modalities at anchor points, within which we develop a fresh self-supervised associative mapping classification task to boost CMAL's performance. Experimental results verify the effectiveness of CMAL, showing that it achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks, with a significantly smaller corpus. Notably, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

  • Xujie Zhang
  • Yu Sha
  • Michael C. Kampffmeyer
  • Zhenyu Xie
  • Zequn Jie
  • Chengwen Huang
  • Jianqing Peng
  • Xiaodan Liang

Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning texture and garment information is not sufficient to capture the semantics of the visual information, and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage, and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage. Contrary to prior approaches that also rely on two-stage paradigms, ARMANI introduces textual tokens into the codebook, making it possible for the model to utilize fine-grained semantic information to generate more realistic images. Further, by introducing a cross-modal Transformer, ARMANI is versatile and can accomplish image synthesis from various control signals, such as pure text, sketch images, and partial images. Extensive experiments conducted on our newly collected cross-modal fashion dataset demonstrate that ARMANI generates photo-realistic images in diverse synthesis tasks and outperforms existing state-of-the-art cross-modal image synthesis approaches. Our code is available at

Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization

  • Daizong Liu
  • Wei Hu

This paper addresses the problem of natural language video localization (NLVL). Almost all existing works follow the "only look once" framework, exploiting a single model to directly capture the complex cross- and self-modal relations among video-query pairs and retrieve the relevant segment. However, we argue that these methods have overlooked two indispensable characteristics of an ideal localization method: 1) Frame-differentiable: considering the imbalance of positive/negative video frames, it is effective to highlight positive frames and weaken negative ones during localization. 2) Boundary-precise: to predict the exact segment boundary, the model should capture fine-grained differences between consecutive frames, since their variations are often smooth. To this end, inspired by how humans perceive and localize a segment, we propose a two-step human-like framework called Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module. The SL module first refers to the query semantics and selects the best-matched frame from the video while filtering out irrelevant frames. Then, the BP module constructs an initial segment based on this frame, and dynamically updates it by exploring adjacent frames until no frame shares the same activity semantics. Experimental results on three challenging benchmarks show that our SLP is superior to the state-of-the-art methods and localizes more precise segment boundaries.
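The skim-then-peruse procedure can be sketched with hypothetical per-frame query-matching scores and adjacent-frame similarities; both are stand-ins for the paper's learned quantities, and the threshold rule is a simplification of the "same activity semantic" test:

```python
def localize(scores, sims, thresh=0.5):
    """Skim: pick the frame best matching the query; peruse: grow the segment
    bi-directionally while adjacent frames stay semantically similar.
    scores[i]: query-frame matching score; sims[i]: similarity of frame i to i+1."""
    best = max(range(len(scores)), key=scores.__getitem__)
    start = end = best
    while start > 0 and sims[start - 1] > thresh:   # extend leftwards
        start -= 1
    while end < len(scores) - 1 and sims[end] > thresh:  # extend rightwards
        end += 1
    return start, end

print(localize([0.1, 0.2, 0.9, 0.4, 0.1], [0.2, 0.8, 0.7, 0.3]))  # -> (1, 3)
```

Frame 2 matches best; the segment grows to frames 1..3 and stops where adjacent similarity drops.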

Distance Matters in Human-Object Interaction Detection

  • Guangzhi Wang
  • Yangyang Guo
  • Yongkang Wong
  • Mohan Kankanhalli

Human-Object Interaction (HOI) detection has received considerable attention in the context of scene understanding. Despite growing progress, we find that existing methods often perform unsatisfactorily on distant interactions, for two main reasons: 1) Distant interactions are by nature more difficult to recognize than close ones. A natural scene often involves multiple humans and objects with intricate spatial relations, so interaction recognition for distant human-object pairs is strongly affected by complex visual context. 2) An insufficient number of distant interactions in datasets results in under-fitting on these instances. To address these problems, we propose a novel two-stage method for better handling distant interactions in HOI detection. One essential component of our method is a novel Far Near Distance Attention module. It enables information propagation between humans and objects, whereby the spatial distance is skillfully taken into consideration. Besides, we devise a novel Distance-Aware loss function which leads the model to focus more on distant yet rare interactions. We conduct extensive experiments on the HICO-DET and V-COCO datasets. The results show that the proposed method surpasses existing methods significantly, leading to new state-of-the-art results.

Token Embeddings Alignment for Cross-Modal Retrieval

  • Chen-Wei Xie
  • Jianmin Wu
  • Yun Zheng
  • Pan Pan
  • Xian-Sheng Hua

Cross-modal retrieval has achieved significant progress in recent years with the help of token-embedding interaction methods. Most existing methods first extract an embedding for each token of the input image and text, then feed the token-level embeddings into a multi-modal transformer to learn a joint representation, which can be used to predict the matching score between the input image and text. However, these methods do not explicitly supervise the alignment between visual and textual tokens. In this paper, we propose a novel Token Embeddings AlignMent (TEAM) block, which first explicitly aligns visual and textual tokens, and then produces token-level matching scores to measure the fine-grained similarity between the input image and text. TEAM achieves new state-of-the-art performance on commonly used cross-modal retrieval benchmarks. Moreover, TEAM is interpretable, and we provide visualization experiments to show how it works. Finally, we construct a new billion-scale vision-language pre-training dataset in Chinese, which is the largest Chinese vision-language pre-training dataset so far. After pre-training on this dataset, our framework also achieves state-of-the-art performance on Chinese cross-modal retrieval benchmarks.

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA

  • Zan-Xia Jin
  • Mike Zheng Shou
  • Fang Zhou
  • Satoshi Tsutsui
  • Jingyan Qin
  • Xu-Cheng Yin

Text-based Visual Question Answering (Text-VQA) is a question-answering task for understanding scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, the text from OCR systems often includes spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method that alleviates OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. Specifically, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between OCR tokens and words in a dictionary. Second, assuming that the majority of characters in misspelled OCR tokens are still correct, a multimodal transformer is proposed and fine-tuned to predict the answer using character-based word embeddings. Specifically, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even with misspelled OCR tokens. A variety of experimental evaluations show that our method outperforms the state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at
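The Levenshtein-based pairing of OCR tokens with dictionary words can be illustrated with a standard edit-distance implementation; the tiny dictionary and nearest-word recovery rule below are simplified stand-ins for the paper's TWC setup, not its actual pipeline:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_word(token: str, vocab: list) -> str:
    """Recover the dictionary word closest to a (possibly misspelled) OCR token."""
    return min(vocab, key=lambda w: levenshtein(token, w))

vocab = ["pepsi", "cola", "sprite", "fanta"]
print(nearest_word("peosi", vocab))  # -> pepsi
```

Since "peosi" differs from "pepsi" by a single substitution, the edit distance pairs them up, which is exactly the signal used to construct misspelled training variants of dictionary words.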

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

  • Xinyu Huang
  • Youcai Zhang
  • Ying Cheng
  • Weiwei Tian
  • Ruiwei Zhao
  • Rui Feng
  • Yuejie Zhang
  • Yaqian Li
  • Yandong Guo
  • Xiaobo Zhang

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to utilize additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting model capacity. Inspired by the observation that the texts incorporate incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets at a small extra computational cost.
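Extracting image tags from the paired text can be as simple as vocabulary matching. The sketch below is a minimal illustration only; the tag vocabulary and the matching rule are assumptions, since the abstract does not specify how IDEA performs the extraction:

```python
def extract_tags(text: str, tag_vocab: set) -> list:
    """Pull candidate image tags out of the paired caption by matching
    its (lowercased, punctuation-stripped) words against a tag vocabulary."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return sorted(words & tag_vocab)

print(extract_tags("A brown dog chases a red ball.", {"dog", "ball", "cat"}))
# -> ['ball', 'dog']
```

The extracted tags then serve as multi-label targets that can be optimized jointly with the usual VLP objectives.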

CLOP: Video-and-Language Pre-Training with Knowledge Regularizations

  • Guohao Li
  • Hu Yang
  • Feng He
  • Zhifan Feng
  • Yajuan Lyu
  • Hua Wu
  • Haifeng Wang

Video-and-language pre-training has shown promising results for learning generalizable representations. Most existing approaches model video and text in an implicit manner, without considering explicit structural representations of the multi-modal content. We denote such representations as structural knowledge, which expresses rich semantics at multiple granularities. Related works have proposed object-aware approaches to inject similar knowledge as inputs. However, existing methods usually fail to effectively utilize such knowledge as regularizations to shape a superior cross-modal representation space. To this end, we propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with knowledge regularizations. Our method has two key designs: 1) a simple yet effective Structural Knowledge Prediction (SKP) task to pull together the latent representations of similar videos; and 2) a novel Knowledge-guided sampling approach for Contrastive Learning (KCL) to push apart cross-modal hard negative samples. We evaluate our method on four text-video retrieval tasks and one multi-choice QA task. The experiments show clear improvements, outperforming prior works by a substantial margin. Besides, we provide ablations and insights into how our methods affect the latent representation space, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.
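The contrastive side of KCL (pushing apart hard negatives) can be sketched with a standard InfoNCE loss; the knowledge-guided selection shown here, ranking candidates by structural-tag overlap, is a hypothetical stand-in for the paper's sampling scheme:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive loss: pull the anchor toward its positive and push it
    away from the sampled negatives (cosine similarity, temperature tau)."""
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

def hard_negatives(anchor_tags, candidates, k=2):
    """Hypothetical knowledge-guided sampling: candidates sharing more
    structural tags with the anchor are treated as harder negatives."""
    return sorted(candidates, key=lambda c: len(anchor_tags & c["tags"]), reverse=True)[:k]

anchor = np.array([1.0, 0.0])
loss_easy = info_nce(anchor, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
loss_hard = info_nce(anchor, np.array([1.0, 0.0]), [np.array([0.9, 0.1])])
assert loss_easy < loss_hard  # harder negatives yield a stronger training signal
```

Selecting negatives that are knowledge-similar but non-matching raises the loss, which is precisely what makes such samples informative during pre-training.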

Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks

  • Yudong Li
  • Xianxu Hou
  • Zhe Zhao
  • Linlin Shen
  • Xuefeng Yang
  • Kimmo Yan

Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them using face datasets collected in the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g., text-guided face synthesis, face captioning and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters and objectives. While text and facial images are tokenized into sequences, the face annotation labels for different tasks are also converted to natural language for a unified representation. We collect a set of 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than SOTA approaches. On age estimation and multi-attribute classification, our model reaches performance competitive with models specially designed and trained for these particular tasks. In practice, our model is much easier to deploy for different facial analysis tasks. Code and dataset will be available at

TxVAD: Improved Video Action Detection by Transformers

  • Zhenyu Wu
  • Zhou Ren
  • Yi Wu
  • Zhangyang Wang
  • Gang Hua

Video action detection aims to localize persons in both space and time from video sequences and recognize their actions. Most existing methods are composed of many specialized components, e.g., pretrained person/object detectors, region proposal networks (RPN), and memory banks. This paper proposes a conceptually simple paradigm for video action detection using Transformers, which effectively removes the need for specialized components and achieves superior performance. Our proposed Transformer-based Video Action Detector (TxVAD) utilizes two Transformers to capture scene context and long-range spatio-temporal context, for person localization and action classification, respectively. Through extensive experiments on four public datasets, AVA, AVA-Kinetics, JHMDB-21, and UCF101-24, we show that our conceptually simple paradigm achieves state-of-the-art performance on the video action detection task, without using pre-trained person/object detectors, RPN, or a memory bank.

Relational Representation Learning in Visually-Rich Documents

  • Xin Li
  • Yan Zheng
  • Yiqing Hu
  • Haoyu Cao
  • Yunfei Wu
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren

Relational understanding is critical for a number of visually-rich document (VRD) understanding tasks. Through multi-modal pre-training, recent studies provide comprehensive contextual representations and exploit them as prior knowledge for downstream tasks. In spite of their impressive results, we observe that the widespread relational hints (e.g., the relation of key/value fields on receipts) built upon contextual knowledge have not yet been excavated. To close this gap, we propose DocReL, a Document Relational Representation Learning framework. The major challenge for DocReL lies in the variety of relations: from the simplest pairwise relation to complex global structure, supervised training is infeasible because the definition of a relation varies, and even conflicts, across tasks. To deal with this unpredictable definition of relations, we propose a novel contrastive learning task named Relational Consistency Modeling (RCM), which harnesses the fact that existing relations should remain consistent across differently augmented positive views. RCM provides relational representations that are better matched to the needs of downstream tasks, even without any knowledge of the exact definition of a relation. DocReL achieves better performance on a wide variety of VRD relational understanding tasks, including table structure recognition, key information extraction and reading order detection.

Unified Multimodal Model with Unlikelihood Training for Visual Dialog

  • Zihao Wang
  • Junli Wang
  • Changjun Jiang

The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs, and fails to exploit useful knowledge from negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation by multi-task learning, our model extends ViLBERT from supporting only answer discrimination to handling both answer discrimination and answer generation seamlessly via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks to implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we exploit unlikelihood training on negative instances to make the model less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than just for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score). Our model also yields discriminative results comparable with the state-of-the-art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
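The unlikelihood objective on negative instances can be sketched at the token level as follows; this is a minimal numpy version of the general unlikelihood-training formulation, and UniMM-UL's exact weighting and sequence handling are not given in the abstract:

```python
import numpy as np

def likelihood_loss(step_probs, answer_ids):
    """Standard per-token negative log-likelihood on a positive (correct) answer."""
    return -float(np.mean([np.log(p[t]) for p, t in zip(step_probs, answer_ids)]))

def unlikelihood_loss(step_probs, negative_ids):
    """Unlikelihood term on a negative (incorrect) answer: penalize the
    probability mass the model assigns to each token of the wrong answer."""
    return -float(np.mean([np.log(1.0 - p[t]) for p, t in zip(step_probs, negative_ids)]))

# toy 3-word vocabulary, one decoding step
p = [np.array([0.7, 0.2, 0.1])]
assert unlikelihood_loss(p, [0]) > unlikelihood_loss(p, [2])  # likelier wrong token is punished more
```

Minimizing the unlikelihood term drives the model's probability for incorrect-answer tokens toward zero, complementing the likelihood objective on correct answers.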

Tackling Instance-Dependent Label Noise with Dynamic Distribution Calibration

  • Manyi Zhang
  • Yuxin Ren
  • Zihao Wang
  • Chun Yuan

Instance-dependent label noise is realistic but rather challenging, as the label-corruption process depends directly on the instances. It causes a severe distribution shift between the distributions of training and test data, which impairs the generalization of trained models. Prior works have put great effort into tackling this issue; unfortunately, they either rely heavily on strong assumptions or remain heuristic without theoretical guarantees. In this paper, to address the distribution shift in learning with instance-dependent label noise, we adopt a dynamic distribution-calibration strategy. Specifically, we hypothesize that, before the training data are corrupted by label noise, each class conforms to a multivariate Gaussian distribution at the feature level; label noise produces outliers that shift this Gaussian distribution. During training, to calibrate the shifted distribution, we propose two methods based on the mean and covariance of the multivariate Gaussian distribution, respectively. The mean-based method works in a recursive dimension-reduction manner for robust mean estimation, and is theoretically guaranteed to train a high-quality model against label noise. The covariance-based method works in a distribution-disturbance manner, which is experimentally verified to improve model robustness. We demonstrate the utility and effectiveness of our methods on datasets with synthetic label noise and real-world unknown noise.
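The mean-based calibration relies on robust mean estimation under outliers. The paper's recursive dimension-reduction estimator is more involved; as a minimal stand-in, a coordinate-wise trimmed mean already illustrates why robust estimation recovers the clean class mean when label noise injects shifted samples:

```python
import numpy as np

def trimmed_mean(X, trim=0.1):
    """Coordinate-wise trimmed mean: drop the most extreme samples in each
    dimension before averaging, to resist outliers injected by label noise."""
    n = X.shape[0]
    k = int(n * trim)
    Xs = np.sort(X, axis=0)
    return Xs[k:n - k].mean(axis=0)

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, size=(200, 5))     # inlier features around 0
outliers = rng.normal(8.0, 1.0, size=(20, 5))   # mislabeled samples, far away
X = np.vstack([clean, outliers])

# the plain mean is dragged toward the outliers; the trimmed mean stays near 0
assert np.abs(trimmed_mean(X)).max() < np.abs(X.mean(axis=0)).max()
```

A model that standardizes features with the robust estimate rather than the contaminated sample mean sees a distribution much closer to the clean one.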

On Leveraging Variational Graph Embeddings for Open World Compositional Zero-Shot Learning

  • Muhammad Umer Anwaar
  • Zhihui Pan
  • Martin Kleinsteuber

Humans are able to identify and categorize novel compositions of known concepts. The task in Compositional Zero-Shot Learning (CZSL) is to learn compositions of primitive concepts, i.e., objects and states, in such a way that even their novel compositions can be zero-shot classified. In this work, we do not assume any prior knowledge on the feasibility of novel compositions, i.e., the open-world setting, where infeasible compositions dominate the search space. We propose a Compositional Variational Graph Autoencoder (CVGAE) approach for learning variational embeddings of the primitive concepts (nodes) as well as the feasibility of their compositions (via edges). Such modelling makes CVGAE scalable to real-world application scenarios, in contrast to the SOTA method CGE, which is computationally very expensive: e.g., for the benchmark C-GQA dataset, CGE requires 3.94×10^5 nodes, whereas CVGAE requires only 1323 nodes. We learn a mapping of the graph and image embeddings onto a common embedding space. CVGAE adopts a deep metric learning approach and learns a similarity metric in this space via a bi-directional contrastive loss between projected graph and image embeddings. We validate the effectiveness of our approach on three benchmark datasets. We also demonstrate via an image retrieval task that the representations learnt by CVGAE are better suited for compositional generalization.

Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval

  • Feifei Zhang
  • Ming Yan
  • Ji Zhang
  • Changsheng Xu

Composed Query Based Image Retrieval (CQBIR) aims at searching images relevant to a composed query, i.e., a reference image together with a modifier text. Compared with conventional image retrieval, which takes a single image or text to retrieve desired images, CQBIR encounters more challenges, as it requires not only effective semantic correspondence between the heterogeneous query and target, but also synergistic understanding of the composed query. To establish a robust CQBIR model, four critical types of relational information can be exploited, i.e., cross-modal, intra-sample, inter-sample, and cross-sample relationships. Pioneering studies mainly exploit only part of this information, making it hard for the different relationships to enhance and complement each other. In this paper, we propose a comprehensive relationship reasoning network that fully explores the four types of information for CQBIR, built on two key designs. First, we introduce a memory-augmented cross-modal attention module, in which the representation of the composed query is augmented by considering the cross-modal relationship between the reference image and the modification text. Second, we design a multi-scale matching strategy to optimize our network, aiming at harnessing information from the intra-sample, inter-sample, and cross-sample relationships. To the best of our knowledge, this is the first work to fully explore the four types of relationships in a unified deep model for CQBIR. Comprehensive experimental results on five standard benchmarks demonstrate that the proposed method performs favorably against state-of-the-art models.

Image Understanding by Captioning with Differentiable Architecture Search

  • Ramtin Hosseini
  • Pengtao Xie

In deep learning applications, image understanding is a crucial task, and techniques such as image captioning and visual question answering have been widely studied to improve and evaluate the performance of deep neural networks (DNNs) in this area. In image captioning, models have encoder-decoder architectures, where the encoder takes the input image, produces embeddings, and feeds them into the decoder to generate a textual description. Manually designing a proper image captioning encoder-decoder architecture is difficult, owing to the complexity of recognizing the critical objects in the input images and their relationships when generating caption descriptions. To address this issue, we propose a three-level optimization method that employs differentiable architecture search strategies to automatically seek the most suitable architecture for image captioning. Our optimization framework involves three stages, performed end-to-end. In the first stage, an image captioning model learns and updates the weights of its encoder and decoder to create image captions. In the next stage, the trained encoder-decoder generates a pseudo image captioning dataset from unlabeled images, and the predictive model trains on the generated dataset to update its weights. Finally, the trained model validates its performance on the validation set and updates the encoder-decoder architecture by minimizing the validation loss. Experiments on the COCO image captioning dataset demonstrate that our method performs significantly better than the baselines and can achieve state-of-the-art results in image understanding tasks.

Atrous Pyramid Transformer with Spectral Convolution for Image Inpainting

  • Muqi Huang
  • Lefei Zhang

Owing to its natural ability to extract image features over long-range dependencies, the transformer can reconstruct the damaged areas of an image using information from the uncorrupted regions globally. In this paper, we propose a two-stage framework based on a novel atrous pyramid transformer (APT) for image inpainting that recovers the structure and texture of an image progressively. Specifically, the patches of APT blocks are embedded in an atrous pyramid manner to explicitly enhance both inter- and intra-window correlation, restoring the high-level semantic structures of images more precisely; the result serves as a guide map for the second phase. Subsequently, a dual spectral transform convolution (DSTC) module is designed to work together with APT to infer the low-level features of the generated areas. The DSTC module decouples the image signal into high- and low-frequency components, capturing texture information with a global view. Experiments on CelebA-HQ, Paris StreetView, and Places2 demonstrate the superiority of the proposed approach.

QuadTreeCapsule: QuadTree Capsules for Deep Regression Tracking

  • Ding Ma
  • Xiangqian Wu

Benefiting from their capability of capturing part-to-whole relationships, Capsule Networks have been successful in many vision tasks. However, their high computational complexity poses a significant obstacle to applying them to visual tracking, which requires fast inference. In this paper, we introduce the idea of QuadTree Capsules, which exploits the part-to-whole relationships endowed by the Capsule Network while significantly reducing computational complexity. We build capsule pyramids and select meaningful relationships in a coarse-to-fine manner, dubbed QuadTreeCapsule. Specifically, the top-K capsules with the highest activation values are selected, and routing is only calculated within the relevant regions corresponding to these top-K capsules via a novel symmetric guided routing algorithm. Additionally, considering the importance of temporal relationships, a multi-spectral pose matrix attention mechanism is developed for more accurate spatio-temporal capsule assignments between two sets of capsules. Moreover, during online inference, we shift part of the spatio-temporal capsules along the temporal dimension, facilitating information exchange among neighboring frames. Extensive experimentation demonstrates the effectiveness of our methodology, which achieves state-of-the-art results compared with other tracking methods on eight widely-used benchmarks. Our tracker runs at approximately 43 fps on a GPU.
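The top-K capsule selection step described above can be sketched with NumPy's partial selection (function and variable names are illustrative, not the tracker's implementation):

```python
import numpy as np

def select_top_k_capsules(activations, k):
    """Indices of the k capsules with the highest activations, ordered from
    highest to lowest; routing would then be restricted to their regions.
    An illustrative sketch, not the tracker's implementation."""
    idx = np.argpartition(activations, -k)[-k:]   # O(n) partial selection
    return idx[np.argsort(activations[idx])[::-1]]

acts = np.array([0.1, 0.9, 0.3, 0.7, 0.05])
top = select_top_k_capsules(acts, 2)              # capsules 1 and 3
```

Using `argpartition` rather than a full sort keeps the selection linear in the number of capsules, which matters when the capsule pyramid is large.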

End-to-End 3D Face Reconstruction with Expressions and Specular Albedos from Single In-the-wild Images

  • Qixin Deng
  • Binh H. Le
  • Aobo Jin
  • Zhigang Deng

Recovering 3D face models from in-the-wild face images has numerous potential applications. However, properly modeling complex real-world lighting effects, including specular lighting, shadows, and occlusions, from a single in-the-wild face image is still considered a wide-open research challenge. In this paper, we propose a convolutional neural network based framework to regress the face model from a single image in the wild. The output face model includes dense 3D shape, head pose, expression, diffuse albedo, specular albedo, and the corresponding lighting conditions. Our approach uses novel hybrid loss functions to disentangle face shape identities, expressions, poses, albedos, and lighting. Besides a carefully-designed ablation study, we also conduct direct comparison experiments to show that our method can outperform state-of-the-art methods both quantitatively and qualitatively.

Heterogeneous Learning for Scene Graph Generation

  • Yunqing He
  • Tongwei Ren
  • Jinhui Tang
  • Gangshan Wu

The Scene Graph Generation (SGG) task aims to construct a graph structure that expresses objects and their relationships in a scene at a holistic level. Due to the neglect of the heterogeneity of the feature spaces of objects and relations, coupling of feature representations is pronounced in current SGG methods, which results in large intra-class variation and inter-class ambiguity. In order to explicitly emphasize the heterogeneity in SGG, we propose a plug-and-play Heterogeneous Learning Branch (HLB), which enhances the independent representation capability of relation features. The HLB actively obscures the interconnection between object and relation feature spaces via gradient reversal, with the assistance of a link prediction module as an information barrier and an Auto Encoder for information preservation. To validate the effectiveness of HLB, we apply it to typical SGG methods whose feature spaces are either homogeneous or semi-heterogeneous, and conduct evaluation on the VG-150 dataset. The experimental results demonstrate that HLB significantly improves the performance of all these methods under the common evaluation criteria for the SGG task.

Equivariant and Invariant Grounding for Video Question Answering

  • Yicong Li
  • Xiang Wang
  • Junbin Xiao
  • Tat-Seng Chua

Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, making the visual-linguistic alignment behind the answering process obscure. Such black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?" Only a few works present visual explanations, and they do so in a post-hoc fashion, emulating the target model's answering process via an additional method.

Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene to yield answers, while setting aside the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, equivariant grounding encourages the answering to be sensitive to semantic changes in the causal scene and question; in contrast, invariant grounding enforces the answering to be insensitive to changes in the environment scene. By imposing them on the answering process, EIGV is able to distinguish the causal scene from the environment information and explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets justify the superiority of EIGV in terms of accuracy and visual interpretability over the leading baselines.

Align and Adapt: A Two-stage Adaptation Framework for Unsupervised Domain Adaptation

  • Yan Yu
  • Yuchen Zhai
  • Yin Zhang

Unsupervised domain adaptation aims to transfer knowledge from a labeled but heterogeneous source domain to an unlabeled target domain, alleviating labeling effort. Early advances in domain adaptation focused on invariant representation learning (IRL) methods to align domain distributions. Recent studies further utilize semi-supervised learning (SSL) methods to regularize domain-invariant representations based on the cluster assumption, making the category boundary clearer. However, the misalignment in IRL methods may be intensified by SSL methods if target instances lie closer to the wrong source centroid, resulting in incompatibility between these techniques. In this paper, we hypothesize that this phenomenon derives from the distraction of the source domain, and we propose a novel two-stage adaptation framework to adapt the model toward the target domain. Specifically, we propose DCAN to reduce the misalignment of IRL methods in the first stage, and PCST to encode the semantic structure of unlabeled target data in the second stage. Extensive experiments demonstrate that our method outperforms current state-of-the-art methods on four benchmarks (Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017).

Detach and Attach: Stylized Image Captioning without Paired Stylized Dataset

  • Yutong Tan
  • Zheng Lin
  • Peng Fu
  • Mingyu Zheng
  • Lanrui Wang
  • Yanan Cao
  • Weiping Wang

Stylized Image Captioning aims to generate captions that convey accurate image content and stylized elements simultaneously. However, large-scale datasets of paired images and stylized captions cost substantial resources to build and are usually unavailable, so generating stylized captions without a paired stylized caption dataset is a challenge. Previous work on controlling the style of generated captions in an unsupervised way falls into two categories: implicit and explicit control. The former mainly relies on a well-trained language model to capture style knowledge, which is limited to a single style and struggles with multi-style tasks. The latter therefore uses extra style constraints, such as outlined style labels or stylized words extracted from stylized sentences, to control the style rather than relying on a trained style-specific language model. However, certain styles, such as humor and romance, are implied by the whole sentence rather than by individual words. To address these problems, we propose a two-step Transformer-based method: first, detach style representations from a large-scale stylized text-only corpus to provide more holistic style supervision; second, attach the style representations to image content to generate stylized captions. We learn a shared image-text space to narrow the gap between the image and text modalities for better attachment. Due to the trade-off between semantics and style, we explore three injection methods for style representations to balance the two requirements of image content preservation and stylization. Experiments show that our method outperforms state-of-the-art systems in overall performance, especially on implied styles.

PixelSeg: Pixel-by-Pixel Stochastic Semantic Segmentation for Ambiguous Medical Images

  • Wei Zhang
  • Xiaohong Zhang
  • Sheng Huang
  • Yuting Lu
  • Kun Wang

Semantic segmentation tasks often have multiple output hypotheses for a single input image. Particularly in medical images, these ambiguities arise from unclear object boundaries or differences in physicians' annotations. Learning the distribution of annotations and automatically producing multiple plausible predictions is useful for assisting physicians in their decision-making. In this paper, we propose a semantic segmentation framework, PixelSeg, for modelling aleatoric uncertainty in segmentation maps and generating multiple plausible hypotheses. Unlike existing works, PixelSeg accomplishes the semantic segmentation task by sampling the segmentation map pixel by pixel, achieved via PixelCNN layers that capture the conditional distribution between pixels. We propose (1) a hierarchical architecture to model high-resolution segmentation maps more flexibly, (2) a fast autoregressive sampling algorithm to improve sampling efficiency by 96.2, and (3) a resampling module to further improve the quality and diversity of predictions. In addition, we demonstrate the advantages of PixelSeg in the novel area of interactive uncertainty segmentation, which is beyond the capabilities of existing models. Extensive experiments and state-of-the-art results on the LIDC-IDRI and BraTS 2017 datasets demonstrate the effectiveness of our proposed model.
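Pixel-by-pixel autoregressive sampling of a segmentation map can be illustrated with a toy conditional (everything here, including the left-neighbor coupling rule and all names, is a stand-in for a learned PixelCNN conditional, not the PixelSeg model):

```python
import numpy as np

def sample_segmentation(base_prob, coupling, shape, rng):
    """Toy pixel-by-pixel sampler: each pixel is foreground with probability
    base_prob, nudged toward its already-sampled left neighbor by `coupling`.
    A stand-in for a learned PixelCNN conditional, not the PixelSeg model."""
    h, w = shape
    seg = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            p = base_prob
            if j > 0:  # condition on the previously sampled pixel
                p = (1 - coupling) * base_prob + coupling * seg[i, j - 1]
            seg[i, j] = int(rng.random() < p)
    return seg

rng = np.random.default_rng(0)
seg = sample_segmentation(0.5, 0.8, (16, 16), rng)  # one plausible hypothesis
```

Calling the sampler repeatedly with different random states yields a set of distinct but internally coherent hypotheses, which is the behavior the framework exploits for ambiguous images.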

A Probabilistic Model for Controlling Diversity and Accuracy of Ambiguous Medical Image Segmentation

  • Wei Zhang
  • Xiaohong Zhang
  • Sheng Huang
  • Yuting Lu
  • Kun Wang

Medical image segmentation tasks often have more than one plausible annotation for a given input image due to inherent ambiguity. Generating multiple plausible predictions for a single image is of interest for critical medical applications. Many methods estimate the distribution of the annotation space by developing probabilistic models to generate multiple hypotheses. However, these methods improve the diversity of predictions at the expense of the more important accuracy. In this paper, we propose a novel probabilistic segmentation model, called Joint Probabilistic U-net, which achieves flexible control over the two abstract notions of diversity and accuracy. Specifically, we (i) model the joint distribution of images and annotations to learn a latent space, which is used to decouple diversity and accuracy, and (ii) transform the Gaussian distribution in the latent space into a more complex distribution to improve the model's expressiveness. In addition, we explore two strategies for preventing latent space collapse, which are effective in improving the model's performance on datasets with limited annotation. We demonstrate the effectiveness of the proposed model on two medical image datasets, i.e., LIDC-IDRI and ISBI 2016, and achieve state-of-the-art results on several metrics.

Crossmodal Few-shot 3D Point Cloud Semantic Segmentation

  • Ziyu Zhao
  • Zhenyao Wu
  • Xinyi Wu
  • Canyu Zhang
  • Song Wang

Recently, few-shot 3D point cloud semantic segmentation methods have been introduced to mitigate the limitations of existing fully supervised approaches, i.e., heavy dependence on labeled 3D data and poor capacity to generalize to new categories. However, these few-shot learning methods need one or a few labeled samples as support for testing. In practice, such data labeling usually requires manual annotation of large-scale points in 3D space, which can be very difficult and laborious. To address this problem, in this paper we introduce a novel crossmodal few-shot learning approach for 3D point cloud semantic segmentation. In this approach, the point cloud to be segmented is taken as the query, while one or a few labeled 2D RGB images are taken as support to guide its segmentation. This way, we only need to annotate a few 2D support images for the categories of interest. Specifically, we first convert the 2D support images into 3D point cloud format based on both appearance and estimated depth information. We then introduce a co-embedding network for extracting the features of support and query, both in 3D point cloud format, to close their domain gap. Finally, we compute the prototypes of the support and employ cosine similarity between the prototypes and the query features for the final segmentation. Experimental results on two widely-used benchmarks show that, with one or a few labeled 2D images as support, our proposed method achieves competitive results against existing few-shot 3D point cloud semantic segmentation methods.
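The final prototype-matching step can be sketched in NumPy (a simplified sketch of the matching only, with illustrative names and toy 2D features; the paper's co-embedding network is not reproduced here):

```python
import numpy as np

def prototype_segment(support_feats, support_labels, query_feats):
    """Label each query point by cosine similarity to class prototypes
    (mean support features). A simplified sketch of the matching step,
    not the full co-embedding network."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return classes[np.argmax(q @ protos.T, axis=1)]

support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.8, 0.2], [0.2, 0.8]])
pred = prototype_segment(support, labels, query)   # -> [0, 1]
```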

VQ-DcTr: Vector-Quantized Autoencoder With Dual-channel Transformer Points Splitting for 3D Point Cloud Completion

  • Ben Fei
  • Weidong Yang
  • Wen-Ming Chen
  • Lipeng Ma

Existing point cloud completion methods mainly utilize a global shape representation to recover the missing regions of a 3D shape from the partial point cloud. However, these methods learn global shape representations with continuous features, which conflicts with the inherently discrete nature of point clouds and hardly yields a high-quality structure for points. To address this challenge, we concentrate on discrete representations, which are potentially a more natural fit for the modalities of the point cloud. We therefore propose a Vector-Quantized Auto-Encoder with Dual-channel Transformer for point cloud completion (VQ-DcTr). VQ-DcTr uses discrete global features and exploits them in a well-structured generation process. Specifically, the vector quantization auto-encoder is integrated to learn a discrete latent representation along with the inductive biases inherent in the transformer-based auto-encoder. Using the decoded seeds from the auto-encoder, the dual-channel transformer leverages point-wise and channel-wise attention to learn the splitting patterns of the previous Dual-channel Transformer Points Splitting (DCTPS) layer and perform points splitting in the current DCTPS layer. In this way, we obtain a locally compact and structured point cloud by capturing the structural characteristics of the 3D shape in local patches. Extensive experiments on all standard benchmarks demonstrate, through qualitative and quantitative analysis, that VQ-DcTr outperforms state-of-the-art point cloud completion methods.
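The core vector-quantization lookup, mapping continuous features to discrete codes, can be sketched as follows (an illustrative sketch of standard VQ, not the VQ-DcTr implementation; straight-through gradient estimation is omitted):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry,
    yielding discrete latent codes. Illustrative sketch of standard VQ,
    not the VQ-DcTr implementation."""
    # pairwise squared distances between features (N, D) and codes (K, D)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, codes = vector_quantize(feats, codebook)   # codes -> [0, 1]
```

The discrete `codes` are what the downstream transformer consumes, which is what makes the representation discrete rather than continuous.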

Fine-grained Action Recognition with Robust Motion Representation Decoupling and Concentration

  • Baoli Sun
  • Xinchen Ye
  • Tiantian Yan
  • Zhihui Wang
  • Haojie Li
  • Zhiyong Wang

Fine-grained action recognition is a challenging task that requires identifying discriminative and subtle motion variations among fine-grained action classes. Existing methods typically focus on spatio-temporal feature extraction and long-temporal modeling to characterize complex spatio-temporal patterns of fine-grained actions. However, the learned spatio-temporal features without explicit motion modeling may emphasize more on visual appearance than on motion, which could compromise the learning of effective motion features required for fine-grained temporal reasoning. Therefore, how to decouple robust motion representations from the spatio-temporal features and further effectively leverage them to enhance the learning of discriminative features still remains less explored, which is crucial for fine-grained action recognition. In this paper, we propose a motion representation decoupling and concentration network (MDCNet) to address these two key issues. First, we devise a motion representation decoupling (MRD) module to disentangle the spatio-temporal representation into appearance and motion features through contrastive learning from video and segment views. Next, in the proposed motion representation concentration (MRC) module, the decoupled motion representations are further leveraged to learn a universal motion prototype shared across all the instances of each action class. Finally, we project the decoupled motion features onto all the motion prototypes through semantic relations to obtain the concentrated action-relevant features for each action class, which can effectively characterize the temporal distinctions of fine-grained actions for improved recognition performance. Comprehensive experimental results on four widely used action recognition benchmarks, i.e., FineGym, Diving48, Kinetics400 and Something-Something, clearly demonstrate the superiority of our proposed method in comparison with other state-of-the-art ones.

Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval

  • Sheng Fang
  • Shuhui Wang
  • Junbao Zhuo
  • Qingming Huang
  • Bin Ma
  • Xiaoming Wei
  • Xiaolin Wei

Due to the rapid growth of online video data, video-text retrieval techniques, which search for the most relevant video given a natural language caption and vice versa, are urgently needed. The major challenge of this task is identifying the true fine-grained semantic correspondence between videos and texts using only document-level correspondence. To deal with this issue, we propose a simple yet effective two-stream framework that takes concept information into account and introduces a new branch for semantic-level matching. We further propose a concept propagation mechanism for mining the latent semantics in videos and achieving enriched representations. Concept propagation is achieved by building a commonsense graph distilled from ConceptNet with concepts extracted from videos and captions. The original concepts of videos are detected by pretrained detectors to form the initial concept representations. By conducting attentional graph reasoning on the commonsense graph with the guidance of external knowledge, we can extend to new concepts in a detector-free manner to further enrich the video representations. In addition, a propagated BCE loss is designed to supervise the concept propagation procedure. Common space learning is then used for cross-modal matching. We conduct extensive experiments on various baseline models and several benchmark datasets. Promising experimental results demonstrate the effectiveness and generalization ability of our method.
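One step of attentional reasoning over a concept graph can be sketched in NumPy (a sketch of the idea with illustrative names; the paper's model additionally uses external-knowledge guidance and learned projections):

```python
import numpy as np

def propagate_concepts(node_feats, adjacency):
    """One step of attentional propagation over a concept graph: attention
    weights are softmaxed feature similarities, masked so that information
    only flows along graph edges. A sketch of the idea, not the paper's model."""
    d = node_feats.shape[1]
    scores = node_feats @ node_feats.T / np.sqrt(d)
    scores = np.where(adjacency > 0, scores, -np.inf)    # restrict to edges
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ node_feats

feats = np.eye(3)                      # three one-hot concept features
adj = np.array([[1, 1, 0],             # edges include self-loops
                [1, 1, 1],
                [0, 1, 1]])
out = propagate_concepts(feats, adj)   # each row mixes its neighbors' features
```

After propagation, a node's representation blends in its graph neighbors, which is how initially undetected concepts can acquire evidence from related ones.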

Domain Generalization via Frequency-domain-based Feature Disentanglement and Interaction

  • Jingye Wang
  • Ruoyi Du
  • Dongliang Chang
  • Kongming Liang
  • Zhanyu Ma

Adaptation to out-of-distribution data is a meta-challenge for all statistical learning algorithms that strongly rely on the i.i.d. assumption. It leads to unavoidable labor costs and confidence crises in realistic applications. Domain generalization therefore aims at mining domain-irrelevant knowledge from multiple source domains that can generalize to unseen target domains. In this paper, by leveraging the frequency domain of an image, we work with two key observations: (i) the high-frequency information of an image depicts object edge structure, which preserves high-level semantic information of the object and is naturally consistent across different domains, and (ii) the low-frequency component retains the object's smooth structure, but this information is susceptible to domain shifts. Motivated by these observations, we introduce (i) an encoder-decoder structure to disentangle the high- and low-frequency features of an image, (ii) an information interaction mechanism to ensure that the helpful knowledge from both parts cooperates effectively, and (iii) a novel data augmentation technique operating in the frequency domain to encourage the robustness of frequency-wise feature disentangling. The proposed method obtains state-of-the-art performance on three widely used domain generalization benchmarks (Digit-DG, Office-Home, and PACS).
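The high/low-frequency decomposition underlying these observations can be sketched with a fixed FFT mask (illustrative only; the paper's encoder-decoder learns the disentanglement rather than using a hand-set radius):

```python
import numpy as np

def split_frequency(image, radius):
    """Split an image into low- and high-frequency components with a centered
    circular mask in the 2D FFT domain. Illustrative; the paper's encoder-
    decoder learns the disentanglement rather than using a fixed mask."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~mask)).real
    return low, high

img = np.arange(64, dtype=float).reshape(8, 8)
low, high = split_frequency(img, 2)   # the two bands sum back to the image
```

Because the mask and its complement partition the spectrum, the two components reconstruct the original image exactly, so nothing is lost by the split.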

Immunofluorescence Capillary Imaging Segmentation: Cases Study

  • Runpeng Hou
  • Ziyuan Ye
  • Chengyu Yang
  • Linhao Fu
  • Chao Liu
  • Quanying Liu

Nonunion is one of the challenges faced by orthopedics clinics, owing to the technical difficulty and high cost of photographing interosseous capillaries. Segmenting vessels and filling capillaries are critical to understanding the obstacles encountered in capillary growth. However, existing datasets for blood vessel segmentation mainly focus on the large blood vessels of the body, and the lack of labeled capillary image datasets greatly limits the methodological development and applications of vessel segmentation and capillary filling. Here, we present a benchmark dataset, named IFCIS-155, consisting of 155 2D capillary images with segmentation boundaries and vessel fillings annotated by biomedical experts, and 19 large-scale, high-resolution 3D capillary images. To obtain better images of interosseous capillaries, we leverage state-of-the-art immunofluorescence imaging techniques to highlight their rich vascular morphology. We conduct comprehensive experiments to verify the effectiveness of the dataset and the benchmark deep learning models (e.g., UNet/UNet++ and modified UNet/UNet++). Our work offers a benchmark dataset for training deep learning models for capillary image segmentation and provides a potential tool for future capillary research. The IFCIS-155 dataset and code are publicly available at

Imitated Detectors: Stealing Knowledge of Black-box Object Detectors

  • Siyuan Liang
  • Aishan Liu
  • Jiawei Liang
  • Longkang Li
  • Yang Bai
  • Xiaochun Cao

Deep neural networks have shown great potential in many practical applications, yet their knowledge is at risk of being stolen via exposed services (e.g., APIs). In contrast to the commonly-studied extraction of classification models, there are no studies on the more challenging object detection task, owing to the difficulty of collecting sufficient problem-domain data efficiently. In this paper, we for the first time reveal that black-box victim object detectors can be easily replicated without knowing the model structure or training data. In particular, we treat the problem as black-box knowledge distillation and propose a teacher-student framework named Imitated Detector to transfer the knowledge of the victim model to the imitated model. To accelerate problem-domain data construction, we extend the problem-domain dataset by generating synthetic images, applying a text-to-image generation process with short text inputs consisting of object categories and natural scenes. To enrich the feedback information, we fully mine the latent knowledge of the victim model by introducing an iterative adversarial attack strategy, feeding victim models transferable adversarial examples that make the victim provide diversified predictions with more information. Extensive experiments on multiple datasets in different settings demonstrate that our approach achieves the highest model extraction accuracy and outperforms other model stealing methods by large margins on the problem-domain dataset. Our code can be found at

Boosting Single-Frame 3D Object Detection by Simulating Multi-Frame Point Clouds

  • Wu Zheng
  • Li Jiang
  • Fanbin Lu
  • Yangyang Ye
  • Chi-Wing Fu

To boost a detector for single-frame 3D object detection, we present a new approach to train it to simulate features and responses following a detector trained on multi-frame point clouds. Our approach needs multi-frame point clouds only when training the single-frame detector, and once trained, it can detect objects with only single-frame point clouds as inputs during the inference. For this purpose, we design a novel Simulated Multi-Frame Single-Stage object Detector (SMF-SSD) framework: multi-view dense object fusion to densify ground-truth objects to generate a multi-frame point cloud; self-attention voxel distillation to facilitate one-to-many knowledge transfer from multi- to single-frame voxels; multi-scale BEV feature distillation to transfer knowledge in low-level spatial and high-level semantic BEV features; and adaptive response distillation to activate single-frame responses of high confidence and accurate localization. Experimental results on the Waymo test set show that our SMF-SSD consistently outperforms all state-of-the-art single-frame 3D object detectors for all object classes of difficulty levels 1 and 2 in terms of both mAP and mAPH.
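The feature- and response-distillation terms described above can be sketched in NumPy (all names, the confidence threshold, and the equal weighting are illustrative, not the paper's exact formulation):

```python
import numpy as np

def distillation_loss(student_feat, teacher_feat, student_resp, teacher_resp,
                      conf_thresh=0.5):
    """Sketch of two distillation terms: an MSE between BEV feature maps,
    plus a response term applied only where the multi-frame teacher is
    confident. Names, threshold, and equal weighting are illustrative,
    not the paper's exact formulation."""
    feat_loss = ((student_feat - teacher_feat) ** 2).mean()
    confident = teacher_resp > conf_thresh          # keep confident responses
    resp_loss = (((student_resp - teacher_resp)[confident]) ** 2).mean() \
        if confident.any() else 0.0
    return feat_loss + resp_loss

student_bev = np.ones((4, 4))
teacher_bev = np.full((4, 4), 1.5)
loss = distillation_loss(student_bev, teacher_bev,
                         np.array([0.2, 0.9]), np.array([0.3, 0.8]))
```

Masking the response term by teacher confidence keeps the single-frame student from imitating the teacher's uncertain outputs, in the spirit of the adaptive response distillation described above.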

Towards Complex Document Understanding By Discrete Reasoning

  • Fengbin Zhu
  • Wenqiang Lei
  • Fuli Feng
  • Chao Wang
  • Haozhou Zhang
  • Tat-Seng Chua

Document Visual Question Answering (VQA) aims to answer questions over visually-rich documents. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured tables and unstructured text, as well as 16,558 question-answer pairs. The documents are sampled from financial reports and contain many numbers, meaning that discrete reasoning capability is required to answer the questions. Based on TAT-DQA, we further develop a novel model named MHST that takes multimodal information into account to address different types of questions with corresponding strategies, i.e., extraction or reasoning. Experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that our TAT-DQA dataset will facilitate research on understanding visually-rich documents, especially for scenarios that require discrete reasoning, and we hope the proposed model will inspire researchers to design more advanced Document VQA models in the future.

RPPformer-Flow: Relative Position Guided Point Transformer for Scene Flow Estimation

  • Hanlin Li
  • Guanting Dong
  • Yueyi Zhang
  • Xiaoyan Sun
  • Zhiwei Xiong

Estimating scene flow for point clouds is one of the key problems in 3D scene understanding and autonomous driving. Recently, the point transformer architecture has become a popular and successful solution for 3D computer vision tasks, e.g., point cloud object detection and completion, but its application to scene flow estimation is rarely explored. In this work, we provide a fully transformer-based solution for scene flow estimation. We first introduce a novel relative-position-guided point attention mechanism. Then, to relax the memory consumption in practice, we provide an efficient implementation of the proposed point attention layer via matrix factorization and nearest-neighbor sampling. Finally, we build a pyramid transformer, named RPPformer-Flow, to estimate the scene flow between two consecutive point clouds in a coarse-to-fine manner. We evaluate RPPformer-Flow on the FlyingThings3D and KITTI Scene Flow 2015 benchmarks. Experimental results show that our method outperforms previous state-of-the-art methods by large margins.

mmLayout: Multi-grained MultiModal Transformer for Document Understanding

  • Wenjin Wang
  • Zhengjie Huang
  • Bin Luo
  • Qianglong Chen
  • Qiming Peng
  • Yinxu Pan
  • Weichong Yin
  • Shikun Feng
  • Yu Sun
  • Dianhai Yu
  • Yin Zhang

Recent efforts on multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. First, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained elements and, after further processing, fused back into the fine-grained ones for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.

Boosting Video-Text Retrieval with Explicit High-Level Semantics

  • Haoran Wang
  • Di Xu
  • Dongliang He
  • Fu Li
  • Zhong Ji
  • Jungong Han
  • Errui Ding

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant video (text) given a query (video). Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts: occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets, including MSR-VTT, MSVD and DiDeMo.

Rethinking the Mechanism of the Pattern Pruning and the Circle Importance Hypothesis

  • Hengyi Zhou
  • Longjun Liu
  • Haonan Zhang
  • Nanning Zheng

Network pruning is an effective and widely-used model compression technique. Pattern pruning is a new sparsity-dimension pruning approach whose compression ability has been proven in some prior works. However, a detailed study on "pattern" and pattern pruning is still lacking. In this paper, we analyze the mechanism behind pattern pruning. Our analysis reveals that the effectiveness of pattern pruning should be attributed to finding the less important weights even before training. Then, motivated by the fact that the retinal ganglion cells in the biological visual system have approximately concentric receptive fields, we further investigate and propose the Circle Importance Hypothesis to guide the design of efficient patterns. We also design two series of special efficient patterns - circle patterns and semicircle patterns. Moreover, inspired by the neural architecture search technique, we propose a novel one-shot gradient-based pattern pruning algorithm. Besides, we expand depthwise convolutions with our circle patterns, which improves the accuracy of networks with little extra memory cost. Extensive experiments are performed to validate our hypotheses and the effectiveness of the proposed methods. For example, we reduce the FLOPs of ResNet-56 by 44.0% while improving its accuracy to 94.38% on CIFAR-10, and we reduce the FLOPs of ResNet-18 by 41.0% with only a 1.11% accuracy drop on ImageNet.

A Region-based Document VQA

  • Xinya Wu
  • Duo Zheng
  • Ruonan Wang
  • Jiashen Sun
  • Minzhen Hu
  • Fangxiang Feng
  • Xiaojie Wang
  • Huixing Jiang
  • Fan Yang

Practical Document Visual Question Answering (DocVQA) needs not only to recognize and extract the document contents, but also to reason on them for answering questions. However, previous DocVQA data mainly focus on in-line questions, where the answers can be directly extracted after locating keywords in the documents, which needs less reasoning. This paper therefore builds a large-scale dataset named Region-based Document VQA (RDVQA), which includes more practical questions for DocVQA. We then propose a novel Reason-over-In-region-Question-answering (ReIQ) model for addressing the problems. It is a pre-training-based model, where a Spatial-Token Pre-trained Model (STPM) is employed as the backbone. Two novel pre-training tasks, Masked Text Box Regression and Shuffled Triplet Reconstruction, are proposed to learn the entailment relationship between text blocks and tokens as well as contextual information, respectively. Moreover, a DocVQA State Tracking Module (DocST) is also proposed to track the DocVQA state in the fine-tuning stage. Experimental results show that our model improves the performance on RDVQA significantly, although more work should be done for practical DocVQA as shown in RDVQA.

CyclicShift: A Data Augmentation Method For Enriching Data Patterns

  • Hui Lu
  • Xuan Cheng
  • Wentao Xia
  • Pan Deng
  • MingHui Liu
  • Tianshu Xie
  • XiaoMin Wang
  • Ming Liu

In this paper, we propose a simple yet effective data augmentation strategy, dubbed CyclicShift, to enrich data patterns. The idea is to shift the image in a certain direction and then circularly refill the resultant out-of-frame part to the other side. Compared with previous related methods, Translation and Shuffle, our proposed method is able to avoid losing pixels of the original image and preserve its semantic information as much as possible. Visually and empirically, we show that our method indeed brings new data patterns and thereby improves the generalization ability as well as the performance of models. Extensive experiments demonstrate our method's effectiveness in image classification and fine-grained recognition over multiple datasets and various network architectures. Furthermore, our method can also be superimposed on other data augmentation methods in a very simple way. CyclicMix, the simultaneous use of CyclicShift and CutMix, hits a new high in most cases. Our code is open-source and available at
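The shift-and-refill operation described in the abstract can be sketched in a few lines. This toy version (assuming an H x W image array; the function name and parameters are our illustration, not the authors' released code) shows how a circular shift keeps every pixel:

```python
import numpy as np

def cyclic_shift(image: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an image by (dy, dx) and circularly refill the
    out-of-frame part on the opposite side, so no pixels are lost."""
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

img = np.arange(12).reshape(3, 4)      # toy 3x4 "image"
shifted = cyclic_shift(img, dx=1, dy=0)
# the last column wraps around to become the first column
```

Because the wrap is lossless, the augmented image contains exactly the same pixel values as the original, only rearranged, which is the property the paper contrasts against Translation (which drops pixels).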

Counterexample Contrastive Learning for Spurious Correlation Elimination

  • Jinqiang Wang
  • Rui Hu
  • Chaoquan Jiang
  • Jitao Sang

Biased datasets lead models to learn bias features highly correlated with labels, which deteriorates performance, especially when the test data deviates from the training distribution. Most existing solutions resort to introducing additional data to explicitly balance the dataset, e.g., counterfactually generating augmented data. In this paper, we argue that there actually exist valuable samples within the original dataset which have the potential to help models circumvent spurious correlations. We refer to observed samples whose bias-task correspondences are inconsistent with those of the majority of samples as counterexamples. By analyzing when and how counterexamples assist in circumventing spurious correlations, we propose Counterexample Contrastive Learning (CounterCL) to exploit the limited observed counterexamples to regulate feature representation. Specifically, CounterCL pulls counterexamples close to samples with different bias features in the same class and, at the same time, pushes them away from samples with the same bias features in different classes. Quantitative and qualitative experiments validate the effectiveness of our method and demonstrate its compatibility with other debiasing solutions.
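The pull/push rule in the abstract can be illustrated with a toy InfoNCE-style loss for a single counterexample anchor. Only the pair-selection logic (same class, different bias as positives; different class, same bias as negatives) comes from the abstract; the function name, loss form, and temperature are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def countercl_loss(feat, labels, biases, anchor_idx, tau=0.1):
    """Toy contrastive loss for one counterexample anchor.

    Positives: same class label, different bias feature (pull close).
    Negatives: different class label, same bias feature (push away).
    The selection rule follows the abstract; the rest is illustrative.
    """
    f = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    sim = f @ f[anchor_idx] / tau                      # cosine / temperature
    pos = (labels == labels[anchor_idx]) & (biases != biases[anchor_idx])
    neg = (labels != labels[anchor_idx]) & (biases == biases[anchor_idx])
    pos[anchor_idx] = False                            # exclude the anchor itself
    denom = np.exp(sim[pos]).sum() + np.exp(sim[neg]).sum()
    # average over positives, as in supervised contrastive learning
    return float(-np.mean(sim[pos] - np.log(denom)))
```

Minimizing this value increases similarity to the positives relative to the negatives, which is the regulating effect on the feature space that the paper describes.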

MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation

  • Tao Jin
  • Zhou Zhao
  • Meng Zhang
  • Xingshan Zeng

One of the challenging factors in real applications of sign language translation (SLT) is inter-signer variation. Under the assumption that a pre-trained translation model cannot cover all signers, the adaptation capability for unseen signers is of great concern. In this paper, we take a completely different perspective on SLT, called signer-adaptive SLT, which mainly considers the transferability of SLT systems. To attack this challenging problem, we propose MC-SLT, a novel meta-learning framework that exploits additional new-signer data via a support set and outputs a signer-adaptive model via a few-gradient-step update. Considering the varying degrees of style discrepancy with which different signers perform the same words, we further devise diversity-aware meta-adaptive weights for the token-wise cross-entropy losses. Besides, to improve training robustness, we adopt a self-guided curriculum learning scheme that first captures the global curricula from each signer to avoid falling into a bad local optimum early, and then learns the curricula of individualities to improve the model's adaptability for learning signer-specific knowledge. We reconstruct the existing standard datasets of SLT for the signer-adaptive setting and establish a new benchmark for subsequent research.

Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval

  • Yang Qin
  • Dezhong Peng
  • Xi Peng
  • Xu Wang
  • Peng Hu

Cross-modal retrieval has been a compelling topic in the multimodal community. Recently, to mitigate the high cost of data collection, co-occurring pairs (e.g., image and text) have been collected from the Internet to form large-scale cross-modal datasets, e.g., Conceptual Captions. However, this unavoidably introduces noise (i.e., mismatched pairs) into the training data, dubbed noisy correspondence. Unquestionably, such noise makes supervision information unreliable/uncertain and remarkably degrades performance. Besides, most existing methods focus training on hard negatives, which amplifies the unreliability of the noise. To address these issues, we propose a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning. CEL captures and learns the uncertainty brought by noise to improve the robustness and reliability of cross-modal retrieval. Specifically, the bidirectional evidence based on cross-modal similarity is first modeled and parameterized into a Dirichlet distribution, which not only provides accurate uncertainty estimation but also imparts resilience to perturbations from noisy correspondence. To address the amplification problem, RDH smoothly increases the hardness of the negatives focused on, thus achieving higher robustness against high noise. Extensive experiments are conducted on three image-text benchmark datasets, i.e., Flickr30K, MS-COCO, and Conceptual Captions, to verify the effectiveness and efficiency of the proposed method. The code is available at

CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling

  • Hongyu Gao
  • Chao Zhu
  • Mengyin Liu
  • Weibo Gu
  • Hongfa Wang
  • Wei Liu
  • Xu-cheng Yin

Image-text retrieval is an essential task of information retrieval, in which models with Vision-and-Language Pretraining (VLP) are able to achieve ideal accuracy compared with those without VLP. Among different VLP approaches, the single-stream models achieve the overall best retrieval accuracy, but suffer from slower inference speed. Recently, researchers have introduced the two-stage retrieval setting commonly used in the information retrieval field to the single-stream VLP model for a better accuracy/efficiency trade-off. However, the retrieval accuracy and efficiency are still unsatisfactory, mainly due to the limitations of the patch-based visual unimodal encoder in these VLP models. The unimodal encoders are trained on pure visual data, so the visual features extracted by them are difficult to align with the textual features, and it is also difficult for the multi-modal encoder to understand the visual information. Under these circumstances, we propose an accurate and efficient two-stage image-text retrieval model via Contrastive Alignment and visual Contexts modeling (CAliC). In the first stage of the proposed model, the visual unimodal encoder is pretrained with cross-modal contrastive learning to extract easily aligned visual features, which improves the retrieval accuracy and the inference speed. In the second stage, we introduce a new visual contexts modeling task during pretraining to help the multi-modal encoder better understand the visual information and make more accurate predictions. Extensive experimental evaluation validates the effectiveness of our proposed approach, which achieves higher retrieval accuracy while keeping a faster inference speed, and outperforms existing state-of-the-art retrieval methods on image-text retrieval tasks over the Flickr30K and COCO benchmarks.

Correspondence Matters for Video Referring Expression Comprehension

  • Meng Cao
  • Ji Jiang
  • Long Chen
  • Yuexian Zou

We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent objects described in a sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects. To this end, we propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners. First, we aim to build the inter-frame correlations for all existing instances within the frames. Specifically, we compute the inter-frame patch-wise cosine similarity to estimate the dense alignment and then perform inter-frame contrastive learning to map them close in feature space. Second, we propose to build a fine-grained patch-word alignment to associate each patch with certain words. Due to the lack of such detailed annotations, we also predict the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play functions and are applicable to any video REC architecture. For example, by building on top of Co-grounding, we boost the performance by 1.48% absolute improvement on Accu.@0.5 for the VID-Sentence dataset.

Point to Rectangle Matching for Image Text Retrieval

  • Zheng Wang
  • Zhenwei Gao
  • Xing Xu
  • Yadan Luo
  • Yang Yang
  • Heng Tao Shen

Image-text retrieval is further complicated by the phenomenon of one-to-many correspondence, where a given query may match multiple semantic manifestations in the other modality. However, the prevailing methods adopt a deterministic embedding strategy to retrieve the most similar candidate, encoding the representations of different modalities as single points in vector space. We argue that, despite its noticeable progress, such a deterministic point mapping is insufficient to represent a potential set of retrieval results under one-to-many correspondence. As a remedy, we propose a Point to Rectangle Matching (abbreviated as P2RM) mechanism, which is essentially a geometric representation learning method for image-text retrieval. Our intuitive insight is that the representations of different modalities can be extended to rectangles, so that a set of points inside such a rectangle embedding can be semantically related to many candidate correspondences. Thus our P2RM method can essentially address the one-to-many correspondence. Besides, we design a novel semantic similarity measurement method from the perspective of distance for our rectangle embedding. Under the evaluation metric for multiple matches, extensive experiments and ablation studies on two commonly used benchmarks demonstrate our effectiveness and superiority in tackling the multiplicity of image-text retrieval.
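A distance-based similarity for rectangle embeddings can be illustrated with the standard point-to-axis-aligned-box distance, which is zero for every point inside the box. This is a hypothetical sketch of one way such a measurement could work, not the paper's actual formulation:

```python
import numpy as np

def point_to_rect_distance(p, lo, hi):
    """Distance from point p to the axis-aligned rectangle [lo, hi].

    The distance is zero whenever p lies inside the rectangle, so every
    point of the box counts as an equally good match -- the geometric
    property that lets one embedding cover many correspondences.
    """
    d = np.maximum(np.maximum(lo - p, p - hi), 0.0)
    return float(np.linalg.norm(d))

lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
inside = point_to_rect_distance(np.array([0.5, 0.5]), lo, hi)   # 0.0
outside = point_to_rect_distance(np.array([2.0, 1.0]), lo, hi)  # 1.0
```

Under such a measure, a text query embedded as a point is "similar" to every image whose rectangle contains it, rather than to a single nearest point.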

Shifting Perspective to See Difference: A Novel Multi-view Method for Skeleton based Action Recognition

  • Ruijie Hou
  • Yanran Li
  • Ningyu Zhang
  • Yulin Zhou
  • Xiaosong Yang
  • Zhao Wang

Skeleton-based human action recognition is a longstanding challenge due to its complex dynamics. Some fine-grained details of the dynamics play a vital role in classification. Existing work largely focuses on designing incremental neural networks with more complicated adjacency matrices to capture the details of joint relationships. However, they still have difficulty distinguishing actions that have broadly similar motion patterns but belong to different categories. Interestingly, we found that the subtle differences in motion patterns can be significantly amplified and become easy for an observer to distinguish through specified view directions, a property that has not been fully explored before. Drastically different from previous work, we boost performance by proposing a conceptually simple yet effective multi-view strategy that recognizes actions from a collection of dynamic view features. Specifically, we design a novel Skeleton-Anchor Proposal (SAP) module which contains a multi-head structure to learn a set of views. For feature learning under different views, we introduce a novel Angle Representation to transform the actions under different views and feed the transformations into the baseline model. Our module can work seamlessly with existing action classification models. Incorporated with baseline models, our SAP module exhibits clear performance gains on many challenging benchmarks. Moreover, comprehensive experiments show that our model consistently outperforms the state of the art and remains effective and robust, especially when dealing with corrupted data. Related code will be available on

Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models

  • Yi Zhang
  • Junyang Wang
  • Jitao Sang

Vision-Language Pre-training (VLP) models have achieved state-of-the-art performance in numerous cross-modal tasks. Since they are optimized to capture the statistical properties of intra- and inter-modality data, they also risk learning the social biases present in the data. In this work, we (1) introduce a counterfactual-based bias measurement, CounterBias, to quantify the social bias in VLP models by comparing the [MASK]ed prediction probabilities of factual and counterfactual samples; (2) construct a novel VL-Bias dataset including 24K image-text pairs for measuring gender bias in VLP models, from which we observe that significant gender bias is prevalent in VLP models; and (3) propose a VLP debiasing method, FairVLP, which minimizes the difference in [MASK]ed prediction probabilities between factual and counterfactual image-text pairs. Although CounterBias and FairVLP focus on social bias, they are generalizable tools that provide new insights for probing and regularizing more knowledge in VLP models.

Towards Adversarial Attack on Vision-Language Pre-training Models

  • Jiaming Zhang
  • Qi Yi
  • Jitao Sang

While vision-language pre-training models (VLPs) have shown revolutionary improvements on various vision-language (V+L) tasks, studies regarding their adversarial robustness remain largely unexplored. This paper studies adversarial attacks on popular VLP models and V+L tasks. First, we analyze the performance of adversarial attacks under different settings. By examining the influence of different perturbed objects and attack targets, we draw key observations that offer guidance both for designing strong multimodal adversarial attacks and for constructing robust VLP models. Second, we propose a novel multimodal attack method on VLP models called Collaborative Multimodal Adversarial Attack (Co-Attack), which collectively carries out attacks on the image modality and the text modality. Experimental results demonstrate that the proposed method achieves improved attack performance on different V+L downstream tasks and VLP models. The observations and the novel attack method hopefully provide new insights into the adversarial robustness of VLP models, so as to contribute to their safe and reliable deployment in more real-world scenarios.

TPSNet: Reverse Thinking of Thin Plate Splines for Arbitrary Shape Scene Text Representation

  • Wei Wang
  • Yu Zhou
  • Jiahao Lv
  • Dayan Wu
  • Guoqing Zhao
  • Ning Jiang
  • Weiping Wang

The research focus of scene text detection and recognition has shifted to arbitrary-shape text in recent years, where the text shape representation is a fundamental problem. In our opinion, an ideal representation should be compact, complete, efficient, and reusable for subsequent recognition. However, previous representations have flaws in one or more of these aspects. The Thin-Plate-Spline (TPS) transformation has achieved great success in scene text recognition. Inspired by this, we reverse its usage and take TPS as an exquisite representation for arbitrary-shape text. The TPS representation is compact, complete, and efficient. With the predicted TPS parameters, the detected text region can be directly rectified to a near-horizontal one to assist subsequent recognition. To further exploit the potential of the TPS representation, the Border Alignment Loss is proposed. Based on these designs, we implement the text detector TPSNet, which can be conveniently extended to a text spotter. Extensive evaluation and ablation on several public benchmarks demonstrate the effectiveness and superiority of the proposed method for text representation and spotting. In particular, TPSNet achieves a detection F-measure improvement of 4.4% (78.4% vs. 74.0%) on the ArT dataset and an end-to-end spotting F-measure improvement of 5.0% (78.5% vs. 73.5%) on Total-Text, which are large margins with no bells and whistles. The source code will be available.

Efficient Modeling of Future Context for Image Captioning

  • Zhengcong Fei

Existing approaches to image captioning usually generate the sentence word-by-word from left to right, conditioned on local context including the given image and the history of generated words. Many studies have aimed to make use of global information during decoding, e.g., iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To address this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations with a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, AIC and NAIC models are first trained jointly with shared visual encoders, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, which follows a teacher-student paradigm and is optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at:

Relative Pose Estimation for Multi-Camera Systems from Point Correspondences with Scale Ratio

  • Banglei Guan
  • Ji Zhao

The use of multi-camera systems is becoming more common in self-driving cars, micro aerial vehicles and augmented reality headsets. In order to perform 3D geometric tasks, the accuracy and efficiency of relative pose estimation algorithms are very important for multi-camera systems, and have been attracting significant research attention. The point coordinates of point correspondences (PCs) obtained from feature matching strategies have been widely used for relative pose estimation. This paper exploits known scale ratios besides the point coordinates, which are also intrinsically provided by scale-invariant feature detectors (e.g., SIFT). The two-view geometry of the scale ratio associated with the extracted features is derived for multi-camera systems. Thanks to the constraints provided by the scale ratio across two views, the number of PCs needed for relative pose estimation is reduced from 6 to 3. Requiring fewer PCs makes RANSAC-like randomized robust estimation significantly faster. For different point correspondence layouts, four minimal solvers are proposed for typical two-camera rigs. Extensive experiments demonstrate that our solvers have better accuracy than the state-of-the-art ones and outperform them in terms of processing time.

Towards Open-Ended Text-to-Face Generation, Combination and Manipulation

  • Jun Peng
  • Han Pan
  • Yiyi Zhou
  • Jing He
  • Xiaoshuai Sun
  • Yan Wang
  • Yongjian Wu
  • Rongrong Ji

Text-to-face (T2F) generation is an emerging research hotspot in multimedia, and its main challenge lies in the high fidelity requirement of generated portraits. Many existing works resort to exploring the latent space in a pre-trained generator, e.g., StyleGAN, which has obvious shortcomings in efficiency and generalization ability. In this paper, we propose a generative network for open-ended text-to-face generation, termed OpenFaceGAN. Differing from existing StyleGAN-based methods, OpenFaceGAN constructs an effective multi-modal latent space that directly converts the natural language description into a face. This mapping paradigm can fit the real data distribution well and makes the model capable of open-ended and even zero-shot T2F generation. Our method improves inference speed by an order of magnitude, e.g., 294 times faster than TediGAN. Based on OpenFaceGAN, we further explore text-guided face manipulation (editing). In particular, we propose a parameterized module, OpenEditor, to automatically disentangle the target latent code and update the original style information. OpenEditor also makes OpenFaceGAN directly applicable to most manipulation instructions without example-dependent searches or optimizations, greatly improving the efficiency of face manipulation. We conduct extensive experiments on two benchmark datasets, namely Multi-Modal CelebA-HQ and Face2Text-v1.0. The experimental results not only show the superior performance of OpenFaceGAN over existing T2F methods in both image quality and image-text matching degree, but also greatly confirm its outstanding ability in zero-shot generation. Codes will be released at:

Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval

  • Dongqing Wu
  • Huihui Li
  • Cang Gu
  • Lei Guo
  • Hang Liu

In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework which fuses region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses the different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art performance and pushes the performance of image-text retrieval to a new height.

A Numerical DEs Perspective on Unfolded Linearized ADMM Networks for Inverse Problems

  • Weixin An
  • Yingjie Yue
  • Yuanyuan Liu
  • Fanhua Shang
  • Hongying Liu

Many research works show that continuous-time Differential Equations (DEs) allow for a better understanding of the traditional Alternating Direction Method of Multipliers (ADMM), and many unfolded algorithms directly inherit the traditional iterations to build deep networks. Although they obtain a faster convergence rate and superior practical performance, an appropriate explanation of the unfolded network architectures is lacking. Thus, we explore the connection between the existing unfolded Linearized ADMM (LADMM) and numerical DEs, and propose efficient unfolded network design schemes. First, we present an unfolded Euler LADMM scheme as a by-product, which originates from the Euler method for solving first-order DEs. Then, inspired by the trapezoid method in numerical DEs, we design a new, more effective network scheme, called the unfolded Trapezoid LADMM scheme. Moreover, we show that the Trapezoid LADMM scheme has higher precision than the Euler LADMM scheme. To the best of our knowledge, this is the first work to explore the connection between unfolded ADMMs and numerical DEs with theoretical guarantees. Finally, we instantiate our Euler LADMM and Trapezoid LADMM schemes into ELADMM and TLADMM with proximal operators, and into ELADMM-Net and TLADMM-Net with convolutional neural networks. Extensive experiments show that our algorithms are competitive with state-of-the-art methods.
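The precision gap between the two schemes mirrors the classic one-step ODE integrators they are named after. A minimal numerical sketch (these are the standard textbook Euler and explicit-trapezoid methods, not the unfolded networks themselves) shows why a trapezoid-style update is more accurate per step:

```python
import math

def euler_step(f, x, h):
    """Forward Euler update: first-order accurate, one slope evaluation."""
    return x + h * f(x)

def trapezoid_step(f, x, h):
    """Explicit trapezoid (Heun) update: averages the slopes at both ends
    of the step, giving second-order accuracy -- the extra precision that
    motivates a Trapezoid scheme over an Euler one."""
    x_pred = x + h * f(x)                    # Euler predictor
    return x + 0.5 * h * (f(x) + f(x_pred))

# integrate dx/dt = -x from x(0) = 1 to t = 1; exact solution is exp(-1)
f = lambda x: -x
xe = xt = 1.0
for _ in range(100):
    xe = euler_step(f, xe, 0.01)
    xt = trapezoid_step(f, xt, 0.01)
err_euler = abs(xe - math.exp(-1))
err_trap = abs(xt - math.exp(-1))
```

With the same step count, the trapezoid iterate lands orders of magnitude closer to the exact solution, which is the discretization-accuracy argument carried over to the unfolded network design.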

UDoc-GAN: Unpaired Document Illumination Correction with Background Light Prior