MM '22: Proceedings of the 30th ACM International Conference on Multimedia

SESSION: Keynote Talks

Alexa, let's work together! How Alexa Helps Customers Complete Tasks with Verbal and Visual Guidance in the Alexa Prize TaskBot Challenge

  • Yoelle Maarek

In this talk, I will present the Alexa Prize TaskBot Challenge, which allows selected academic teams to develop TaskBots. TaskBots are agents that interact with Alexa users who require assistance (via "Alexa, let's work together") to complete everyday tasks requiring multiple steps and decisions, such as cooking and home improvement. One of the unique elements of this challenge is its multi-modal nature, where users receive both verbal guidance and visual instructions when a screen is available (e.g., on Echo Show devices). Some of the hard AI challenges the teams addressed included leveraging domain knowledge, tracking dialogue state, supporting adaptive and robust conversations, and, probably the most relevant to this conference, handling multi-modal interactions.

Data Science against COVID-19: The Valencian Experience

  • Nuria Oliver

This invited talk describes the work that a multi-disciplinary team of 20+ volunteer scientists did between March of 2020 and April of 2022, working very closely with the Presidency of the Valencian Government to support their decision-making during the COVID-19 pandemic in Spain. This team was known as the Data Science against COVID-19 taskforce. The team's work was structured in 4 areas: (1) large-scale human mobility modeling; (2) development of computational epidemiological models (metapopulation, individual and LSTM-based models); (3) development of predictive models of hospital and intensive care units' occupancy; and (4) a large-scale online citizen survey called the COVID19impactsurvey (https://covid19impactsurvey.org) with over 720,000 answers worldwide. This survey enabled us to shed light on the impact that the pandemic had on people's lives during the period of study [3,4,5]. In the talk, I will present the results obtained in each of these four areas, including winning the 500K XPRIZE Pandemic Response Challenge [1] and obtaining a best paper award at ECML-PKDD 2021 [2]. I will share the lessons learned in this very special initiative of collaboration between civil society at large (through the citizen survey), the scientific community (through the Data Science against COVID-19 taskforce) and a public administration (through our collaboration with the Presidency of the Valencian Government). For those interested in knowing more about this initiative, WIRED magazine published an extensive article describing the story of this effort: https://www.wired.co.uk/article/valencia-ai-covid-data

Grounding, Meaning and Foundation Models: Adventures in Multimodal Machine Learning

  • Douwe Kiela

In this talk I will present a vision for acquiring perceptually grounded meaning in machines, as a key next challenge for natural language processing. I will cover some recent work that tries to improve how we do model evaluation in multimodal settings, focusing on the new Adversarial VQA and Winoground evaluation datasets. After that, I will talk about our latest large-scale vision and language "foundation model", called FLAVA: a single holistic universal transformer that targets all modalities at once and that shows impressive performance on a wide range of tasks.

SESSION: Oral Session I: Engaging Users with Multimedia -- Emotional and Social Signals

A Multi-view Spectral-Spatial-Temporal Masked Autoencoder for Decoding Emotions with Self-supervised Learning

  • Rui Li
  • Yiting Wang
  • Wei-Long Zheng
  • Bao-Liang Lu

Affective brain-computer interfaces have achieved considerable advances, to the point that researchers can successfully interpret labeled and flawless EEG data collected in laboratory settings. However, the annotation of EEG data is time-consuming and requires a vast workforce, which limits its application in practical scenarios. Furthermore, daily collected EEG data may be partially damaged, since EEG signals are sensitive to noise. In this paper, we propose a Multi-view Spectral-Spatial-Temporal Masked Autoencoder (MV-SSTMA) with self-supervised learning to tackle these challenges towards daily applications. The MV-SSTMA is based on a multi-view CNN-Transformer hybrid structure, interpreting the emotion-related knowledge of EEG signals from spectral, spatial, and temporal perspectives. Our model consists of three stages: 1) In the generalized pre-training stage, channels of unlabeled EEG data from all subjects are randomly masked and later reconstructed to learn generic representations from EEG data; 2) In the personalized calibration stage, only a few labeled samples from a specific subject are used to calibrate the model; 3) In the personal test stage, our model can decode personal emotions from intact EEG data as well as damaged data with missing channels. Extensive experiments on two open emotional EEG datasets demonstrate that our proposed model achieves state-of-the-art performance on emotion recognition. In addition, under the abnormal circumstance of missing channels, the proposed model can still effectively recognize emotions.
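To make the channel-masking idea concrete, here is a minimal sketch of masked-channel pre-training, assuming a generic encoder-decoder rather than the authors' MV-SSTMA architecture; the module name, dimensions, and mask ratio are all illustrative.

```python
import torch
import torch.nn as nn

class ChannelMaskedAE(nn.Module):
    """Toy channel-masked autoencoder for EEG features (hypothetical, not MV-SSTMA)."""
    def __init__(self, n_channels=62, n_features=5, hidden=128):
        super().__init__()
        d_in = n_channels * n_features
        self.encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, d_in))

    def forward(self, x, mask_ratio=0.3):
        # x: (batch, channels, features); randomly zero out a subset of channels
        b, c, f = x.shape
        keep = (torch.rand(b, c, 1, device=x.device) > mask_ratio).float()
        x_masked = x * keep
        recon = self.decoder(self.encoder(x_masked.flatten(1))).view(b, c, f)
        # reconstruction loss only on the masked channels
        loss = ((recon - x) ** 2 * (1 - keep)).mean()
        return loss, recon

model = ChannelMaskedAE()
eeg = torch.randn(8, 62, 5)   # unlabeled EEG features from any subject
loss, _ = model(eeg)
loss.backward()               # one generalized pre-training step
```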

Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

  • Teng Sun
  • Wenjie Wang
  • Liqiang Jing
  • Yiran Cui
  • Xuemeng Song
  • Liqiang Nie

Existing studies on multimodal sentiment analysis heavily rely on the textual modality and unavoidably induce spurious correlations between textual words and sentiment labels, which greatly hinders model generalization. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis, which aims to estimate and mitigate the adverse effect of the textual modality for strong OOD generalization. To this end, we embrace causal inference, which inspects the causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributed to the direct effect of the textual modality on the model prediction, while the indirect effect is more reliable since it considers multimodal semantics. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of the textual modality via an extra text model and estimates the indirect effect by a multimodal model. During inference, we first estimate the direct effect by counterfactual inference, and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments show the superior effectiveness and generalization ability of our proposed framework.
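The subtraction step can be illustrated with a minimal, hypothetical sketch (not the paper's code): a multimodal model provides the total effect, a text-only model estimates the direct textual effect, and the debiased prediction is their difference at inference time; the toy models and the scaling factor alpha are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-ins (hypothetical): a multimodal model and a text-only branch.
class MMModel(nn.Module):
    def __init__(self, d=16, n_cls=3):
        super().__init__()
        self.fc = nn.Linear(3 * d, n_cls)
    def forward(self, t, v, a):
        return self.fc(torch.cat([t, v, a], dim=-1))

class TextModel(nn.Module):
    def __init__(self, d=16, n_cls=3):
        super().__init__()
        self.fc = nn.Linear(d, n_cls)
    def forward(self, t):
        return self.fc(t)

def debiased_logits(mm, txt, t, v, a, alpha=1.0):
    """Subtract the text-only (direct) effect from the multimodal (total) effect."""
    return mm(t, v, a) - alpha * txt(t)

t, v, a = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16)
pred = debiased_logits(MMModel(), TextModel(), t, v, a).argmax(-1)
```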

MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild

  • Yuanyuan Liu
  • Wei Dai
  • Chuanxu Feng
  • Wenbin Wang
  • Guanghao Yin
  • Jiabei Zeng
  • Shiguang Shan

Dynamic facial expression recognition (FER) databases provide important data support for affective computing and applications. However, most FER databases are annotated with several basic, mutually exclusive emotional categories and contain only one modality, e.g., videos. The monotonous labels and modality cannot accurately imitate human emotions or fulfill applications in the real world. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of the 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high quality of the labels, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm, and then obtain 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotions and emotion-related captions. Additionally, we also propose a novel Transformer-based expression snippet feature learning method to recognize compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/MAFW.

SER30K: A Large-Scale Dataset for Sticker Emotion Recognition

  • Shengzhe Liu
  • Xin Zhang
  • Jufeng Yang

With the popularity of instant messaging applications, online chatting plays an essential role in our daily life. The prevailing use of stickers to express emotions in online chatting leads to the necessity of multimodal sticker emotion recognition. Considering the lack of sticker emotion data, we collect a large-scale sticker emotion recognition dataset named SER30K. It consists of 1,887 sticker themes with a total of 30,739 sticker images. Some commonly used images, such as realistic images and facial expression images, have been well studied in the field of emotion analysis. However, it is still challenging to understand the emotion of sticker images. Since the characteristics of stickers from the same theme are similar, we can only accurately predict the emotion by capturing the local information (e.g., expressions, poses) and understanding the global information (e.g., relations among objects). To tackle this challenge, we propose a LOcal Re-Attention multimodal network (LORA) to learn sticker emotions in an end-to-end manner. Different from previous approaches using convolutional neural networks, LORA employs a vision transformer to extract visual features, which better captures global relations. In addition, we design a local re-attention module to focus on important region information. A simple but efficient modal fusion module then combines visual and language features. Extensive experiments are performed on SER30K and other emotion recognition datasets, demonstrating the effectiveness of our proposed method. Our code, model and dataset are released at https://github.com/nku-shengzheliu/SER30K.

SESSION: Poster Session I: Engaging Users with Multimedia -- Emotional and Social Signals

Representation Learning through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

  • Jicai Pan
  • Shangfei Wang
  • Lin Fang

Although temporal patterns inherent in visual and audio signals are crucial for affective video content analysis, they have not been thoroughly explored yet. In this paper, we propose a novel Temporal-Aware Multimodal (TAM) method to fully capture the temporal information. Specifically, we design a cross-temporal multimodal fusion module that applies attention-based fusion to different modalities within and across video segments. As a result, it fully captures the temporal relations between different modalities. Furthermore, a single emotion label provides insufficient supervision for learning the representation of each segment, making temporal pattern mining difficult. We leverage time-synchronized comments (TSCs) as auxiliary supervision, since these comments are easily accessible and contain rich emotional cues. Two TSC-based self-supervised tasks are designed: the first aims to predict the emotional words in a TSC from the video representation and the TSC's contextual semantics, and the second predicts the segment in which the TSC appears by calculating the correlation between the video representation and the TSC embedding. These self-supervised tasks are used to pre-train the cross-temporal multimodal fusion module on a large-scale video-TSC dataset crawled from the web without labeling costs. This pre-training prompts the fusion module to perform representation learning on segments containing TSCs, thus capturing more temporal affective patterns. Experimental results on three benchmark datasets show that the proposed fusion module achieves state-of-the-art results in affective video content analysis. Ablation studies verify that after TSC-based pre-training, the fusion module learns the affective patterns of more segments and achieves better performance.
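As a rough illustration of the second self-supervised task only, the sketch below scores each video segment by its similarity to a TSC embedding and trains with cross-entropy against the segment in which the comment appeared; the function, shapes, and similarity choice are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def tsc_segment_prediction_loss(seg_repr, tsc_emb, target_seg):
    """Score segments by cosine similarity to the TSC embedding and predict
    the segment where the comment appears (illustrative sketch).
    seg_repr: (batch, n_segments, dim), tsc_emb: (batch, dim), target_seg: (batch,)"""
    scores = torch.einsum('bsd,bd->bs',
                          F.normalize(seg_repr, dim=-1),
                          F.normalize(tsc_emb, dim=-1))
    return F.cross_entropy(scores, target_seg)

seg = torch.randn(4, 10, 256)      # 10 segment representations per video
tsc = torch.randn(4, 256)          # one TSC embedding per video
tgt = torch.randint(0, 10, (4,))   # index of the segment where the TSC appears
loss = tsc_segment_prediction_loss(seg, tsc, tgt)
```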

TFF-Former: Temporal-Frequency Fusion Transformer for Zero-training Decoding of Two BCI Tasks

  • Xujin Li
  • Wei Wei
  • Shuang Qiu
  • Huiguang He

Brain-computer interface (BCI) systems provide a direct connection between the human brain and external devices. Visual evoked BCI systems, including Event-related Potential (ERP) and Steady-state Visual Evoked Potential (SSVEP) systems, have attracted extensive attention because of their strong brain responses and wide applications. Previous studies have made some breakthroughs in within-subject decoding algorithms for specific tasks. However, there are two challenges in current decoding algorithms for BCI systems. First, current decoding algorithms cannot accurately classify EEG signals without calibration data from the new subject, yet the calibration procedure is time-consuming. Second, algorithms are tailored to extract features for one specific task, which limits their application across tasks. In this study, we propose a Temporal-Frequency Fusion Transformer (TFF-Former) for zero-training decoding across two BCI tasks. EEG data are organized into temporal-spatial and frequency-spatial forms, which can be considered as two views. In the TFF-Former framework, two symmetrical Transformer streams are designed to extract view-specific features. A cross-view module based on the cross-attention mechanism is proposed to guide each stream to strengthen common representations of features across EEG views. Additionally, an attention-based fusion module is built to fuse the representations from the two views effectively. A mean mask mechanism is applied to adaptively reduce the aggregation of redundant EEG tokens for the integration of common representations. We validated our method on a self-collected RSVP dataset and a benchmark SSVEP dataset. Experimental results demonstrate that our TFF-Former model achieves competitive performance compared with models in each of the above paradigms. It can further promote the application of visual evoked EEG-based BCI systems.

Towards Unbiased Visual Emotion Recognition via Causal Intervention

  • Yuedong Chen
  • Xu Yang
  • Tat-Jen Cham
  • Jianfei Cai

Although much progress has been made in visual emotion recognition, researchers have realized that modern deep networks tend to exploit dataset characteristics to learn spurious statistical associations between the input and the target. Such dataset characteristics are usually treated as dataset bias, which damages the robustness and generalization performance of these recognition systems. In this work, we scrutinize this problem from the perspective of causal inference, where such a dataset characteristic is termed a confounder that misleads the system into learning the spurious correlation. To alleviate the negative effects brought by the dataset bias, we propose a novel Interventional Emotion Recognition Network (IERN) to achieve the backdoor adjustment, a fundamental deconfounding technique in causal inference. Specifically, IERN starts by disentangling the dataset-related context feature from the actual emotion feature, where the former forms the confounder. The emotion feature is then forced to see each confounder stratum equally before being fed into the classifier. A series of designed tests validate the efficacy of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms state-of-the-art approaches for unbiased visual emotion recognition.
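For intuition on "seeing each confounder stratum equally", here is a hypothetical sketch of backdoor-style adjustment that averages class predictions over confounder strata weighted by their prior; the additive combination of features and prototypes, the classifier, and all shapes are assumptions rather than IERN's actual design.

```python
import torch
import torch.nn as nn

def backdoor_adjusted_logits(emotion_feat, confounder_protos, prior, classifier):
    """Combine the emotion feature with every confounder stratum prototype and
    average the class predictions weighted by the stratum prior (illustrative).
    emotion_feat: (batch, d); confounder_protos: (n_strata, d); prior: (n_strata,)"""
    b, d = emotion_feat.shape
    n = confounder_protos.shape[0]
    feats = emotion_feat.unsqueeze(1) + confounder_protos.unsqueeze(0)   # (b, n, d)
    logits = classifier(feats.reshape(b * n, d)).reshape(b, n, -1)       # (b, n, classes)
    return (logits * prior.view(1, n, 1)).sum(dim=1)                     # expectation over strata

clf = nn.Linear(64, 8)              # 8 emotion classes (toy)
x = torch.randn(16, 64)             # disentangled emotion features
protos = torch.randn(5, 64)         # 5 confounder strata (toy prototypes)
prior = torch.full((5,), 0.2)       # P(c), uniform here
y = backdoor_adjusted_logits(x, protos, prior, clf)
```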

Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation

  • Michal Balazia
  • Philipp Müller
  • Ákos Levente Tánczos
  • August von Liechtenstein
  • François Brémond

Body language is an eye-catching social signal, and its automatic analysis can significantly advance the ability of artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress in low-level tasks like head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper we present BBSI, the first set of annotations of complex Bodily Behaviors embedded in continuous Social Interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatio-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module and Swin Transformer. Results are promising and indicate considerable room for improvement in this difficult task. Representing a key piece of the puzzle towards automatic understanding of social behavior, BBSI is fully available to the research community.

Learning from Label Relationships in Human Affect

  • Niki Maria Foteinopoulou
  • Ioannis Patras

Automated estimation of human affect and mental state faces a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints), and (very) long in-the-wild videos. For these reasons, deep learning methodologies tend to overfit, that is, arrive at latent representations with poor generalisation performance on the final regression task. To overcome this, in this work we introduce two complementary contributions. First, we introduce a novel relational loss for multilabel regression and ordinal problems that regularises learning and leads to better generalisation. The proposed loss uses label vector inter-relational information to learn better latent representations by aligning batch label distances to the distances in the latent feature space. Second, we utilise a two-stage attention architecture that estimates a target for each clip by using features from the neighbouring clips as temporal context. We evaluate the proposed methodology on both continuous affect and schizophrenia severity estimation problems, as there are methodological and contextual parallels between the two. Experimental results demonstrate that the proposed methodology outperforms baselines trained with the supervised regression loss, as well as pre-training the network architecture with an unsupervised contrastive loss. In the domain of schizophrenia, the proposed methodology outperforms the previous state-of-the-art by a large margin, achieving a PCC of up to 78%, close to the performance of human experts (85%) and much higher than previous works (an uplift of up to 40%). In the case of affect recognition, we outperform previous vision-based methods in terms of CCC on both the OMG and the AMIGOS datasets. Specifically for AMIGOS, we outperform the previous SoTA CCC for both arousal and valence by 9% and 13% respectively, and on the OMG dataset we outperform previous vision-based works by up to 5% for both arousal and valence.
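A minimal sketch of the label-distance alignment idea, under my own reading rather than the authors' code: the pairwise distance matrix of the latent features in a batch is regressed towards the pairwise distance matrix of the label vectors; the normalisation and the use of squared distances are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_sq_dist(x):
    # squared Euclidean distances between all pairs of rows in x
    diff = x.unsqueeze(1) - x.unsqueeze(0)
    return diff.pow(2).sum(-1)

def relational_loss(features, labels):
    """Align batch-wise feature distances with label distances (illustrative)."""
    feat_d = pairwise_sq_dist(features)
    label_d = pairwise_sq_dist(labels.float())
    # normalise both matrices so their scales are comparable
    feat_d = feat_d / (feat_d.max() + 1e-8)
    label_d = label_d / (label_d.max() + 1e-8)
    return F.mse_loss(feat_d, label_d)

z = torch.randn(32, 128, requires_grad=True)   # clip-level latent features
y = torch.rand(32, 2)                          # e.g. valence/arousal labels
relational_loss(z, y).backward()
```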

Brain Topography Adaptive Network for Satisfaction Modeling in Interactive Information Access System

  • Ziyi Ye
  • Xiaohui Xie
  • Yiqun Liu
  • Zhihong Wang
  • Xuesong Chen
  • Min Zhang
  • Shaoping Ma

With the growth of information on the Web, most users heavily rely on information access systems (e.g., search engines, recommender systems, etc.) in their daily lives. In this process, modeling users' satisfaction status plays an essential part in improving their experience with these systems. In this paper, we aim to explore the benefits of using Electroencephalography (EEG) signals for satisfaction modeling in interactive information access system design. Different from existing EEG classification tasks, the emergence of satisfaction involves multiple brain functions, such as arousal, prototypicality, and appraisals, which are related to different brain topographical areas. Modeling user satisfaction therefore raises great challenges for existing solutions. To address this challenge, we propose BTA, a Brain Topography Adaptive network with a multi-centrality encoding module and a spatial attention mechanism module to capture cognitive connectivity at different spatial distances. We explore the effectiveness of BTA for satisfaction modeling in two popular information access scenarios, i.e., search and recommendation. Extensive experiments on two real-world datasets verify the effectiveness of introducing a brain topography adaptive strategy into satisfaction modeling. Furthermore, we also conduct a search result re-ranking task and a video rating prediction task based on the satisfaction inferred from brain signals in the search and recommendation scenarios, respectively. Experimental results show that brain signals extracted with BTA help improve the performance of interactive information access systems significantly.

DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos

  • Yan Wang
  • Yixuan Sun
  • Wei Song
  • Shuyong Gao
  • Yiwen Huang
  • Zhaoyu Chen
  • Weifeng Ge
  • Wenqiang Zhang

Current works on facial expression learning in videos consume significant computational resources to learn spatial-channel feature representations and temporal relationships. To mitigate this issue, we propose a Dual Path multi-excitation Collaborative Network (DPCNet) that learns the information critical for facial expression representation from fewer keyframes in videos. Specifically, the DPCNet learns the important regions and keyframes from a tuple of four view-grouped frames by multi-excitation modules and produces dual-path representations of one video with consistency under two regularization strategies. A spatial-frame excitation module and a channel-temporal aggregation module are introduced consecutively to learn spatial-frame representations and generate complementary channel-temporal aggregation, respectively. Moreover, we design a multi-frame regularization loss to enforce the representations of multiple frames in the dual view to be semantically coherent. To obtain consistent prediction probabilities from the dual path, we further propose a dual-path regularization loss that minimizes the divergence between the distributions of the two-path embeddings. Extensive experiments and ablation studies show that DPCNet significantly improves the performance of video-based FER and achieves state-of-the-art results on the large-scale DFEW dataset.

Pursuing Knowledge Consistency: Supervised Hierarchical Contrastive Learning for Facial Action Unit Recognition

  • Yingjie Chen
  • Chong Chen
  • Xiao Luo
  • Jianqiang Huang
  • Xian-Sheng Hua
  • Tao Wang
  • Yun Liang

With the increasing need for emotion analysis, facial action unit (AU) recognition has attracted much more attention as a fundamental task for affective computing. Although deep learning has boosted the performance of AU recognition to a new level in recent years, it remains challenging to extract subject-consistent representations since the appearance changes caused by AUs are subtle and ambiguous among subjects. We observe that there are three kinds of inherent relations among AUs, which can be treated as strong prior knowledge, and pursuing the consistency of such knowledge is the key to learning subject-consistent representations. To this end, we propose a supervised hierarchical contrastive learning method (SupHCL) for AU recognition to pursue knowledge consistency among different facial images and different AUs, which is orthogonal to methods focusing on network architecture design. Specifically, SupHCL contains three relation consistency modules, i.e., unary, binary, and multivariate relation consistency modules, which take the corresponding kind of inherent relations as extra supervision to encourage knowledge-consistent distributions of both AU-level and image-level representations. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, demonstrate the effectiveness of each relation consistency module and the superiority of SupHCL.

Unsupervised Domain Adaptation Integrating Transformer and Mutual Information for Cross-Corpus Speech Emotion Recognition

  • Shiqing Zhang
  • Ruixin Liu
  • Yijiao Yang
  • Xiaoming Zhao
  • Jun Yu

This paper focuses on an interesting task, i.e., unsupervised cross-corpus Speech Emotion Recognition (SER), in which the labelled training (source) corpus and the unlabelled testing (target) corpus have different feature distributions, resulting in a discrepancy between the source and target domains. To address this issue, this paper proposes an unsupervised domain adaptation method integrating Transformers and Mutual Information (MI) for cross-corpus SER. Initially, our method employs encoder layers of Transformers to capture long-term temporal dynamics in an utterance from the extracted segment-level log-Mel spectrogram features, thereby producing the corresponding utterance-level features for each utterance in the two domains. Then, we propose an unsupervised feature decomposition method with a hybrid Max-Min MI strategy to separately learn domain-invariant features and domain-specific features from the extracted mixed utterance-level features, in which the discrepancy between the two domains is eliminated as much as possible while their individual characteristics are preserved. Finally, an interactive Multi-Head attention fusion strategy is designed to learn the complementarity between domain-invariant features and domain-specific features so that they can be interactively fused for SER. Extensive experiments on the IEMOCAP and MSP-Improv datasets demonstrate the effectiveness of our proposed method on unsupervised cross-corpus SER tasks, outperforming state-of-the-art unsupervised cross-corpus SER methods.

Co-Completion for Occluded Facial Expression Recognition

  • Zhen Xing
  • Weimin Tan
  • Ruian He
  • Yangle Lin
  • Bo Yan

The existence of occlusions introduces semantically irrelevant visual patterns and leads to the loss of content in occluded regions. Although previous works have made improvements in occluded facial expression recognition, they do not explicitly handle the aforementioned interference factors. In this paper, we propose an intuitive and simplified workflow, Co-Completion, which combines occlusion discarding and feature completion to reduce the impact of occlusions on facial expression recognition. To protect key features from being contaminated and reduce the dependency of feature completion on occlusion discarding, guidance from discriminative regions is also introduced for joint feature completion. Moreover, we release the COO-RW database for occlusion simulation and refine the occlusion generation protocol for fair comparison in this field. Experiments on synthetic and realistic databases demonstrate the superiority of our method. The COO-RW database can be downloaded from https://github.com/loveSmallOrange/COO-RW.

Generalized Inter-class Loss for Gait Recognition

  • Weichen Yu
  • Hongyuan Yu
  • Yan Huang
  • Liang Wang

Gait recognition is a unique biometric technique that can be performed at a long distance non-cooperatively and has broad applications in public safety and intelligent traffic systems. Previous gait works focus more on minimizing the intra-class variance while ignoring the significance of constraining the inter-class variance. To this end, we propose a generalized inter-class loss that addresses the inter-class variance from both the sample-level and class-level feature distributions. Instead of an equal penalty strength on pair scores, the proposed loss optimizes the sample-level inter-class feature distribution by dynamically adjusting the pairwise weight. Further, at the class level, the proposed loss adds a constraint on the uniformity of the inter-class feature distribution, which forces the feature representations to approximate a hypersphere and keep maximal inter-class variance. In addition, the proposed method automatically adjusts the margin between classes, which enables the inter-class feature distribution to be more flexible. The proposed method can be generalized to different gait recognition networks and achieves significant improvements. We conduct a series of experiments on CASIA-B and OUMVLP, and the experimental results show that the proposed loss significantly improves performance and achieves state-of-the-art results.
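To illustrate what a uniformity constraint on a hypersphere can look like, here is a generic sketch (an assumption in the spirit of standard uniformity losses, not the paper's exact formulation): L2-normalised class centers are pushed to spread out on the unit sphere, keeping inter-class variance large.

```python
import torch
import torch.nn.functional as F

def interclass_uniformity_loss(class_centers, t=2.0):
    """Encourage normalised class centers to spread uniformly on the unit
    hypersphere by penalising small pairwise distances (illustrative only)."""
    c = F.normalize(class_centers, dim=-1)
    sq_dist = (c.unsqueeze(1) - c.unsqueeze(0)).pow(2).sum(-1)   # pairwise squared distances
    n = c.shape[0]
    off_diag = sq_dist[~torch.eye(n, dtype=torch.bool)]          # drop self-distances
    return off_diag.mul(-t).exp().mean().log()

centers = torch.randn(74, 256, requires_grad=True)   # e.g. one center per gait class (toy)
interclass_uniformity_loss(centers).backward()
```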

Feeling Without Sharing: A Federated Video Emotion Recognition Framework Via Privacy-Agnostic Hybrid Aggregation

  • Fan Qi
  • Zixin Zhang
  • Xianshan Yang
  • Huaiwen Zhang
  • Changsheng Xu

The explosion of video data brings new opportunities and challenges for emotion recognition. Video emotion applications have great commercial value, but their potential to involve illegal snooping on personal feelings has led to controversy over privacy protection. The federated learning (FL) paradigm can substantially address the growing public concerns about data privacy in video emotion recognition. However, conventional FL methods perform poorly due to the uniqueness of the task: the data are heterogeneous across clients due to emotional label skew and cross-culture expression differences. To mitigate the heterogeneous data, we propose EmoFed, a practical framework for federated video-based emotion recognition via multi-group clustering and privacy-agnostic hybrid aggregation. It yields a generically applicable and improved model while protecting privacy, training local models under group-aware personalized aggregation. To further encourage communicating comprehensive and privacy-agnostic information among clients, we upload the model parameters of both the global layers and the personalization layers to the server. We apply homomorphic encryption to the personalization layers, which incurs no loss in learning accuracy since no noise is added to the model updates during the encryption/decryption process. The proposed method works on video-based emotion recognition tasks to predict both actors' emotional expressions and the emotions induced in viewers. Extensive experiments and ablation studies on four benchmarks demonstrate the efficacy and practicability of our method.

Self-Paced Label Distribution Learning for In-The-Wild Facial Expression Recognition

  • Jianjian Shao
  • Zhenqian Wu
  • Yuanyan Luo
  • Shudong Huang
  • Xiaorong Pu
  • Yazhou Ren

Label distribution learning (LDL) has achieved great progress in facial expression recognition (FER), where generating the label distribution is a key procedure for LDL-based FER. However, much existing research has shown that noisy samples are a common problem in FER, especially on in-the-wild datasets. This issue may lead to generating unreliable label distributions (which can be seen as label noise), and will further negatively affect the FER model. To this end, we propose a plug-and-play method of self-paced label distribution learning (SPLDL) for in-the-wild FER. Specifically, a simple yet efficient label distribution generator is adopted to generate label distributions to guide label distribution learning. We then introduce the self-paced learning (SPL) paradigm and develop a novel self-paced label distribution learning strategy, which considers both classification losses and distribution losses. SPLDL first learns easy samples with reliable label distributions and gradually steps to complex ones, effectively suppressing the negative impact introduced by noisy samples and unreliable label distributions. Extensive experiments on in-the-wild FER datasets (i.e., RAF-DB and AffectNet) with three backbone networks demonstrate the effectiveness of the proposed method.
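The "easy first, hard later" scheduling can be sketched with the classic self-paced weighting rule below; this is a generic simplification using only a classification loss, with the age parameter `lam` as an assumed hyperparameter, not SPLDL's full strategy.

```python
import torch
import torch.nn.functional as F

def self_paced_weights(losses, lam):
    """Classic self-paced rule (sketch): keep samples whose loss is below the
    age parameter `lam` (weight 1), drop harder ones (weight 0); `lam` grows
    during training so the model moves from easy to complex samples."""
    return (losses < lam).float()

logits = torch.randn(16, 7)                 # 7 expression classes (toy)
targets = torch.randint(0, 7, (16,))
per_sample = F.cross_entropy(logits, targets, reduction='none')
w = self_paced_weights(per_sample.detach(), lam=1.5)
loss = (w * per_sample).sum() / w.sum().clamp(min=1.0)
```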

Uncertainty-Aware Semi-Supervised Learning of 3D Face Rigging from Single Image

  • Yong Zhao
  • Haifeng Chen
  • Hichem Sahli
  • Ke Lu
  • Dongmei Jiang

We present a method to rig 3D faces via Action Units (AUs), viewpoint and light direction from a single input image. Existing 3D methods for face synthesis and animation rely heavily on the 3D morphable model (3DMM), which is built on 3D data and cannot provide intuitive expression parameters, while AU-driven 2D methods cannot handle head pose and lighting effects. We bridge the gap by integrating a recent 3D reconstruction method with a 2D AU-driven method in a semi-supervised fashion. Built upon an auto-encoding 3D face reconstruction model that decouples depth, albedo, viewpoint and light without any supervision, we further decouple expression from identity for depth and albedo with a novel conditional feature translation module and pretrained critics for AU intensity estimation and image classification. Novel objective functions are designed using unlabeled in-the-wild images and indoor images with AU labels. We also leverage uncertainty losses to model the potentially changing AU regions of images as input noise for synthesis, and to model the noisy AU intensity labels for intensity estimation by the AU critic. Experiments on face editing and animation on four datasets show that, compared with six state-of-the-art methods, our proposed method is superior and effective in terms of expression consistency, identity similarity and pose similarity.

A Unified Framework against Topology and Class Imbalance

  • Junyu Chen
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

The Area Under the ROC curve (AUC) is widely used as an evaluation metric in various applications. Due to its insensitivity to class distribution, directly optimizing AUC performs well on the class imbalance problem. However, existing AUC optimization methods are limited to regular data such as text, images, and video. AUC optimization on graph data, which is ubiquitous and important, is seldom studied. Different from regular data, AUC optimization on graphs suffers not only from class imbalance but also from topology imbalance. To solve this complicated imbalance problem, we propose a unified topology-aware AUC optimization (TOPOAUC) framework, which can simultaneously deal with the topology and class imbalance problems in graph learning. We develop a multi-class AUC optimization method to deal with the class imbalance problem. With respect to topology imbalance, we propose a Topology-Aware Importance Learning (TAIL) mechanism, which considers the topology of pairwise nodes and the different contributions of topology information to pairwise node neighbors. Extensive experiments on three real-world datasets demonstrate the effectiveness of our proposed method.
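For readers unfamiliar with AUC optimization, the core idea is to score every positive/negative pair rather than individual samples, which is why it is insensitive to class distribution. Below is a generic pairwise surrogate sketch (a squared hinge over score differences); it is an illustration of AUC-style losses in general, not TOPOAUC's multi-class or topology-aware formulation.

```python
import torch

def pairwise_auc_surrogate(scores_pos, scores_neg, gamma=1.0):
    """Penalise every positive/negative pair whose scores are ranked in the
    wrong order, using a squared hinge on the margin gamma (illustrative)."""
    diff = scores_pos.unsqueeze(1) - scores_neg.unsqueeze(0)   # (n_pos, n_neg)
    return torch.clamp(gamma - diff, min=0).pow(2).mean()

pos = torch.randn(10, requires_grad=True)   # scores of minority-class nodes
neg = torch.randn(90, requires_grad=True)   # scores of majority-class nodes
pairwise_auc_surrogate(pos, neg).backward()
```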

Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning

  • Yang Yu
  • Dong Zhang
  • Shoushan Li

Multi-modal sentiment analysis (MSA) has become increasingly attractive in both academia and industry. Conventional studies normally require massive labeled data to train deep neural models. To alleviate this issue, in this paper we conduct few-shot MSA with quite a small number of labeled samples. Inspired by the success of textual prompt-based fine-tuning (PF) approaches in the few-shot scenario, we introduce a multi-modal prompt-based fine-tuning (MPF) approach. To narrow the semantic gap between language and vision, we propose unified pre-training for multi-modal prompt-based fine-tuning (UP-MPF) with two stages. First, in the unified pre-training stage, we employ a simple and effective task to obtain coherent vision-language representations from fixed pre-trained language models (PLMs), i.e., predicting the rotation direction of the input image with a prompt phrase as concurrent input. Second, in multi-modal prompt-based fine-tuning, we freeze the visual encoder to reduce the number of trainable parameters, which further facilitates few-shot MSA. Extensive experiments and analysis on three coarse-grained and three fine-grained MSA datasets demonstrate the better performance of our UP-MPF against state-of-the-art PF, MSA, and multi-modal pre-training approaches.

Temporal Sentiment Localization: Listen and Look in Untrimmed Videos

  • Zhicheng Zhang
  • Jufeng Yang

Video sentiment analysis aims to uncover the underlying attitudes of viewers and has a wide range of applications in the real world. Existing works simply classify a video into a single sentimental category, ignoring the fact that sentiment in untrimmed videos may appear in multiple segments with varying lengths and unknown locations. To address this, we propose a challenging task, i.e., Temporal Sentiment Localization (TSL), to find which parts of a video convey sentiment. To systematically investigate fully- and weakly-supervised settings for TSL, we first build a benchmark dataset named TSL-300, which consists of 300 videos with a total length of 1,291 minutes. Each video is labeled in two ways, one of which is frame-by-frame annotation for the fully-supervised setting, and the other is single-frame annotation, i.e., only a single frame with strong sentiment is labeled per segment, for the weakly-supervised setting. Due to the high cost of labeling a densely annotated dataset, we propose TSL-Net in this work, employing single-frame supervision to localize sentiment in videos. In detail, we generate pseudo labels for unlabeled frames using a greedy search strategy, and fuse the affective features of both visual and audio modalities to predict the temporal sentiment distribution. Here, a reverse mapping strategy is designed for feature fusion, and a contrastive loss is utilized to maintain the consistency between the original feature and the reverse prediction. Extensive experiments show the superiority of our method against state-of-the-art approaches.

VigilanceNet: Decouple Intra- and Inter-Modality Learning for Multimodal Vigilance Estimation in RSVP-Based BCI

  • Xinyu Cheng
  • Wei Wei
  • Changde Du
  • Shuang Qiu
  • Sanli Tian
  • Xiaojun Ma
  • Huiguang He

Recently, brain-computer interface (BCI) technology has made impressive progress and has been developed for many applications. Among them, BCI systems based on rapid serial visual presentation (RSVP) are a promising information detection technology. However, the effectiveness of RSVP is closely related to the user's performance, which can be influenced by their vigilance level. Therefore, it is crucial to detect vigilance levels in RSVP-based BCI. In this paper, we conducted a long-term RSVP target detection experiment to collect electroencephalography (EEG) and electrooculogram (EOG) data at different vigilance levels. In addition, to estimate vigilance levels in RSVP-based BCI, we propose a multimodal method named VigilanceNet using EEG and EOG. First, we define multiplicative relationships over conventional EOG features, which better describe the relationships between EOG features, and design an outer product embedding module to extract these multiplicative relationships. Second, we propose to decouple the learning of intra- and inter-modality information to improve multimodal learning. Specifically, for intra-modality, we introduce an intra-modality representation learning (intra-RL) method to obtain effective representations of each modality by letting each modality independently predict vigilance levels during the multimodal training process. For inter-modality, we employ a cross-modal Transformer based on cross-attention to capture the complementary information between EEG and EOG, which attends only to the inter-modality relations. Extensive experiments and ablation studies are conducted on the RSVP and SEED-VIG public datasets. The results demonstrate the effectiveness of the method in terms of regression error and correlation.
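As a small illustration of what an outer-product embedding of EOG features can look like (my sketch, with hypothetical feature counts and projection, not the paper's module): the pairwise products capture multiplicative relationships between conventional EOG features, and the flattened matrix is used as the embedding.

```python
import torch
import torch.nn as nn

def outer_product_embedding(eog_feats):
    """Pairwise products eog_i * eog_j as a multiplicative-relationship embedding.
    eog_feats: (batch, n_features) -> (batch, n_features ** 2)"""
    outer = torch.einsum('bi,bj->bij', eog_feats, eog_feats)
    return outer.flatten(1)

eog = torch.randn(8, 36)                           # 36 conventional EOG features (toy)
emb = outer_product_embedding(eog)                 # (8, 1296)
proj = nn.Linear(emb.shape[1], 128)(emb)           # project to a compact representation
```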

EASE: Robust Facial Expression Recognition via Emotion Ambiguity-SEnsitive Cooperative Networks

  • Lijuan Wang
  • Guoli Jia
  • Ning Jiang
  • Haiying Wu
  • Jufeng Yang

Facial Expression Recognition (FER) plays a crucial role in real-world applications. However, large-scale FER datasets collected in the wild usually contain noise. More importantly, due to the ambiguity of emotion, facial images with multiple emotions are hard to distinguish from those with noisy labels. Therefore, it is challenging to train a robust model for FER. To address this, we propose Emotion Ambiguity-SEnsitive cooperative networks (EASE), which contain two components. First, the ambiguity-sensitive learning module divides the training samples into three groups. The samples with small losses in both networks are considered clean samples, and the ones with large losses are noisy. For the conflicting samples on which one network disagrees with the other, we distinguish the samples conveying ambiguous emotions from the ones with noise, using the polarity cues of emotions. Here, we utilize KL divergence to optimize the networks, enabling them to pay attention to the non-dominant emotions. The second part of EASE aims to enhance the diversity of the cooperative networks. As the training epochs increase, the cooperative networks tend to converge to a consensus. We construct a penalty term based on the correlation between the features, which helps the networks learn diverse representations from the images. Extensive experiments on 6 popular facial expression datasets demonstrate that EASE outperforms state-of-the-art approaches.

Mimicking the Annotation Process for Recognizing the Micro Expressions

  • Bo-Kai Ruan
  • Ling Lo
  • Hong-Han Shuai
  • Wen-Huang Cheng

Micro-expression recognition (MER) has recently become a popular research topic due to its wide applications, e.g., movie rating and recognizing neurological disorders. By virtue of deep learning techniques, the performance of MER has been significantly improved, reaching unprecedented results. This paper proposes a novel architecture that mimics how the expressions are annotated. Specifically, during the annotation process of several datasets, the AU labels are first obtained with FACS, and the expression labels are then decided based on the combinations of the AU labels. Meanwhile, these AU labels describe either eye or mouth movements (mutually exclusive). Following this idea, we design a dual-branch structure with a new augmentation method to separately capture the eye and mouth features and teach the model what the general expressions should be. Moreover, to adaptively fuse the area features for different expressions, we propose an Area Weighted Module to assign different weights to each region. Additionally, we set up an auxiliary task to align the AU similarity scores to help our model further capture facial patterns with AU labels. The proposed approach outperforms other state-of-the-art methods in terms of accuracy on the CASME II and SAMM datasets. Moreover, we provide a new visualization approach to show the relationship between facial regions and AU features.

SESSION: Oral Session II: Engaging User with Multimedia -- Multimedia Search and Recommendation

Machine Unlearning for Image Retrieval: A Generative Scrubbing Approach

  • Peng-Fei Zhang
  • Guangdong Bai
  • Zi Huang
  • Xin-Shun Xu

Data owners have the right to request the deletion of their data from a machine learning (ML) model. In response, a naïve way is to retrain the model with the original dataset excluding the data to forget, which is however unrealistic, as the required dataset may no longer be available and the retraining process is usually computationally expensive. To cope with this reality, machine unlearning has recently attracted much attention; it aims to enable data removal from a trained ML model in response to deletion requests, without retraining the model from scratch or requiring full access to the original training dataset. Existing unlearning methods mainly focus on handling conventional ML methods, while unlearning deep neural network (DNN) based models remains underexplored, especially for models trained on large-scale datasets.

In this paper, we make the first attempt to realize data forgetting on deep models for image retrieval. Image retrieval aims to search for data relevant to the query according to similarity measures. Intuitively, unlearning a deep image retrieval model can be achieved by breaking down its ability to model similarity on the data to forget. To this end, we propose a generative scrubbing (GS) method that learns a generator to craft noisy data to manipulate the model weights. A novel framework is designed consisting of the generator and the target retrieval model, where a pair of coupled static and dynamic learning procedures are performed simultaneously. This learning strategy effectively enables the generated noisy data to erase the model's memory of the data to forget whilst retaining the information of the remaining data. Extensive experiments on three widely-used datasets verify the effectiveness of the proposed method.

Partially Relevant Video Retrieval

  • Jianfeng Dong
  • Xianke Chen
  • Minsong Zhang
  • Xun Yang
  • Shujie Chen
  • Xirong Li
  • Xun Wang

Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two are to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used for improving video corpus moment retrieval.
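The bag-of-clips/bag-of-frames intuition can be sketched as follows; this is an illustration of MIL-style partial-relevance scoring under my own assumptions (cosine similarity, max pooling, a mixing weight alpha), not the MS-SL network itself.

```python
import torch
import torch.nn.functional as F

def partial_relevance_score(query, clip_feats, frame_feats, alpha=0.5):
    """A video is a bag of clips and a bag of frames; the text query only needs
    to match the best clip/frame, so similarities are max-pooled over each bag.
    query: (d,), clip_feats: (n_clips, d), frame_feats: (n_frames, d)"""
    q = F.normalize(query, dim=-1)
    clip_sim = (F.normalize(clip_feats, dim=-1) @ q).max()
    frame_sim = (F.normalize(frame_feats, dim=-1) @ q).max()
    return alpha * clip_sim + (1 - alpha) * frame_sim

q = torch.randn(256)
score = partial_relevance_score(q, torch.randn(12, 256), torch.randn(96, 256))
```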

From Abstract to Details: A Generative Multimodal Fusion Framework for Recommendation

  • Fangxiong Xiao
  • Lixi Deng
  • Jingjing Chen
  • Houye Ji
  • Xiaorui Yang
  • Zhuoye Ding
  • Bo Long

In E-commerce recommendation, Click-Through Rate (CTR) prediction has been extensively studied in both academia and industry to enhance user experience and platform benefits. At present, most popular CTR prediction methods are concatenation-based models that represent items by simply merging multiple heterogeneous features, including ID, visual, and text features, into a large vector. As these heterogeneous modalities have moderately different properties, directly concatenating them without mining their correlation and reducing the redundancy is unlikely to achieve the optimal fusion result. Besides, these concatenation-based models treat all modalities equally for each user and overlook the fact that users tend to pay unequal attention to information from various modalities when browsing items in real scenarios. To address the above issues, this paper proposes a generative multimodal fusion framework (GMMF) for the CTR prediction task. To eliminate the redundancy and strengthen the complementarity of multimodal features, GMMF generates new visual and text representations with a Difference-Set network (DSN). These representations are non-overlapping with the information conveyed by the ID embedding. Specifically, DSN maps the ID embedding into the visual and text modalities and depicts the difference between the multiple modalities based on their properties. Besides, GMMF learns unequal weights for the multiple modalities with a Modal-Interest network (MIN) that models users' preferences over heterogeneous modalities. These weights reflect the usual habits and hobbies of users. Finally, we conduct extensive experiments on both public and collected industrial datasets, and the results show that GMMF greatly improves CTR prediction and achieves state-of-the-art performance.

Bi-directional Heterogeneous Graph Hashing towards Efficient Outfit Recommendation

  • Weili Guan
  • Xuemeng Song
  • Haoyu Zhang
  • Meng Liu
  • Chung-Hsing Yeh
  • Xiaojun Chang

Personalized outfit recommendation, which aims to recommend outfits to a given user according to his/her preference, has gained increasing research attention due to its economic value. Nevertheless, the majority of existing methods mainly focus on improving the recommendation effectiveness, while overlooking the recommendation efficiency. Inspired by this, we devise a novel bi-directional heterogeneous graph hashing scheme, called BiHGH, towards efficient personalized outfit recommendation. In particular, this scheme consists of three key components: heterogeneous graph node initialization, bi-directional sequential graph convolution, and hash code learning. We first unify four types of entities (i.e., users, outfits, items, and attributes) and their relations via a heterogeneous four-partite graph. To perform graph learning, we then creatively devise a bi-directional graph convolution algorithm to sequentially transfer knowledge via repeated upwards and downwards convolution, whereby we divide the four-partite graph into three subgraphs and each subgraph only involves two adjacent entity types. We ultimately adopt the Bayesian personalized ranking loss for user preference learning and design a dual similarity preserving regularization to prevent information loss during hash learning. Extensive experiments on the benchmark dataset demonstrate the superiority of BiHGH.

Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning

  • Meiyu Liang
  • Junping Du
  • Xiaowen Cao
  • Yang Yu
  • Kangkang Lu
  • Zhe Xue
  • Min Zhang

Deep cross-media hashing technology provides an efficient cross-media representation learning solution for cross-media search. However, existing methods do not consider both fine-grained semantic features and semantic structures to mine implicit cross-media semantic associations, which leads to weaker semantic discrimination and consistency of cross-media representations. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). First, in order to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, so that the learned saliency features of different modalities are more conducive to cross-media semantic alignment and fusion. Second, to further improve the learning of implicit cross-media semantic associations, a semantic label association graph is constructed, and a graph convolutional network is utilized to mine the implicit semantic structures, thus guiding the learning of discriminative features of different modalities. Third, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of different modal representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, so as to obtain cross-media unified representations with stronger discriminativeness and semantic consistency preserving power. Extensive experiments on several cross-media benchmark datasets demonstrate that the proposed SCAHN outperforms state-of-the-art methods.

Cross-Domain 3D Model Retrieval Based On Contrastive Learning And Label Propagation

  • Dan Song
  • Yue Yang
  • Weizhi Nie
  • Xuanya Li
  • An-An Liu

In this work, we aim to tackle the task of unsupervised image based 3D model retrieval, where we seek to retrieve unlabeled 3D models that are most visually similar to the 2D query image. Due to the challenging modality gap between 2D images and 3D models, existing mainstream methods adopt domain-adversarial techniques to eliminate the gap, which cannot guarantee category-level alignment that is important for retrieval performance. Recent methods align the class centers of 2D images and 3D models to pay attention to the category-level alignment. However, there still exist two main issues: 1) the category-level alignment is too rough, and 2) the category prediction of unlabeled 3D models is not accurate. To overcome the first problem, we utilize contrastive learning for fine-grained category-level alignment across domains, which pulls both prototypes and samples with the same semantic information closer and pushes those with different semantic information apart. To provide reliable semantic prediction for contrastive learning and also address the second issue, we propose the consistent decision for pseudo labels of 3D models based on both the trained image classifier and label propagation. Experiments are carried out on MI3DOR and MI3DOR-2 datasets, and the results demonstrate the effectiveness of our proposed method.

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

  • Zhixin Ma
  • Chong Wah Ngo

Known-item video search is effective with a human in the loop to interactively investigate the search results and refine the initial query. Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is hidden deep in the ranked list, finding the known-item target usually requires a long duration of browsing and result inspection. This paper tackles the problem with reinforcement learning, aiming to reach a search target within a few rounds of interaction by long-term learning from user feedback. Specifically, the system interactively plans a navigation path based on feedback and recommends a potential target that maximizes the long-term reward for the user to comment on. We conduct experiments on the challenging task of video corpus moment retrieval (VCMR), i.e., localizing moments from a large video corpus. The experimental results on the TVR and DiDeMo datasets verify that our proposed work is effective in retrieving the moments that are hidden deep inside the ranked lists of CONQUER and HERO, which are the state-of-the-art auto-search engines for VCMR.

Hierarchical Graph Embedded Pose Regularity Learning via Spatio-Temporal Transformer for Abnormal Behavior Detection

  • Chao Huang
  • Yabo Liu
  • Zheng Zhang
  • Chengliang Liu
  • Jie Wen
  • Yong Xu
  • Yaowei Wang

Abnormal behavior detection in surveillance video is a fundamental task in modern public security. Different from typical pixel-based solutions, pose-based approaches leverage low-dimensional and strongly-structured skeleton features, which makes the anomaly detector immune to complex background noise and more efficient. However, existing pose-based methods only utilize the pose of each individual independently while ignoring the important interactions between individuals. In this paper, we present a hierarchical graph embedded pose regularity learning framework based on a spatio-temporal transformer, which leverages the strength of graph representation in encoding strongly-structured skeleton features. Specifically, the skeleton feature is encoded as a hierarchical graph representation, which jointly models the interactions among multiple individuals and the correlations among body joints within the same individual. Furthermore, a novel task-specific spatio-temporal graph transformer is designed to encode the hierarchical spatio-temporal graph embeddings of human skeletons and learn the regular patterns within normal training videos. Experimental results indicate that our method obtains superior performance over state-of-the-art methods on several challenging datasets.

HMTN: Hierarchical Multi-scale Transformer Network for 3D Shape Recognition

  • Yue Zhao
  • Weizhi Nie
  • Zan Gao
  • An-an Liu

As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. Various approaches have been proposed, among which the multiview-based methods show promising performance. In general, an effective 3D shape recognition algorithm should take both the multiview local and global visual information into consideration, and explore the inherent properties of the generated 3D descriptors to guarantee the performance of feature alignment in the common space. To tackle these issues, we propose a novel Hierarchical Multi-scale Transformer Network (HMTN) for the 3D shape recognition task. In HMTN, we propose a multi-level regional transformer (MLRT) module for shape descriptor generation. MLRT includes two branches that aim to extract intra-view local characteristics by modeling region-wise dependencies and to provide supervision from multiview global information at different granularities. Specifically, MLRT can comprehensively consider the relations of different regions and focus on the discriminative parts, which improves the effectiveness of the learned descriptors. Finally, we adopt the cross-granularity contrastive learning (CCL) mechanism for shape descriptor alignment in the common space. It can explore and utilize the cross-granularity semantic correlation to guide the descriptor extraction process while performing instance alignment based on the category information. We evaluate the proposed network on several public benchmarks, and HMTN achieves competitive performance compared with the state-of-the-art (SOTA) methods.

IDEAL: High-Order-Ensemble Adaptation Network for Learning with Noisy Labels

  • Peng-Fei Zhang
  • Zi Huang
  • Guangdong Bai
  • Xin-Shun Xu

Data annotations obtained for supervised learning often suffer from label noise, which inevitably results in unreliable deep neural networks. Existing solutions to this problem typically limit the scope to instance-independent label noise. Due to the high illegibility of data and the inexperience of annotators, instance-dependent noise has also been widely observed, but remains largely uninvestigated. In this paper, we propose a novel IDEntify and ALign (IDEAL) methodology, which aims to eliminate the feature distribution shift raised by a broad spectrum of noise patterns. The proposed model is capable of learning noise-resilient feature representations, thereby correctly predicting data instances. More specifically, we formulate robust learning against noisy labels as a domain adaptation problem by identifying noisy data (i.e., data samples with incorrect labels) and clean data from the dataset as two domains and minimizing their domain discrepancy in the feature space. In this framework, a high-order-ensemble adaptation network is devised to provide high-confidence predictions, according to which a specific criterion is defined for differentiating clean and noisy data. A new metric based on data augmentation is designed to measure the discrepancy between the clean and noisy domains. Along with a min-max learning strategy between the feature encoder and the classifier on the discrepancy, the domain gap will be bridged, which encourages a noise-resilient model. In-depth theoretical analysis and extensive experiments on widely-used benchmark datasets demonstrate the effectiveness of the proposed method.

DVR: Micro-Video Recommendation Optimizing Watch-Time-Gain under Duration Bias

  • Yu Zheng
  • Chen Gao
  • Jingtao Ding
  • Lingling Yi
  • Depeng Jin
  • Yong Li
  • Meng Wang

Recommender systems are prone to be misled by biases in the data. Models trained with biased data fail to capture the real interests of users, thus it is critical to alleviate the impact of bias to achieve unbiased recommendation. In this work, we focus on an essential bias in micro-video recommendation, duration bias. Specifically, existing micro-video recommender systems usually consider watch time as the most critical metric, which measures how long a user watches a video. Since videos with longer duration tend to have longer watch time, there exists a duration bias that makes longer videos more likely to be recommended than shorter ones. In this paper, we empirically show that commonly-used metrics are vulnerable to duration bias, making them NOT suitable for evaluating micro-video recommendation. To address this, we further propose an unbiased evaluation metric, called WTG (short for Watch Time Gain). Empirical results reveal that WTG can alleviate duration bias and better measure recommendation performance. Moreover, we design a simple yet effective model named DVR (short for Debiased Video Recommendation) that can provide unbiased recommendation of micro-videos with varying duration, and learn unbiased user preferences via adversarial learning. Extensive experiments based on two real-world datasets demonstrate that DVR successfully eliminates duration bias and significantly improves recommendation performance with over 30% relative improvement. Codes and datasets are released at https://github.com/tsinghua-fib-lab/WTG-DVR.

SESSION: Poster Session II: Engaging User with Multimedia -- Multimedia Search and Recommendation

Video Moment Retrieval with Hierarchical Contrastive Learning

  • Bolin Zhang
  • Chao Yang
  • Bin Jiang
  • Xiaokang Zhou

This paper explores the task of video moment retrieval (VMR), which aims to localize the temporal boundary of a specific moment from an untrimmed video by a sentence query. Previous methods either extract pre-defined candidate moment features and select the moment that best matches the query by ranking, or directly align the boundary clips of a target moment with the query and predict matching scores. Despite their effectiveness, these methods mostly focus only on aligning the query and single-level clip or moment features, and ignore the different granularities involved in the video itself, such as clip, moment, or video, resulting in insufficient cross-modal interaction. To this end, we propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. Specifically, we introduce a hierarchical contrastive learning method to better align the query and video by maximizing the mutual information (MI) between query and three different granularities of video to learn informative representations. Meanwhile, we introduce a self-supervised cycle-consistency loss to enforce the further semantic alignment between fine-grained video clips and query words. Experiments on three standard benchmarks show the effectiveness of our proposed method.
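
As an illustration of maximizing mutual information between the query and several video granularities, the sketch below sums an in-batch InfoNCE term per granularity. The separate granularity embeddings, their encoders, and the temperature are assumptions; this is not the HCLNet implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07):
    """In-batch InfoNCE: the i-th query should match the i-th key.
    query, keys: (B, D) embeddings."""
    query = F.normalize(query, dim=1)
    keys = F.normalize(keys, dim=1)
    logits = query @ keys.t() / temperature       # (B, B) similarity logits
    targets = torch.arange(query.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

def hierarchical_contrastive_loss(query_emb, clip_emb, moment_emb, video_emb):
    """Sum one contrastive (MI lower-bound) term between the sentence query
    and each of three video granularities: clip, moment and video."""
    return (info_nce(query_emb, clip_emb)
            + info_nce(query_emb, moment_emb)
            + info_nce(query_emb, video_emb))

# Toy usage: a batch of 4 queries with 32-dim embeddings per granularity.
B, D = 4, 32
loss = hierarchical_contrastive_loss(*(torch.randn(B, D) for _ in range(4)))
print(loss)
```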

Learning to Retrieve Videos by Asking Questions

  • Avinash Madasu
  • Junier Oliva
  • Gedas Bertasius

The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be suboptimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog. The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance. Our multimodal question generator uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose an Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. Furthermore, we demonstrate that our proposed approach also generalizes to real-world settings that involve interactions with real humans, thus demonstrating the robustness and generality of our framework.

HEART: Towards Effective Hash Codes under Label Noise

  • Jinan Sun
  • Haixin Wang
  • Xiao Luo
  • Shikun Zhang
  • Wei Xiang
  • Chong Chen
  • Xian-Sheng Hua

Hashing, which encodes raw data into compact binary codes, has grown in popularity for large-scale image retrieval due to its storage and computation efficiency. Although deep supervised hashing methods have lately shown promising performance, they mostly assume that the semantic labels of training data are ideally noise-free, which is often unrealistic in real-world applications. In this paper, considering the practical application, we focus on the problem of learning to hash with label noise and propose a novel method called HEART to address the problem. HEART is a holistic framework which explores latent semantic distributions to select both clean samples and pairs of high confidence for mitigating the impacts of label noise. From a statistical perspective, our HEART characterizes each image by its multiple augmented views, which can be considered as examples from its latent distribution, and then calculates semantic distances between images using energy distances between their latent distributions. With semantic distances, we can select confident similar pairs to guide hashing contrastive learning for high-quality hash codes. Moreover, to prevent the memorization of noisy examples, we propose a novel strategy to identify clean samples, which have small variations of losses on the latent distributions, and train the network on clean samples using a pointwise loss. Experimental results on several popular benchmark datasets demonstrate the effectiveness of our HEART compared with a wide range of baselines.
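
The energy distance between two images' latent distributions, each represented by the embeddings of several augmented views, can be estimated as below. This is a generic energy-distance sketch under those assumptions, not the HEART code.

```python
import torch

def pairwise_mean_dist(a, b):
    """Mean Euclidean distance over all pairs drawn from a (M, D) and b (N, D)."""
    return torch.cdist(a, b).mean()

def energy_distance(views_x, views_y):
    """Energy distance between two empirical distributions, each given by the
    embeddings of several augmented views of one image:
        E(X, Y) = 2 E||x - y|| - E||x - x'|| - E||y - y'||
    """
    return (2 * pairwise_mean_dist(views_x, views_y)
            - pairwise_mean_dist(views_x, views_x)
            - pairwise_mean_dist(views_y, views_y))

# Toy usage: 4 augmented views per image, 64-dim embeddings.
x_views, y_views = torch.randn(4, 64), torch.randn(4, 64)
print(energy_distance(x_views, y_views))  # larger value => more dissimilar images
```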

Learning Hybrid Behavior Patterns for Multimedia Recommendation

  • Zongshen Mu
  • Yueting Zhuang
  • Jie Tan
  • Jun Xiao
  • Siliang Tang

Multimedia recommendation aims to predict user preferences where users interact with multimodal items. Collaborative filtering based on graph convolutional networks manifests impressive performance gains in multimedia recommendation. This is attributed to the capability of learning good user and item embeddings by aggregating the collaborative signals from high-order neighbors. However, previous studies [37,38] fail to explicitly mine different behavior patterns (i.e., item categories, common user interests) by exploiting user-item and item-item graphs simultaneously, which plays an important role in modeling user preferences. It is this lack of behavior pattern constraints and multimodal feature reconciliation that results in performance degradation. Towards this end, we propose a Hybrid Clustering Graph Convolutional Network (HCGCN) for multimedia recommendation. We perform high-order graph convolutions inside user-item clusters and item-item clusters to capture various user behavior patterns. Meanwhile, we design corresponding clustering losses to enhance user-item preference feedback and a multimodal representation learning constraint to adjust the modality importance, making more accurate recommendations. Experimental results on three real-world multimedia datasets not only demonstrate the significant improvement of our model over the state-of-the-art methods, but also validate the effectiveness of integrating hybrid user behavior patterns for multimedia recommendation.

Breaking Isolation: Multimodal Graph Fusion for Multimedia Recommendation by Edge-wise Modulation

  • Feiyu Chen
  • Junjie Wang
  • Yinwei Wei
  • Hai-Tao Zheng
  • Jie Shao

In a multimedia recommender system, the rich multimodal dynamics of user-item interactions are worth exploiting and have been facilitated by Graph Convolutional Networks (GCNs). Yet, the typical way of conducting multimodal fusion with GCN-based models is either through graph mergence fusion that delivers insufficient inter-modal dynamics, or through node alignment fusion that brings in noise which potentially harms multimodal modelling. Unlike existing works, we propose EgoGCN, a structure that seeks to enhance multimodal learning of user-item interactions. At its core is a simple yet effective fusion operation dubbed EdGe-wise mOdulation (EGO) fusion. EGO fusion adaptively distils edge-wise multimodal information and learns to modulate each unimodal node under the supervision of other modalities. It breaks isolated unimodal propagations and allows the most informative inter-modal messages to spread, whilst preserving intra-modal processing. We present a hard modulation and a soft modulation to fully investigate the multimodal dynamics behind them. Experiments on two real-world datasets show that EgoGCN comfortably beats prior methods.

Image-Text Matching with Fine-Grained Relational Dependency and Bidirectional Attention-Based Generative Networks

  • Jianwei Zhu
  • Zhixin Li
  • Yufei Zeng
  • Jiahui Wei
  • Huifang Ma

Generally, most existing cross-modal retrieval methods only consider global or local semantic embeddings, lacking fine-grained dependencies between objects. At the same time, it is usually ignored that the mutual transformation between modalities also facilitates the embedding of modalities. Given these problems, we propose a method called BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation). The model uses a bidirectional graph convolutional neural network to establish dependencies between objects. In addition, it employs a bidirectional attention-based generative network to achieve the mutual transformation between modalities. Specifically, the knowledge graph is used for local matching to constrain the local expression of the modalities, in which the generative network is used for mutual transformation to constrain the global expression of the modalities. In addition, we also propose a new position relation embedding network to embed position relation information between objects. The experiments on two public datasets show that the performance of our method has been dramatically improved compared to many state-of-the-art models.

Visual Grounding in Remote Sensing Images

  • Yuxi Sun
  • Shanshan Feng
  • Xutao Li
  • Yunming Ye
  • Jian Kang
  • Xu Huang

Ground object retrieval from a large-scale remote sensing image is very important for many applications. We present a novel problem of visual grounding in remote sensing images. Visual grounding aims to locate the particular objects (in the form of a bounding box or segmentation mask) in an image referred to by a natural language expression. The task already exists in the computer vision community. However, existing benchmark datasets and methods mainly focus on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude, latitude). Existing methods cannot deal with these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. In particular, the proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder is used to learn numerical geospatial relations and represent a complex expression as a geospatial relation graph. The image encoder is applied to learn large-scale remote sensing scenes with adaptive region attention. The fusion module is used to fuse the text and image features for visual grounding. We evaluate the proposed method by comparing it to the state-of-the-art methods on RSVG. Experiments show that our method outperforms the previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG

Prompt-based Zero-shot Video Moment Retrieval

  • Guolong Wang
  • Xun Wu
  • Zhaoyuan Liu
  • Junchi Yan

Video moment retrieval aims at localizing a specific moment from an untrimmed video by a sentence query. Most methods rely on heavy annotations of video moment-query pairs. Recent zero-shot methods reduce annotation cost, yet they neglect global visual features due to the separation of the video and text learning processes. To avoid the lack of visual features, we propose a Prompt-based Zero-shot Video Moment Retrieval (PZVMR) method. Motivated by the framework of prompt learning, we design two modules: 1) Proposal Prompt (PP): we randomly mask sequential frames to build a prompt to generate proposals; 2) Verb Prompt (VP): we provide patterns of nouns and the masked verb to build a prompt to generate pseudo queries with verbs. Our PZVMR utilizes task-relevant knowledge distilled from pre-trained CLIP and adapts the knowledge to VMR. Unlike the pioneering work, we introduce visual features into each module. Extensive experiments show that our PZVMR not only outperforms the existing zero-shot method (PSVL) on two public datasets (Charades-STA and ActivityNet-Captions) by 4.4% and 2.5% respectively in mIoU, but also outperforms several methods using stronger supervision.

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

  • Yabing Wang
  • Jianfeng Dong
  • Tianxiang Liang
  • Minsong Zhang
  • Rui Cai
  • Xun Wang

Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising the retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets to provide direct supervision from the similarity-based view and the feature-based view. Besides, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision and language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.

Learn to Understand Negation in Video Retrieval

  • Ziyue Wang
  • Aozhu Chen
  • Fan Hu
  • Xirong Li

Negation is a common linguistic skill that allows humans to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with a dog. However, state-of-the-art deep learning based video retrieval models lack such ability, as they are typically trained on video description datasets such as MSR-VTT and VATEX that lack negated descriptions. Their retrieved results basically ignore the negator in the sample query, incorrectly returning videos showing kids playing with a dog. This paper presents the first study on learning to understand negation in video retrieval and makes the following contributions. By re-purposing two existing datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video retrieval with negation. We propose a learning based method for training a negation-aware video retrieval model. The key idea is to first construct a soft negative caption for a specific training video by partially negating its original caption, and then compute a bidirectionally constrained loss on the triplet. This auxiliary loss is weightedly added to a standard retrieval loss. Experiments on the re-purposed benchmarks show that re-training the CLIP (Contrastive Language-Image Pre-Training) model by the proposed method clearly improves its ability to handle queries with negation. In addition, the model performance on the original benchmarks is also improved.
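
One plausible reading of the bidirectionally constrained auxiliary loss is a pair of margin constraints on the (video, original caption, soft-negative caption) triplet, as sketched below; the exact constraints, margins, and weight are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def negation_aware_loss(sim_pos, sim_softneg, sim_rand, m1=0.2, m2=0.2, weight=0.5):
    """Illustrative bidirectionally constrained auxiliary loss (an assumption):
    for each video, the original caption should score above its partially
    negated ("soft negative") caption by a margin, while the soft negative
    should still score above an unrelated caption.

    sim_pos, sim_softneg, sim_rand: (B,) similarities of the captions to the video.
    """
    upper = F.relu(m1 + sim_softneg - sim_pos).mean()   # original > soft negative
    lower = F.relu(m2 + sim_rand - sim_softneg).mean()  # soft negative > random caption
    return weight * (upper + lower)                     # added to the standard retrieval loss

# Toy usage with a batch of 3 videos.
pos = torch.tensor([0.8, 0.7, 0.9])
soft = torch.tensor([0.5, 0.6, 0.4])
rand = torch.tensor([0.1, 0.2, 0.0])
print(negation_aware_loss(pos, soft, rand))
```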

AdsCVLR: Commercial Visual-Linguistic Representation Modeling in Sponsored Search

  • Yongjie Zhu
  • Chunhui Han
  • Yuefeng Zhan
  • Bochen Pang
  • Zhaoju Li
  • Hao Sun
  • Si Li
  • Boxin Shi
  • Nan Duan
  • Weiwei Deng
  • Ruofei Zhang
  • Liangjie Zhang
  • Qi Zhang

Sponsored search advertisements (ads) appear next to search results when consumers look for products and services on search engines. As the fundamental basis of search ads, relevance modeling has attracted increasing attention due to the significant research challenges and tremendous practical value. In this paper, we address the problem of multi-modal modeling in sponsored search, which models the relevance between user query and commercial ads with multi-modal structured information. To solve this problem, we propose a transformer architecture with Ads data on Commercial Visual-Linguistic Representation (AdsCVLR) with contrastive learning that naturally extends the transformer encoder with the complementary multi-modal inputs, serving as a strong aggregator of image-text features. We also make a public advertising dataset, which includes 480K labeled query-ad pairwise data with structured information of image, title, seller, description, and so on. Empirically, we evaluate the AdsCVLR model over the large industry dataset, and the experimental results of online/offline tests show the superiority of our method.

Differentiable Cross-modal Hashing via Multimodal Transformers

  • Junfeng Tu
  • Xueliang Liu
  • Zongxiang Lin
  • Richang Hong
  • Meng Wang

Cross-modal hashing aims at projecting cross-modal content into a common Hamming space for efficient search. Most existing work first encodes the samples with a deep network and then binarizes the encoded feature into a hash code. However, the relative location information in the image may be lost when an image is encoded by the convolutional network, which makes it challenging to model the relationship between different modalities. Moreover, it is NP-hard to optimize the model with the discrete sign binary function popularly used in existing solutions. To address these issues, we propose a differentiable cross-modal hashing method that utilizes the multimodal transformer as the backbone to capture the location information in an image when encoding the visual content. In addition, a novel differentiable cross-modal hashing method is proposed to generate the binary code by a selecting mechanism, which can be formulated as a continuous and easily optimized problem. We perform extensive experiments on several cross-modal datasets and the results show that the proposed method outperforms many existing solutions.
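
A common way to make the bit selection continuous and optimizable is a Gumbel-softmax relaxation over the two possible bit values, sketched below; the paper's actual selecting mechanism may differ, so treat this purely as an illustration.

```python
import torch
import torch.nn.functional as F

def select_hash_bits(logits, tau=1.0, hard=True):
    """Differentiable bit selection via a Gumbel-softmax relaxation (an
    illustrative stand-in, not necessarily the paper's mechanism).

    logits: (B, K, 2) unnormalized scores for choosing -1 or +1 per bit.
    Returns (B, K) codes in {-1, +1}; gradients flow through the relaxed
    selection (straight-through when hard=True).
    """
    probs = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)   # (B, K, 2)
    values = torch.tensor([-1.0, 1.0], device=logits.device)       # candidate bit values
    return probs @ values                                          # (B, K)

# Toy usage: batch of 2 items, 8-bit codes.
logits = torch.randn(2, 8, 2, requires_grad=True)
codes = select_hash_bits(logits)
codes.sum().backward()      # gradients reach the logits despite discrete codes
print(codes)
```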

Multi-Level Region Matching for Fine-Grained Sketch-Based Image Retrieval

  • Zhixin Ling
  • Zhen Xing
  • Jiangtong Li
  • Li Niu

Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) uses free-hand sketches as queries to perform instance-level retrieval in an image gallery. Existing works usually leverage only high-level information and perform matching in a single region. However, both low-level and high-level information are helpful to establish fine-grained correspondence. Besides, we argue that matching different regions between each sketch-image pair can further boost model robustness. Therefore, we propose Multi-Level Region Matching (MLRM) for FG-SBIR, which consists of two modules: a Discriminative Region Extraction module (DRE) and a Region and Level Attention module (RLA). In DRE, we propose Light-weighted Attention Map Augmentation (LAMA) to extract local features from different regions. In RLA, we propose a transformer-based attentive matching module to learn attention weights that explore the different importance of image/sketch regions and feature levels. Furthermore, to ensure that the geometrical and semantic distinctiveness is well modeled, we also explore a novel LAMA overlapping penalty and a local region-negative triplet loss in our proposed MLRM method. Comprehensive experiments conducted on five datasets (i.e., Sketchy, QMUL-ChairV2, QMUL-ShoeV2, QMUL-Chair, QMUL-Shoe) demonstrate the effectiveness of our method.

DDGHM: Dual Dynamic Graph with Hybrid Metric Training for Cross-Domain Sequential Recommendation

  • Xiaolin Zheng
  • Jiajie Su
  • Weiming Liu
  • Chaochao Chen

Sequential Recommendation (SR) characterizes evolving patterns of user behaviors by modeling how users transit among items. However, the short interaction sequences limit the performance of existing SR. To solve this problem, we focus on Cross-Domain Sequential Recommendation (CDSR) in this paper, which aims to leverage information from other domains to improve the sequential recommendation performance of a single domain. Solving CDSR is challenging. On the one hand, how to retain single domain preferences as well as integrate cross-domain influence remains an essential problem. On the other hand, the data sparsity problem cannot be totally solved by simply utilizing knowledge from other domains, due to the limited length of the merged sequences. To address the challenges, we propose DDGHM, a novel framework for the CDSR problem, which includes two main modules, i.e., dual dynamic graph modeling and hybrid metric training. The former captures intra-domain and inter-domain sequential transitions through dynamically constructing two-level graphs, i.e., the local graphs and the global graphs, and incorporating them with a fuse attentive gating mechanism. The latter enhances user and item representations by employing hybrid metric learning, including collaborative metric for achieving alignment and contrastive metric for preserving uniformity, to further alleviate data sparsity issue and improve prediction accuracy. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness of DDGHM.

Spatial-Temporal Aligned Multi-Agent Learning for Visual Dialog Systems

  • Yong Zhuang
  • Tong Yu
  • Junda Wu
  • Shiqu Wu
  • Shuai Li

Existing interactive learning systems usually train models on simulators as surrogates for real users. Due to the limited amount of user data, trained simulators may lead to biased results as they fail to represent real users well. One solution is to model users as agents, and then simultaneously train the interactive system and user agents by multi-agent reinforcement learning (MARL) frameworks. However, developing efficient MARL frameworks for modern interactive multimodal systems is still challenging. First, given the existence of multimodal data, how to develop accurate multimodal fusion within and between agents in each interaction is challenging and unclear. Second, interactions between users and systems are complex and it is challenging to track and synchronize the interactions over time. The above multimodal fusion between agents and synchronization over time becomes even more challenging when the amount of user data is limited. To jointly address these challenges and achieve more sample-efficient learning, we propose a novel spatial-temporal aligned (STA) multi-agent reinforcement learning framework to better align the multimodal data within and between agents over time. Based on our framework, we develop sample-efficient visual dialog systems. Through extensive experiments and analysis, we validate the effectiveness of our STA multi-agent reinforcement learning framework in visual dialog systems.

Learning Intrinsic and Extrinsic Intentions for Cold-start Recommendation with Neural Stochastic Processes

  • Huafeng Liu
  • Liping Jing
  • Dahai Yu
  • Mingjie Zhou
  • Michael Ng

User behavior data in recommendation are driven by the complex interactions of many intentions behind the user's decision making process. However, user behavior data tend to be sparse because of limited user responses and the vast combinations of users and items, which results in unclear user intentions and the cold-start problem. The intentions are highly compound, and may range from high-level ones that govern the user's intrinsic interests and capture the underlying reasons behind the user's decision making processes, to low-level ones that characterize a user's extrinsic preference when executing intentions on specific items. In this paper, we propose an intention neural process model (INP) for user cold-start recommendation (i.e., users with very few historical interactions), a novel extension of the neural stochastic process family using a general meta learning strategy with intrinsic and extrinsic intention learning for robust user preference learning. By regarding the recommendation process for each user as a stochastic process, INP defines distributions over functions and is capable of rapid adaptation to new users. Our approach learns intrinsic intentions by inferring the high-level concepts associated with user interests or purposes, while capturing the target preference of a user by performing self-supervised intention matching between historical items and target items in a disentangled latent space. Extrinsic intentions are learned by simultaneously generating the point-wise implicit feedback data and creating the pair-wise ranking list by sufficiently exploiting both interacted and non-interacted items for each user. Empirical results show that our approach achieves substantial improvement over the state-of-the-art baselines on cold-start recommendation.

Camera-specific Informative Data Augmentation Module for Unbalanced Person Re-identification

  • Pingting Hong
  • Dayan Wu
  • Bo Li
  • Weiping Wang

Person re-identification (Re-ID) aims at retrieving the same person across non-overlapping camera networks. Recent works have achieved impressive performance due to the rapid development of deep learning techniques. However, most existing methods have ignored the practical unbalanced property in real-world Re-ID scenarios. In fact, the number of pedestrian images in different cameras varies a lot. Some cameras cover thousands of images while others only have a few. As a result, the camera-unbalanced problem will reduce intra-camera diversity, and the model cannot learn camera-invariant features to distinguish pedestrians from "poor" cameras. In this paper, we design a novel camera-specific informative data augmentation module (CIDAM) to alleviate the camera-unbalanced problem. Specifically, we first calculate the camera-specific distribution online, then refine the "poor" camera-specific covariance matrix with similar cameras defined in the prototype-based similarity matrix. Consequently, informative augmented samples are generated by combining original samples with sampled random vectors in feature space. To ensure these augmented samples can better benefit the model training, we further propose a dynamic-threshold-based contrastive loss. Since augmented samples may not be as real as original ones, we calculate a threshold for each original one dynamically and only push hard negative augmented samples away. Moreover, our CIDAM is compatible with a variety of existing Re-ID methods. Extensive experiments prove the effectiveness of our method.

TopicVAE: Topic-aware Disentanglement Representation Learning for Enhanced Recommendation

  • Zhiqiang Guo
  • Guohui Li
  • Jianjun Li
  • Huaicong Chen

Learning disentangled representations that reflect user preference based on user behavior (implicit feedback, such as clicks and purchases) and content information (e.g., plot description, poster) has become a hot research topic in modern recommender systems. However, most existing methods considering content information are not well-designed to disentangle user preference features because they neglect the diversity of user preference on different semantic topics of items, resulting in sub-optimal performance and low interpretability. To address this problem, we propose a novel Topic-aware Disentangled Variational AutoEncoder (TopicVAE) to learn disentangled representations for enhanced recommendation. Specifically, we first utilize an attention-based topic extraction module to extract topic-level item representations and the topic-item probability distribution from item content, and then introduce a variational autoencoder to infer topic-level disentangled user representations. To guide the learning of topic-level disentanglement, we present a topic-guided self-supervised contrastive loss to promote the distinctiveness of different topics by introducing a neighborhood-based user representation as guidance. Besides, a heuristic regularization is designed to force each dimension of the disentangled representations to independently reflect a fine-grained factor of a specific topic (e.g., red or blue for color) for feature-level disentanglement. Extensive experimental studies on three public datasets show that TopicVAE significantly outperforms several state-of-the-art baselines. Further empirical experiments also illustrate the interpretability of the disentangled representations learned by TopicVAE.

Pixel-Level Anomaly Detection via Uncertainty-aware Prototypical Transformer

  • Chao Huang
  • Chengliang Liu
  • Zheng Zhang
  • Zhihao Wu
  • Jie Wen
  • Qiuping Jiang
  • Yong Xu

Pixel-level visual anomaly detection, which aims to recognize the abnormal areas from images, plays an important role in industrial fault detection and medical diagnosis. However, it is a challenging task due to the following reasons: i) the large variation of anomalies; and ii) the ambiguous boundary between anomalies and their normal surroundings. In this work, we present an uncertainty-aware prototypical transformer (UPformer), which takes into account both the diversity and uncertainty of anomaly to achieve accurate pixel-level visual anomaly detection. To this end, we first design a memory-guided prototype learning transformer encoder to learn and memorize the prototypical representations of anomalies for enabling the model to capture the diversity of anomalies. Additionally, an anomaly detection uncertainty quantizer is designed to learn the distributions of anomaly detection for measuring the anomaly detection uncertainty. Furthermore, an uncertainty-aware transformer decoder is proposed to leverage the detection uncertainties to guide the model to focus on the uncertain areas and generate the final detection results. As a result, our method achieves more accurate anomaly detection by combining the benefits of prototype learning and uncertainty estimation. Experimental results on five datasets indicate that our method achieves state-of-the-art anomaly detection performance.

Dynamic Prototype Mask for Occluded Person Re-Identification

  • Lei Tan
  • Pingyang Dai
  • Rongrong Ji
  • Yongjian Wu

Although person re-identification has achieved an impressive improvement in recent years, the common occlusion case caused by different obstacles is still an unsettled issue in real application scenarios. Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part. Nevertheless, the inevitable domain gap between the assistant model and the ReID datasets has greatly increased the difficulty of obtaining an effective and efficient model. To escape from the extra pre-trained networks and achieve automatic alignment in an end-to-end trainable network, we propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge. Specifically, we first devise a Hierarchical Mask Generator which utilizes hierarchical semantics to select the visible pattern space between the high-quality holistic prototype and the feature representation of the occluded input image. Under this condition, the occluded representation can be well aligned in a selected subspace spontaneously. Then, to enrich the feature representation of the high-quality holistic prototype and provide a more complete feature space, we introduce a Head Enrich Module to encourage different heads to aggregate different pattern representations in the whole image. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate the superior performance of DPM over the state-of-the-art methods.

Meta Reconciliation Normalization for Lifelong Person Re-Identification

  • Nan Pu
  • Yu Liu
  • Wei Chen
  • Erwin M. Bakker
  • Michael S. Lew

Lifelong person re-identification (LReID) is a challenging and emerging task, which concerns the ReID capability on both seen and unseen domains after learning across different domains continually. Existing works on LReID are devoted to introducing commonly-used lifelong learning approaches, while neglecting a serious side effect caused by using normalization layers in the context of domain-incremental learning. In this work, we aim to raise awareness of the importance of training proper batch normalization layers by proposing a new meta reconciliation normalization (MRN) method specifically designed for tackling LReID. Our MRN consists of grouped mixture standardization and additive rectified rescaling components, which are able to automatically maintain an optimal balance between domain-dependent and domain-independent statistics, and even adapt MRN to different testing instances. Furthermore, inspired by synaptic plasticity in the human brain, we present an MRN-based meta-learning framework for mining the meta-knowledge shared across different domains, even without replaying any previous data, and further improve the model's LReID ability with theoretical analyses. Our method achieves new state-of-the-art performance on both balanced and imbalanced LReID benchmarks.

Attack is the Best Defense: Towards Preemptive-Protection Person Re-Identification

  • Lin Wang
  • Wanqian Zhang
  • Dayan Wu
  • Fei Zhu
  • Bo Li

Person Re-IDentification (ReID) aims at retrieving images of the same person across multiple camera views. Despite its popularity in surveillance and public safety, the leakage of identity information is still at risk. For example, once malicious users obtain illegal access to ReID systems, they can accurately retrieve the target person, leading to the exposure of private information. Recently, some pioneering works protect private images with adversarial examples by adding imperceptible perturbations to target images. However, in this paper, we argue that directly applying adversary-based methods to protect the ReID system is sub-optimal due to the 'overlap identity' issue. Specifically, merely pushing the adversarial image away from its original label would probably make it move into the vicinity of other identities. This leads to the potential risk of being retrieved when querying with all the other identities exhaustively. We thus propose a novel preemptive-Protection person Re-IDentification (PRIDE) method. By explicitly constraining the adversarial image to an isolated location, the target person stays far away from both the original identity and all other identities, which protects him from being retrieved by illegal queries. Moreover, we further propose two crucial attack scenarios (Random Attack and Order Attack) and a novel Success Protection Rate (SPR) metric to quantify the protection ability. Experiments show consistent outperformance of our method over other baselines across different ReID models, datasets and attack scenarios.

TAGPerson: A Target-Aware Generation Pipeline for Person Re-identification

  • Kai Chen
  • Weihua Chen
  • Tao He
  • Rong Du
  • Fan Wang
  • Xiuyu Sun
  • Yuchen Guo
  • Guiguang Ding

Nowadays, real data in the person re-identification (ReID) task is facing privacy issues, e.g., the banned dataset DukeMTMC-ReID. Thus it becomes much harder to collect real data for the ReID task. Meanwhile, the labor cost of labeling ReID data is still very high and further hinders the development of ReID research. Therefore, many methods turn to generating synthetic images for ReID algorithms as alternatives to real images. However, there is an inevitable domain gap between synthetic and real images. In previous methods, the generation process is based on virtual scenes, and their synthetic training data cannot be changed automatically according to different target real scenes. To handle this problem, we propose a novel Target-Aware Generation pipeline to produce synthetic person images, called TAGPerson. Specifically, it involves a parameterized rendering method, where the parameters are controllable and can be adjusted according to the target scenes. In TAGPerson, we extract information from target scenes and use it to control our parameterized rendering process to generate target-aware synthetic images, which hold a smaller gap to the real images in the specific target domain. In our experiments, our target-aware synthetic images achieve much higher performance than the generalized synthetic images on MSMT17, i.e., 47.5% vs. 40.9% rank-1 accuracy. We will release this toolkit for the ReID community to generate synthetic images in any desired style. The code is available at: https://github.com/tagperson/tagperson-blender

Efficient Hash Code Expansion by Recycling Old Bits

  • Dayan Wu
  • Qinghang Su
  • Bo Li
  • Weiping Wang

Deep hashing methods have been intensively studied and successfully applied in large-scale multimedia retrieval. In real-world scenarios, the code length cannot be set once and for all if retrieval accuracy is not satisfactory. However, when the code length increases, conventional deep hashing methods have to retrain their models and regenerate the whole database codes, which is impractical for large-scale retrieval systems. In this paper, we propose an interesting deep hashing method from a brand new perspective, called Code Expansion oriented Deep Hashing (CEDH). Different from conventional deep hashing methods, our CEDH focuses on the fast expansion of existing hash codes. Instead of regenerating all bits from raw images, the new bits in CEDH can be incrementally learned by recycling the old ones. Specifically, we elaborately design an end-to-end asymmetric framework to simultaneously optimize a CNN model for query images and a code projection matrix for database images. With the learned code projection matrix, hash codes can achieve fast expansion through simple matrix multiplication. Subsequently, a novel code expansion hashing loss is proposed to preserve the similarities between query codes and expanded database codes. Due to the loose coupling in our framework, our CEDH is compatible with a variety of deep hashing methods. Moreover, we propose to adopt a smooth similarity matrix to solve the "similarity contradiction" problem existing in multi-label image datasets, thus further improving our performance on multi-label datasets. Extensive experiments on three widely used image retrieval benchmarks demonstrate that CEDH can significantly reduce the cost of expanding database codes (about 100,000x faster with GPU and 1,000,000x faster with CPU) when the code length increases while keeping state-of-the-art retrieval accuracy. Our code is available at https://github.com/IIE-MMR/2022MM-CEDH.
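
The code-expansion step itself reduces to a single matrix multiplication over the old database codes, as sketched below; the sign binarization and tie handling are assumptions made for the illustration.

```python
import torch

def expand_database_codes(old_codes, projection):
    """Expand existing database hash codes to a longer length by recycling the
    old bits: new bits come from the old code times a learned projection
    matrix, followed by binarization (a simplified sketch of the idea).

    old_codes:  (N, K)      codes in {-1, +1}
    projection: (K, K_new)  learned code projection matrix
    Returns (N, K + K_new) expanded codes.
    """
    new_bits = torch.sign(old_codes.float() @ projection)
    new_bits[new_bits == 0] = 1.0                       # break ties away from zero
    return torch.cat([old_codes.float(), new_bits], dim=1)

# Toy usage: expand 16-bit codes of 5 database items to 32 bits.
old = torch.randint(0, 2, (5, 16)).float() * 2 - 1      # random codes in {-1, +1}
proj = torch.randn(16, 16)
print(expand_database_codes(old, proj).shape)           # torch.Size([5, 32])
```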

Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for Personalized Micro-video Recommendation

  • Desheng Cai
  • Shengsheng Qian
  • Quan Fang
  • Jun Hu
  • Changsheng Xu

Micro-video recommendation has attracted extensive research attention with the increasing popularity of micro-video sharing platforms. A substantial amount of effort has been devoted to the micro-video recommendation task. Recently, homogeneous (or heterogeneous) GNN-based approaches utilize graph convolutional operators (or meta-path based similarity measures) to learn meaningful representations for users and micro-videos and show promising performance for the micro-video recommendation task. However, these methods may suffer from the following problems: (1) they fail to aggregate information from distant or long-range nodes; (2) they ignore the varying intensity of users' preferences for different items in micro-video recommendations; (3) they neglect the similarities of the multi-modal contents of micro-videos for recommendation tasks. In this paper, we propose a novel Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for personalized micro-video recommendation. Specifically, we design a collaborative representation learning module and a semantic representation learning module to fully exploit user-video interaction information and the similarities of micro-videos, respectively. Furthermore, we utilize an anti-bottleneck module to automatically learn the importance weights of short-range and long-range neighboring nodes to obtain more expressive representations of users and micro-videos. Finally, to consider the varying intensity of users' preferences for different micro-videos, we design and optimize an adaptive recommendation loss to train our model in an end-to-end manner. We evaluate our method on three real-world datasets and the results demonstrate that the proposed model outperforms the baselines.

Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

  • Uttaran Bhattacharya
  • Gang Wu
  • Stefano Petrangeli
  • Viswanathan Swaminathan
  • Dinesh Manocha

We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
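
The overall scoring scheme, attention-pooling a user's preferred clips into one vector and comparing it with target-video frames, can be sketched as follows; the feature dimension, the learned pooling query, and cosine similarity as the matching function are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

class UserHighlightScorer(torch.nn.Module):
    """Illustrative sketch: fuse a user's preferred-clip features with
    multi-head attention into one user vector, then score target-video
    frames by cosine similarity to that vector."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool_query = torch.nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query

    def forward(self, clip_feats, frame_feats):
        # clip_feats: (1, C, D) preferred clips; frame_feats: (1, T, D) target frames
        user_vec, _ = self.attn(self.pool_query, clip_feats, clip_feats)       # (1, 1, D)
        scores = F.cosine_similarity(frame_feats, user_vec.expand_as(frame_feats), dim=-1)
        return scores        # (1, T): higher => more likely a user-specific highlight

# Toy usage: 6 preferred clips, 20 target frames, 256-dim features.
model = UserHighlightScorer()
print(model(torch.randn(1, 6, 256), torch.randn(1, 20, 256)).shape)  # torch.Size([1, 20])
```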

Prototype-based Selective Knowledge Distillation for Zero-Shot Sketch Based Image Retrieval

  • Kai Wang
  • Yifan Wang
  • Xing Xu
  • Xin Liu
  • Weihua Ou
  • Huimin Lu

Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is an emerging research task that aims to retrieve data of new classes across sketches and images. It is challenging due to the heterogeneous distributions and the inconsistent semantics across seen and unseen classes of the cross-modal data of sketches and images. To realize knowledge transfer, the latest approaches introduce knowledge distillation, which optimizes the student network through the teacher signal distilled from the teacher network pre-trained on large-scale datasets. However, these methods often ignore the mispredictions of the teacher signal, which may make the model vulnerable when disturbed by the wrong output of the teacher network. To tackle the above issues, we propose a novel method termed Prototype-based Selective Knowledge Distillation (PSKD) for ZS-SBIR. Our PSKD method first learns a set of prototypes to represent categories and then utilizes an instance-level adaptive learning strategy to strengthen semantic relations between categories. Afterwards, a correlation matrix targeted for the downstream task is established through the prototypes. With the learned correlation matrix, the teacher signal given by transformers pre-trained on ImageNet and fine-tuned on the downstream dataset, can be reconstructed to weaken the impact of mispredictions and selectively distill knowledge on the student network. Extensive experiments conducted on three widely-used datasets demonstrate that the proposed PSKD method establishes the new state-of-the-art performance on all datasets for ZS-SBIR.

ARRA: Absolute-Relative Ranking Attack against Image Retrieval

  • Siyuan Li
  • Xing Xu
  • Zailei Zhou
  • Yang Yang
  • Guoqing Wang
  • Heng Tao Shen

With the extensive application of deep learning, adversarial attacks, especially query-based attacks, receive more attention than ever before. However, the scenarios assumed by existing query-based attacks against image retrieval are usually too simple to satisfy the attack demand. In this paper, we propose a novel method termed Absolute-Relative Ranking Attack (ARRA) that considers a more practical attack scenario. Specifically, we propose two compatible goals for the query-based attack, i.e., absolute ranking attack and relative ranking attack, which aim to assign specific ranks to chosen candidates and to change the relative order of chosen candidates in the retrieval list, respectively. We further devise the Absolute Ranking Loss (ARL) and Relative Ranking Loss (RRL) for the above goals, implement our ARRA by minimizing their combination with black-box optimizers, and evaluate the attack performance by attack success rate and normalized ranking correlation. Extensive experiments conducted on the widely-used SOP and CUB-200 datasets demonstrate the superiority of the proposed approach over the baselines. Moreover, the attack result on a real-world image retrieval system, i.e., Huawei Cloud Image Search, also proves the practicability of our ARRA approach.
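
Since the attack is query-based, the combined ranking objective must be minimized without gradients. The sketch below shows a generic random-search black-box loop of the sort such attacks rely on; the loss callable stands in for a weighted combination of ARL and RRL, whose exact forms are not given here and are therefore treated as opaque.

```python
import torch

def black_box_rank_attack(image, loss_fn, steps=100, eps=8 / 255, sigma=0.01):
    """Generic random-search black-box attack loop (illustration only): perturb
    the query image within an L-inf ball and keep changes that reduce a
    combined ranking objective.

    image:   (C, H, W) tensor in [0, 1]
    loss_fn: callable mapping an image to a scalar attack loss; in an actual
             attack it would query the retrieval system and combine ARL + RRL.
    """
    adv = image.clone()
    best = loss_fn(adv)
    for _ in range(steps):
        candidate = adv + sigma * torch.randn_like(adv)
        candidate = image + (candidate - image).clamp(-eps, eps)   # stay in the L-inf ball
        candidate = candidate.clamp(0, 1)                          # stay a valid image
        loss = loss_fn(candidate)
        if loss < best:                                            # greedy acceptance
            adv, best = candidate, loss
    return adv

# Toy usage with a stand-in objective (a real attack would query the retrieval system).
img = torch.rand(3, 32, 32)
print(black_box_rank_attack(img, lambda x: x.mean(), steps=10).shape)
```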

Invariant Representation Learning for Multimedia Recommendation

  • Xiaoyu Du
  • Zike Wu
  • Fuli Feng
  • Xiangnan He
  • Jinhui Tang

Multimedia recommendation forms a personalized ranking task with multimedia content representations which are mostly extracted via generic encoders. However, the generic representations introduce spurious correlations --- the meaningless correlation from the recommendation perspective. For example, suppose a user bought two dresses on the same model, this co-occurrence would produce a correlation between the model and purchases, but the correlation is spurious from the view of fashion recommendation. Existing work alleviates this issue by customizing preference-aware representations, requiring high-cost analysis and design.

In this paper, we propose an Invariant Representation Learning Framework (InvRL) to alleviate the impact of the spurious correlations. We utilize environments to reflect the spurious correlations and determine each environment with a set of interactions. We then learn invariant representations --- the inherent factors attracting user attention --- to make a consistent prediction of user-item interaction across various environments. In this light, InvRL proposes two iteratively executed modules to cluster user-item interactions and learn invariant representations. With them, InvRL trains a final recommender model thus mitigating the spurious correlations. We demonstrate InvRL on a cutting-edge recommender model UltraGCN and conduct extensive experiments on three public multimedia recommendation datasets, Movielens, Tiktok, and Kwai. The experimental results validate the rationality and effectiveness of InvRL. Codes are released at https://github.com/nickwzk/InvRL.

Early-Learning regularized Contrastive Learning for Cross-Modal Retrieval with Noisy Labels

  • Tianyuan Xu
  • Xueliang Liu
  • Zhen Huang
  • Dan Guo
  • Richang Hong
  • Meng Wang

Cross-modal retrieval receives intensive attention for enabling flexible queries between different modalities. However, in practice it is challenging to retrieve cross-modal content with noisy labels. The latest research on machine learning shows that a model tends to fit cleanly labeled data at the early learning stage and then memorize the data with noisy labels. Although the clustering strategy in cross-modal retrieval can be utilized to alleviate outliers, the networks will rapidly overfit after the clean data is fitted well, and the noisy labels begin to force the cluster centers to drift. Motivated by these fundamental phenomena, we propose an Early Learning regularized Contrastive Learning method for Cross Modal Retrieval with Noisy Labels (ELRCMR). In this solution, we propose to project the multi-modal data into a shared feature space by contrastive learning, in which early learning regularization is employed to prevent the memorization of noisy labels when training the model, and a dynamic weight balance strategy is employed to alleviate clustering drift. We evaluated the method with extensive experiments, and the results show that the proposed method can resolve the cluster drift of conventional solutions and achieves promising performance on widely used benchmark datasets.

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

  • Yiwei Ma
  • Guohai Xu
  • Xiaoshuai Sun
  • Ming Yan
  • Ji Zhang
  • Rongrong Ji

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrasts, cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices to an instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1).
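
The aggregation of a fine-grained similarity matrix into an instance-level score via attention can be sketched as below; the two-stage softmax pooling and the temperature are illustrative assumptions rather than the exact AOSM design.

```python
import torch

def attention_over_similarity_matrix(sim, tau=0.01):
    """Aggregate a fine-grained similarity matrix into a single video-text score
    by softmax-attention pooling over both axes, so unimportant frames and
    words contribute little (illustrative version only).

    sim: (F, W) similarities between F video frames and W caption words.
    """
    word_attn = torch.softmax(sim / tau, dim=1)             # (F, W) attention over words
    frame_scores = (word_attn * sim).sum(dim=1)             # (F,)   per-frame pooled score
    frame_attn = torch.softmax(frame_scores / tau, dim=0)   # (F,)   attention over frames
    return (frame_attn * frame_scores).sum()                # scalar instance-level score

# Toy usage: 12 frames vs. 7 words.
print(attention_over_similarity_matrix(torch.randn(12, 7)))
```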

Mixed Supervision for Instance Learning in Object Detection with Few-shot Annotation

  • Yi Zhong
  • Chengyao Wang
  • Shiyong Li
  • Zhu Zhou
  • Yaowei Wang
  • Wei-Shi Zheng

Mixed supervision for object detection (MSOD), which utilizes image-level annotations and a small amount of instance-level annotations, has emerged as an efficient tool by alleviating the requirement for a large amount of costly instance-level annotations and providing effective instance supervision beyond previous methods that only use image-level annotations. In this work, we introduce mixed supervision instance learning (MSIL) as a novel MSOD framework that leverages a handful of instance-level annotations to provide more explicit and implicit supervision. Rather than just adding instance-level annotations directly to loss functions for detection, we aim to dig out more effective explicit and implicit relations between these two different levels of annotations. In particular, we first propose the Instance-Annotation Guided Image Classification strategy to provide explicit guidance from instance-level annotations by using positional relations to force the image classifier to focus on proposals that contain the correct object. Then, in order to exploit more implicit interaction between the mixed annotations, an instance reproduction strategy guided by the extra instance-level annotations is developed for generating more accurate pseudo ground truth, achieving a more discriminative detector. Finally, a false target instance mining strategy is used to refine the above processing by enriching the number and diversity of training instances with the position and score information. Our experiments show that the proposed MSIL framework outperforms recent state-of-the-art mixed supervised detectors by a large margin on both the Pascal VOC2007 and MS-COCO datasets.

Improved Deep Unsupervised Hashing via Prototypical Learning

  • Zeyu Ma
  • Wei Ju
  • Xiao Luo
  • Chong Chen
  • Xian-Sheng Hua
  • Guangming Lu

Hashing has become increasingly popular for approximate nearest neighbor search in recent years due to its storage and computational efficiency. While deep unsupervised hashing has shown encouraging progress recently, its efficacy in realistic unsupervised settings is far from satisfactory due to two limitations. On the one hand, existing methods usually neglect the underlying global semantic structure in the deep feature space. On the other hand, they also ignore reconstructing the global structure in the hash code space. In this research, we develop a simple yet effective approach to deep unsupervised hashing via prototypical learning. Specifically, it introduces both feature prototypes and hashing prototypes to model the underlying semantic structures of the images in both the deep feature space and the hash code space. We then impose a smoothness constraint to regularize the consistency of the global structures in the two spaces through our semantic prototypical consistency learning. Moreover, our method encourages prototypical consistency across different augmentations of each image via contrastive prototypical consistency learning. Comprehensive experiments on three benchmark datasets demonstrate that our proposed method performs better than a variety of state-of-the-art retrieval methods.

Adaptive Camera Margin for Mask-guided Domain Adaptive Person Re-identification

  • Rui Wang
  • Feng Chen
  • Jun Tang
  • Pu Yan

Research on transferring a person re-identification (ReID) model learned in a source domain to other domains is of great importance, since deploying a ReID model to a new scenario is common in practical applications. Most existing unsupervised domain adaptation methods for person ReID follow the framework of pre-training in the source domain followed by clustering and fine-tuning in the target domain. However, how to reduce intra-domain variations and narrow inter-domain gaps is far from solved and remains a challenging problem under this framework. In this paper, we address these issues from two aspects. First, a voted-mask guided image channel shuffling strategy for data augmentation is proposed to enhance visual diversity, where image channel shuffling serves as an efficient tool to bridge the inter-domain gap, and voted masks are employed to extract the foregrounds of pedestrian images to relieve the negative effects of varied backgrounds, thereby reducing intra-domain variations. Second, a novel plug-and-play metric named adaptive camera margin is proposed to fully exploit low-cost camera tags for producing high-quality pseudo labels, which can significantly reduce intra-domain variations without extra training cost. Specifically, the proposed network consists of a sensitive branch and an adaptive branch accompanied by our data augmentation strategy, which are embedded into a joint learning framework to decouple visual representations and better capture transferable features across different domains in both stages. The adaptive camera margin is employed to pull samples with different camera IDs closer during DBSCAN clustering, which effectively and efficiently reduces the influence of intra-domain variations caused by camera shift. Comprehensive experiments show that the proposed method achieves competitive performance compared with state-of-the-art methods on the benchmark datasets. Source code will be released at: https://github.com/ahuwangrui/MACM.
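As a rough illustration of how a camera-aware margin can interact with DBSCAN-based pseudo-labeling, the sketch below subtracts a margin from the pairwise distances of samples captured by different cameras before clustering. The fixed margin value, the plain cosine distance, and the sklearn clustering call are simplifying assumptions; the paper's margin is adaptive and embedded in its own pipeline.

```python
# Sketch of camera-aware distance adjustment before DBSCAN pseudo-labeling.
# The fixed margin and plain cosine distance are simplifying assumptions; the
# paper's adaptive camera margin is integrated into its own clustering pipeline.
import numpy as np
from sklearn.cluster import DBSCAN

def camera_margin_dbscan(features, cam_ids, margin=0.05, eps=0.5, min_samples=4):
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = np.maximum(1.0 - feats @ feats.T, 0.0)     # cosine distance matrix
    cross_cam = cam_ids[:, None] != cam_ids[None, :]  # pairs from different cameras
    dist = np.where(cross_cam, np.maximum(dist - margin, 0.0), dist)
    np.fill_diagonal(dist, 0.0)
    return DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(dist)

feats = np.random.randn(200, 128).astype(np.float32)
cams = np.random.randint(0, 6, size=200)
print(camera_margin_dbscan(feats, cams)[:20])   # -1 marks outliers
```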

BadHash: Invisible Backdoor Attacks against Deep Hashing with Clean Label

  • Shengshan Hu
  • Ziqi Zhou
  • Yechao Zhang
  • Leo Yu Zhang
  • Yifeng Zheng
  • Yuanyuan He
  • Hai Jin

Due to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive works have demonstrated that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attacks against deep hashing has attracted much research effort. Nevertheless, backdoor attacks, another well-known threat to DNNs, have not yet been studied for deep hashing. Although various backdoor attacks have been proposed in the field of image classification, existing approaches fail to realize a truly imperceptible backdoor attack that enjoys invisible triggers and a clean-label setting simultaneously, and they cannot meet the intrinsic demands of backdoor attacks on image retrieval.

In this paper, we propose BadHash, the first imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean labels. We first propose a new conditional generative adversarial network (cGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. To improve attack effectiveness, we introduce a label-based contrastive learning network, LabCLN, to exploit the semantic characteristics of different labels, which are subsequently used to confuse and mislead the target model into learning the embedded trigger. We finally explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that BadHash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes.

EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation

  • Xiaohao Liu
  • Zhulin Tao
  • Jiahong Shao
  • Lifang Yang
  • Xianglin Huang

The main idea of multimedia recommendation is to introduce the profile content of multimedia documents as auxiliary information, so as to endow recommenders with generalization ability and better performance. However, recent studies using non-uniform datasets roughly fuse single-modal features into multi-modal features and adopt the strategy of directly maximizing the likelihood of user preference scores, leading to single-modal bias. Owing to this architectural defect, there is still room for improvement in recent multimedia recommendation.

In this paper, we propose EliMRec, a generic and modality-agnostic framework to eliminate single-modal bias in multimedia recommendation. From our observation, biased predictive reasoning is directly influenced by a single modality rather than by all the given views of an item. Through the novel perspective of causal inference, we explain the single-modal issue and examine the inner workings of multi-modal fusion. To eliminate single-modal bias, we enhance the bias-capture ability of a general multimedia recommendation framework and imagine several counterfactual worlds in which one modality varies while the other modalities are fixed or blank. Counterfactual analysis enables us to identify and eliminate the bias lying in the direct effect from single-modal features to the preference score. Extensive experiments on real-world datasets demonstrate that our method significantly improves over several state-of-the-art baselines such as LightGCN and MMGCN. Code is available at https://github.com/Xiaohao-Liu/EliMRec.

Patch-based Knowledge Distillation for Lifelong Person Re-Identification

  • Zhicheng Sun
  • Yadong MU

The task of lifelong person re-identification aims to match a person across multiple cameras given continuous data streams. Like other lifelong learning tasks, it severely suffers from the so-called catastrophic forgetting problem, i.e., the notable performance degradation on previously-seen data after adapting the model to newly incoming data. To alleviate it, a few existing methods have utilized knowledge distillation to enforce consistency between the original and adapted models. However, the effectiveness of such a strategy can be largely reduced by the data distribution discrepancy between seen and new data. The hallmark of our work is using adaptively-chosen patches (rather than whole images, as in other works) to pilot the forgetting-resistant distillation. Specifically, the technical contributions of our patch-based solution are two-fold. First, a novel patch sampler is proposed; it is fully differentiable and trained to select a diverse set of image patches that stay crucial and discriminative under streaming data. Second, with those patches we curate a novel knowledge distillation framework. Valuable patch-level knowledge in individual patch features and their mutual relations is preserved by two newly introduced distillation modules, further mitigating catastrophic forgetting. Extensive experiments on twelve person re-identification datasets clearly validate the superiority of our method over state-of-the-art competitors by large performance margins.
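A minimal sketch of patch-level distillation is given below: the adapted model is encouraged to preserve both the individual embeddings of sampled patches and their pairwise relations as produced by the frozen old model. The cosine-similarity relation matrix and the loss weighting are illustrative assumptions rather than the paper's exact distillation modules.

```python
# Minimal sketch of patch-level distillation: preserve each patch embedding and the
# pairwise relations among patches between the frozen old model and the adapted one.
# The relation matrices and the loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def patch_distillation_loss(old_patch_feats, new_patch_feats, rel_weight=1.0):
    """Both inputs: (num_patches, dim) features of the same sampled patches."""
    feat_loss = F.mse_loss(new_patch_feats, old_patch_feats.detach())
    old_rel = F.normalize(old_patch_feats, dim=1) @ F.normalize(old_patch_feats, dim=1).t()
    new_rel = F.normalize(new_patch_feats, dim=1) @ F.normalize(new_patch_feats, dim=1).t()
    rel_loss = F.mse_loss(new_rel, old_rel.detach())
    return feat_loss + rel_weight * rel_loss

old = torch.randn(16, 256)                     # patch features from the frozen model
new = old + 0.1 * torch.randn(16, 256)         # patch features from the adapted model
print(patch_distillation_loss(old, new).item())
```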

SESSION: Oral Session III: Engaging User with Multimedia -- Summarization, Analytics, and Storytelling

MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition

  • Xiaodong Chen
  • Wu Liu
  • Xinchen Liu
  • Yongdong Zhang
  • Jungong Han
  • Tao Mei

Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to wide applications such as autonomous driving and robotics. However, current methods for point cloud action recognition usually require a huge amount of data with manual annotations and a complex backbone network with high computation cost, which makes them impractical for real-world applications. Therefore, this paper considers the task of semi-supervised point cloud action recognition. We propose a Masked Pseudo-Labeling autoEncoder (MAPLE) framework to learn effective representations with far fewer annotations for point cloud action recognition. In particular, we design a novel and efficient Decoupled spatial-temporal TransFormer (DestFormer) as the backbone of MAPLE. In DestFormer, the spatial and temporal dimensions of the 4D point cloud videos are decoupled to achieve efficient self-attention for learning both long-term and short-term features. Moreover, to learn discriminative features from fewer annotations, we design a masked pseudo-labeling autoencoder structure that guides the DestFormer to reconstruct the features of masked frames from the available frames. More importantly, for unlabeled data, we exploit the pseudo-labels from the classification head as the supervision signal for the reconstruction of features from the masked frames. Finally, comprehensive experiments demonstrate that MAPLE achieves superior results on three public benchmarks and outperforms the state-of-the-art method by 8.08% accuracy on the MSR-Action3D dataset.
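The sketch below illustrates the general masked-autoencoding-with-pseudo-labels idea in a simplified form: frame features are partially masked, a transformer encoder reconstructs the masked features from the visible ones, and for unlabeled clips the classification head's own predictions serve as pseudo supervision. Module sizes, the masking ratio, and the way the two losses are combined are assumptions and do not reproduce DestFormer or the paper's exact training scheme.

```python
# Simplified masked pseudo-labeling sketch (not the MAPLE/DestFormer architecture):
# reconstruct masked frame features from visible ones and, for unlabeled clips,
# supervise the classifier with its own confident predictions as pseudo-labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFrameAutoencoder(nn.Module):
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, frame_feats, mask):
        """frame_feats: (B, T, D); mask: (B, T) bool, True = masked frame."""
        x = torch.where(mask.unsqueeze(-1), self.mask_token, frame_feats)
        h = self.encoder(x)
        recon = self.decoder(h)                       # reconstructed frame features
        logits = self.cls_head(h.mean(dim=1))         # clip-level class logits
        recon_loss = F.mse_loss(recon[mask], frame_feats[mask])
        return logits, recon_loss

model = MaskedFrameAutoencoder()
feats = torch.randn(2, 24, 256)                       # toy per-frame features
mask = torch.rand(2, 24) < 0.5                        # illustrative masking ratio
logits, recon_loss = model(feats, mask)
pseudo = logits.argmax(dim=1)                         # pseudo-labels for unlabeled clips
loss = recon_loss + F.cross_entropy(logits, pseudo.detach())
print(loss.item())
```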

DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing

  • Xun Jiang
  • Xing Xu
  • Zhiguo Chen
  • Jingran Zhang
  • Jingkuan Song
  • Fumin Shen
  • Huimin Lu
  • Heng Tao Shen

The Weakly-Supervised Audio-Visual Video Parsing (AVVP) task aims to parse a video into temporal segments and predict their event categories in terms of modalities, labeling them as either audible, visible, or both. Since temporal boundary and modality annotations are not provided and only video-level event labels are available, this task is more challenging than conventional video understanding tasks. Most previous works attempt to analyze videos by jointly modeling the audio and video data and then learning information from segment-level features with fixed lengths. However, such a design has two defects: 1) the varied semantic information hidden in different temporal lengths is neglected, which may lead the models to learn incorrect information; 2) due to the joint context modeling, the unique features of different modalities are not fully explored. In this paper, we propose a novel AVVP framework termed Dual Hierarchical Hybrid Network (DHHN) to tackle the above two problems. Our DHHN method consists of three components: 1) a hierarchical context modeling network for extracting different semantics at multiple temporal lengths; 2) a modality-wise guiding network for learning unique information from different modalities; 3) a dual-stream framework generating audio and visual predictions separately. It maintains the best adaptation to each modality, further boosting video parsing performance. Extensive quantitative and qualitative experiments demonstrate that our proposed method establishes new state-of-the-art performance on the AVVP task.

SESSION: Poster Session III: Engaging User with Multimedia -- Summarization, Analytics, and Storytelling

Weakly-Supervised Temporal Action Alignment Driven by Unbalanced Spectral Fused Gromov-Wasserstein Distance

  • Dixin Luo
  • Yutong Wang
  • Angxiao Yue
  • Hongteng Xu

Temporal action alignment aims at segmenting videos into clips and tagging each clip with a textual description, which is an important task in video semantic analysis. Most existing methods, however, rely on supervised learning to train their alignment models, whose application is limited by the common insufficiency of labeled videos. To mitigate this issue, we propose a weakly-supervised temporal action alignment method based on a novel computational optimal transport technique called the unbalanced spectral fused Gromov-Wasserstein (US-FGW) distance. Instead of using videos with known clips and corresponding textual tags, our method only needs each training video to be associated with a set of (unsorted) texts and does not require fine-grained correspondence between frames and texts. Given such weakly-supervised video-text pairs, our method trains the representation models of video frames and texts jointly in a probabilistic or deterministic autoencoding architecture and penalizes the US-FGW distance between the distribution of visual latent codes and that of textual latent codes. We compute the US-FGW distance efficiently by leveraging the Bregman ADMM algorithm. Furthermore, we generalize the classic contrastive learning framework and reformulate it based on the proposed US-FGW distance, which provides a new viewpoint of contrastive learning for our problem. Experimental results show that our method and its variants outperform state-of-the-art weakly-supervised temporal action alignment methods, with results that are even comparable to those of supervised learning methods on some evaluation measurements. The code is available at https://github.com/hhhh1138/Temporal-Action-Alignment-USFGW.

A Knowledge Augmented and Multimodal-Based Framework for Video Summarization

  • Jiehang Xie
  • Xuanbai Chen
  • Shao-Ping Lu
  • Yulu Yang

Video summarization aims to generate a compact version of a lengthy video that retains its primary content. In general, humans are good at producing a high-quality video summary because they acquire crucial content through multiple dimensions of information and possess abundant background knowledge about the original video. However, existing methods rarely consider multichannel information and ignore the impact of external knowledge, resulting in the limited quality of the generated summaries. This paper proposes a knowledge-augmented and multimodal-based video summarization method, termed KAMV, to address this problem. Specifically, we design a knowledge encoder with a hybrid method consisting of generation and retrieval to capture descriptive content and latent connections between events and entities based on an external knowledge base, which can provide rich implicit knowledge for better comprehending the video. Furthermore, to explore the interactions among visual, audio, and implicit knowledge and to emphasize the content most relevant to the desired summary, we present a fusion module supervised by this multimodal information. Extensive experiments on four public datasets demonstrate the superior performance of the proposed KAMV compared to state-of-the-art video summarization approaches.

MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer

  • Dizhan Xue
  • Shengsheng Qian
  • Quan Fang
  • Changsheng Xu

As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task of generating a story ending for a given multi-sentence story plot and an ending-related image. Unlike existing image captioning tasks or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. To date, existing methods for IgSEG ignore the relationships between the multimodal information and do not integrate multimodal features appropriately. Therefore, in this work, we propose Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses both contextual and visual information to effectively capture the multimodal dependency for IgSEG. Firstly, we extract textual and visual features separately by employing modality-specific large-scale pretrained encoders. Secondly, we utilize the memory-augmented cross-modal attention network to learn cross-modal relationships and conduct the fine-grained feature fusion effectively. Finally, a multimodal transformer decoder constructs attention among multimodal features to learn the story dependency and generates informative, reasonable, and coherent story endings. In experiments, extensive automatic evaluation results and human evaluation results indicate the significant performance boost of our proposed MMT over state-of-the-art methods on two benchmark datasets.

An End-to-End Conditional Generative Adversarial Network Based on Depth Map for 3D Craniofacial Reconstruction

  • Niankai Zhang
  • Junli Zhao
  • Fuqing Duan
  • Zhenkuan Pan
  • Zhongke Wu
  • Mingquan Zhou
  • Xianfeng Gu

Craniofacial reconstruction is fundamental to resolving forensic cases. It is rather challenging due to the complex topology of the craniofacial model and the ambiguous relationship between a skull and the corresponding face. In this paper, we propose a novel approach to 3D craniofacial reconstruction that utilizes Conditional Generative Adversarial Networks (CGAN) based on craniofacial depth maps. More specifically, we treat craniofacial reconstruction as a mapping problem from skull to face. We represent 3D craniofacial shapes with depth maps, which include most craniofacial features needed for identification purposes and are easy to generate and feed to neural networks. We design an end-to-end neural network model based on CGAN and train it with paired craniofacial data to automatically learn the complex nonlinear relationship between skull and face. By introducing body mass index classes (BMIC) into the CGAN, we can realize objective reconstruction of 3D facial geometry from a skull, which is a complicated 3D shape generation task involving different topologies. Comparative experiments show that our method produces accurate and realistic craniofacial reconstruction results.

Clustering Generative Adversarial Networks for Story Visualization

  • Bowen Li
  • Philip H. S. Torr
  • Thomas Lukasiewicz

Story visualization aims to generate a series of images, one per sentence, semantically matching a given sequence of sentences, and the output images within a story should be consistent with each other. Current methods generate story images using a heavy architecture with two generative adversarial networks (GANs), one for image quality and one for story consistency, and also rely on additional segmentation masks or auxiliary captioning networks. In this paper, we aim to build a concise, single-GAN-based network that depends on neither additional semantic information nor captioning networks. To achieve this, we propose a contrastive-learning- and clustering-learning-based approach for story visualization. Our network utilizes contrastive losses between language and visual information to maximize the mutual information between them, and further extends this with clustering learning during training to capture semantic similarity across modalities. Thus, the discriminator in our approach provides comprehensive feedback to the generator regarding both image quality and story consistency at the same time, allowing a single-GAN-based network to produce high-quality synthetic results. Extensive experiments on two datasets demonstrate that our single-GAN-based network has fewer total parameters but achieves a major step up from previous methods: it improves FID from 78.64 to 39.17 and FSD from 94.53 to 41.18 on Pororo-SV, and establishes a strong benchmark FID of 76.51 and FSD of 19.74 on Abstract Scenes.

DeViT: Deformed Vision Transformers in Video Inpainting

  • Jiayin Cai
  • Changlin Li
  • Xin Tao
  • Chun Yuan
  • Yu-Wing Tai

This paper presents a novel video inpainting architecture named Deformed Vision Transformers (DeViT). We make three significant contributions to this task. First, we extend previous Transformers with patch alignment by introducing the Deformed Patch-based Homography Estimator (DePtH), which enriches patch-level feature alignment in key and query with additional offsets learned from patch pairs without extra supervision. DePtH enables our method to handle challenging scenes or agile motion with in-plane or out-of-plane deformation, where previous methods usually fail. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve standard patch-wise feature matching by pruning out less essential features and considering the saliency map. MPPA enhances matching accuracy between warped tokens with invalid pixels. Third, we introduce the Spatial-Temporal weighting Adaptor (STA) module to assign more accurate attention to spatial-temporal tokens under the guidance of the deformation factor learned from DePtH, especially for videos with agile motion. Experimental results demonstrate that our method outperforms previous methods both qualitatively and quantitatively and achieves a new state of the art for video inpainting.

Multi-Level Spatiotemporal Network for Video Summarization

  • Ming Yao
  • Yu Bai
  • Wei Du
  • Xuejun Zhang
  • Heng Quan
  • Fuli Cai
  • Hongwei Kang

With the increasing ubiquity of camera-equipped devices, video content is widely produced in industry. Automatic video summarization allows content consumers to effectively retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. As a natural phenomenon, video fragments in different shots are richer in semantics than individual frames. We leverage this as a free latent supervision signal and introduce a novel model named multi-level spatiotemporal network (MLSN). Our approach contains Multi-Level Feature Representations (MLFR) and a Local Relative Loss (LRL). The MLFR module consists of frame-level features, fragment-level features, and shot-level features with relative position encoding. For videos with shots of different durations, it can flexibly capture and accommodate semantic information at different spatiotemporal granularities. LRL utilizes the partial ordering relations among frames of each fragment to capture highly discriminative features and improve the sensitivity of the model. Our method improves on the best previously published method by 7% on our industrial products dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our method outperforms most state-of-the-art ones.

SESSION: Oral Session IV: Experience -- Interactions and Quality of Experience

TVFormer: Trajectory-guided Visual Quality Assessment on 360° Images with Transformers

  • Li Yang
  • Mai Xu
  • Tie Liu
  • Liangyu Huo
  • Xinbo Gao

Visual quality assessment (VQA) of 360° images plays an important role in optimizing immersive multimedia systems. Due to the absence of pristine 360° images in the real world, blind VQA (BVQA) of 360° images has drawn much research attention. In subjective VQA of 360° images, humans intuitively make quality-scoring decisions based on the quality degradation of each observed viewport along their head trajectories. Unfortunately, existing BVQA works for 360° images neglect the dynamic property of head trajectories with viewport interactions, thus failing to obtain human-like quality scores. In this paper, we propose a novel Transformer-based approach for trajectory-guided VQA of 360° images (named TVFormer), in which both head trajectory prediction and BVQA are accomplished for 360° images. For the first task, we develop a trajectory-aware memory updater (TMU) module to maintain the coherence and accuracy of predicted head trajectories. To capture long-range quality dependencies across time-ordered viewports, we propose a spatio-temporal factorized self-attention (STF) module in the encoder of TVFormer for the BVQA task. By implanting the predicted head trajectories into the BVQA task, we obtain human-like quality scores. Extensive experiments demonstrate the superior BVQA performance of TVFormer over state-of-the-art approaches on three benchmark datasets.
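A minimal sketch of factorized spatio-temporal self-attention over viewport tokens is shown below: attention is computed across viewports within each time step, then across time steps within each viewport track, instead of jointly over all tokens. Dimensions, head counts, and the omission of residual connections and normalization are simplifications of what a module like STF would contain.

```python
# Minimal sketch of factorized spatio-temporal self-attention over viewport tokens:
# attend within each time step across viewports (spatial), then within each viewport
# track across time (temporal). Sizes and head counts are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        """x: (B, T, V, D) tokens for T time steps and V viewports."""
        B, T, V, D = x.shape
        s = x.reshape(B * T, V, D)                      # spatial attention per time step
        s, _ = self.spatial(s, s, s)
        s = s.reshape(B, T, V, D)
        t = s.permute(0, 2, 1, 3).reshape(B * V, T, D)  # temporal attention per viewport
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, V, T, D).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 6, 128)
print(FactorizedSTAttention()(x).shape)   # torch.Size([2, 8, 6, 128])
```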

KnifeCut: Refining Thin Part Segmentation with Cutting Lines

  • Zheng Lin
  • Zheng-Peng Duan
  • Zhao Zhang
  • Chun-Le Guo
  • Ming-Ming Cheng

Objects with thin structures remain challenging for current image segmentation techniques. Their outputs are often accurate on the main body but unsatisfactory on thin parts. In practical use, they inevitably need post-processing. However, repairing them is time-consuming and laborious, whether in professional editing applications (e.g., Photoshop) or with current interactive image segmentation methods (e.g., clicks, scribbles, and polygons). To refine the thin parts of an unsatisfactory pre-segmentation, we propose an efficient interaction mode in which users only need to draw a line across the mislabeled thin part, like cutting with a knife. This low-stress and intuitive action does not require the user to aim deliberately and is friendly to mouse, touchpad, and mobile devices. Additionally, the line segment provides a contrasting prior, because it passes through both foreground and background regions and must contain thin-part pixels. Based on this interaction idea, we propose KnifeCut, which offers users two results: one focuses only on the target thin part, while the other refines all thin parts that share similar features with the target. To the best of our knowledge, KnifeCut is the first method to address interactive thin structure refinement specifically. Extensive experiments and visualized results further demonstrate its friendliness, convenience, and effectiveness. The project page is available at http://mmcheng.net/knifecut/.

Multi-view Layout Design for VR Concert Experience

  • Minju Kim
  • Yuhyun Lee
  • Jungjin Lee

Owing to the COVID-19 pandemic, concerts are increasingly being held online. Beyond live-streaming, it has recently become popular to utilize various immersive video technologies to add entertainment value and immersion to online concerts. We conducted a multi-view layout design study in a virtual reality environment with a head-mounted display to help users effectively explore and immerse themselves in multiple videos from various angles. Based on an analysis of existing user interfaces for multi-view navigation and the characteristics of virtual reality, we propose four layouts: 1) an evenly divided space, 2) an evenly divided designated space, 3) a widget type, and 4) an avatar type. We implemented a prototype using Korean pop concerts, where multi-view videos are most actively utilized, and then conducted a user study to evaluate the usability and preferences of the proposed layouts. The results show that it is adequate to arrange the multi-view videos within a 60° to 110° span to the left and right of the main view, a range that users can comfortably access. In addition, when placing multiple videos in a designated space, it is helpful to use visual effects or simple avatars to avoid placing a visual burden on the users.

Magic ELF: Image Deraining Meets Association Learning and Transformer

  • Kui Jiang
  • Zhongyuan Wang
  • Chen Chen
  • Zheng Wang
  • Laizhong Cui
  • Chia-Wen Lin

Convolutional neural networks (CNNs) and Transformers have achieved great success in multimedia applications. However, little effort has been made to effectively and efficiently harmonize these two architectures for image deraining. This paper aims to unify the two architectures to take advantage of their respective learning merits for image deraining. In particular, the local connectivity and translation equivariance of CNNs and the global aggregation ability of self-attention (SA) in Transformers are fully exploited for local context and global structure representations, respectively. Based on the observation that the rain distribution reveals the degradation location and degree, we introduce a degradation prior to help background recovery and accordingly present an association refinement deraining scheme. A novel multi-input attention module (MAM) is proposed to associate rain perturbation removal and background recovery. Moreover, we equip our model with effective depth-wise separable convolutions to learn specific feature representations and trade off computational complexity. Extensive experiments show that our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average, while requiring only 11.7% of its computational cost and 42.1% of its parameters.

Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality Assessment

  • Liang Liao
  • Kangmin Xu
  • Haoning Wu
  • Chaofeng Chen
  • Wenxiu Sun
  • Qiong Yan
  • Weisi Lin

With the rapid growth of in-the-wild videos taken by non-specialists, blind video quality assessment (VQA) has become a challenging and demanding problem. Although much effort has been made to solve this problem, it remains unclear how the human visual system (HVS) relates to the temporal quality of videos. Meanwhile, recent work has found that frames of natural video, when transformed into the perceptual domain of the HVS, tend to form a straight trajectory of representations. With the insight that distortion impairs perceived video quality and results in a curved trajectory of the perceptual representation, we propose a temporal perceptual quality index (TPQI) that measures temporal distortion by describing the geometric morphology of the representation. Specifically, we first extract video perceptual representations from the lateral geniculate nucleus (LGN) and primary visual area (V1) of the HVS, and then measure the straightness and compactness of their trajectories to quantify the degradation in naturalness and content continuity of the video. Experiments show that the perceptual representation in the HVS is an effective way of predicting subjective temporal quality, and thus TPQI can, for the first time, achieve comparable performance to spatial quality metrics and be even more effective in assessing videos with large temporal variations. We further demonstrate that, by combining with NIQE, a spatial quality metric, TPQI can achieve top performance over popular in-the-wild video datasets. More importantly, TPQI does not require any additional information beyond the video being evaluated and thus can be applied to any dataset without parameter tuning. Source code is available at https://github.com/UoLMM/TPQI-VQA.
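The kind of trajectory geometry being measured can be illustrated with a simple numerical proxy: straightness as the average cosine between consecutive steps of the representation sequence, and compactness as the average spread around the trajectory centroid. These are not the exact TPQI definitions, only an indication of the quantities involved.

```python
# Simple numerical proxy for the straightness and compactness of a sequence of
# per-frame perceptual representations. The exact TPQI definitions differ; this
# only illustrates the kind of trajectory geometry being measured.
import numpy as np

def trajectory_geometry(reps: np.ndarray):
    """reps: (T, D) array, one representation per frame."""
    diffs = np.diff(reps, axis=0)                              # steps along the trajectory
    norms = np.linalg.norm(diffs, axis=1) + 1e-12
    cosines = np.sum(diffs[:-1] * diffs[1:], axis=1) / (norms[:-1] * norms[1:])
    straightness = float(np.mean(cosines))                     # 1.0 = perfectly straight
    compactness = float(np.mean(np.linalg.norm(reps - reps.mean(axis=0), axis=1)))
    return straightness, compactness

smooth = np.cumsum(np.ones((30, 8)), axis=0)                   # straight-line trajectory
jittery = smooth + np.random.randn(30, 8)                      # distorted trajectory
print(trajectory_geometry(smooth), trajectory_geometry(jittery))
```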

You Only Align Once: Bidirectional Interaction for Spatial-Temporal Video Super-Resolution

  • Mengshun Hu
  • Kui Jiang
  • Zhixiang Nie
  • Zheng Wang

Spatial-Temporal Video Super-Resolution (ST-VSR) technology generates high-quality videos with higher resolution and higher frame rates. Existing advanced methods accomplish ST-VSR tasks through the association of Spatial and Temporal video super-resolution (S-VSR and T-VSR). These methods require two alignments and fusions in S-VSR and T-VSR, which is obviously redundant and fails to sufficiently explore the information flow of consecutive spatial LR frames. Although bidirectional learning (future-to-past and past-to-future) was introduced to cover all input frames, the direct fusion of final predictions fails to sufficiently exploit intrinsic correlations of bidirectional motion learning and spatial information from all frames. We propose an effective yet efficient recurrent network with bidirectional interaction for ST-VSR, where only one alignment and fusion is needed. Specifically, it first performs backward inference from future to past, and then follows forward inference to super-resolve intermediate frames. The backward and forward inferences are assigned to learn structures and details to simplify the learning task with joint optimizations. Furthermore, a Hybrid Fusion Module (HFM) is designed to aggregate and distill information to refine spatial information and reconstruct high-quality video frames. Extensive experiments on two public datasets demonstrate that our method outperforms state-of-the-art methods in efficiency, and reduces calculation cost by about 22%.

A Deep Learning based No-reference Quality Assessment Model for UGC Videos

  • Wei Sun
  • Xiongkuo Min
  • Wei Lu
  • Guangtao Zhai

Quality assessment for User Generated Content (UGC) videos plays an important role in ensuring the viewing experience of end-users. Previous UGC video quality assessment (VQA) studies either use image recognition models or image quality assessment (IQA) models to extract frame-level features of UGC videos for quality regression, which is a sub-optimal solution because of the domain shifts between these tasks and the UGC VQA task. In this paper, we propose a very simple but effective UGC VQA model, which addresses this problem by training an end-to-end spatial feature extraction network to directly learn quality-aware spatial feature representations from the raw pixels of the video frames. We also extract motion features to measure temporal distortions that the spatial features cannot model. The proposed model utilizes very sparse frames to extract spatial features and dense frames (i.e., the video chunk) at a very low spatial resolution to extract motion features, and thereby has low computational complexity. With these quality-aware features, we use a simple multilayer perceptron (MLP) network to regress them into chunk-level quality scores, and then adopt a temporal average pooling strategy to obtain the video-level quality score. We further introduce a multi-scale quality fusion strategy to handle VQA across different spatial resolutions, where the multi-scale weights are obtained from the contrast sensitivity function of the human visual system. Experimental results show that the proposed model achieves the best performance on five popular UGC VQA databases, demonstrating its effectiveness.

SESSION: Poster Session IV: Experience - Interactions and Quality of Experience

Improving Meeting Inclusiveness using Speech Interruption Analysis

  • Szu-Wei Fu
  • Yaran Fan
  • Yasaman Hosseinkashi
  • Jayant Gupchup
  • Ross Cutler

Meetings are a pervasive method of communication within all types of companies and organizations, and the use of remote collaboration systems to conduct meetings has increased dramatically since the COVID-19 pandemic. However, not all meetings are inclusive, especially in terms of participation rates among attendees. In a recent large-scale survey conducted at Microsoft, the top suggestion given by meeting participants for improving inclusiveness was to improve the ability of remote participants to interrupt and acquire the floor during meetings. We show that the use of the virtual raise hand (VRH) feature can lead to an increase in predicted meeting inclusiveness at Microsoft. One challenge is that VRH is used in less than 1% of all meetings. In order to drive adoption of its usage to improve inclusiveness (and participation), we present a machine learning-based system that predicts when a meeting participant attempts to obtain the floor but fails to interrupt (termed a 'failed interruption'). This prediction can be used to nudge the user to raise their virtual hand within the meeting. We believe this is the first failed speech interruption detector; its performance on a realistic test set has an area under the curve (AUC) of 0.95 with a true positive rate (TPR) of 50% at a false positive rate (FPR) of 1%. To our knowledge, this is also the first dataset of interruption categories (including the failed interruption category) for remote meetings. Finally, we believe this is the first such system designed to improve meeting inclusiveness through speech interruption analysis and active intervention.

Transductive Aesthetic Preference Propagation for Personalized Image Aesthetics Assessment

  • Yaohui Li
  • Yuzhe Yang
  • Huaxiong Li
  • Haoxing Chen
  • Liwu Xu
  • Leida Li
  • Yaqian Li
  • Yandong Guo

Personalized image aesthetics assessment (PIAA) aims at capturing individual aesthetic preference. Fine-tuning on personalized data has been proven effective for the PIAA task. However, a fixed fine-tuning strategy may cause under- or over-fitting on limited personal data, and it also brings additional training cost. To alleviate these issues, we employ a meta learning-based Transductive Aesthetic Preference Propagation (TAPP-PIAA) algorithm under a regression setting to replace the fine-tuning strategy. Specifically, each user's data is regarded as a meta-task and split into a support set and a query set. Then, we extract deep aesthetic features with a pre-trained generic image aesthetics assessment (GIAA) model. Next, we treat image features as graph nodes and their similarities as edge weights to construct an undirected nearest-neighbor graph for inference. Instead of fine-tuning on the support set, TAPP-PIAA propagates aesthetic preference from the support to the query set with a predefined propagation formula. Finally, to learn a generalizable aesthetic representation for various users, we optimize TAPP-PIAA across different users within a meta-learning framework. Experimental results indicate that TAPP-PIAA can surpass state-of-the-art methods on benchmark databases.
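The sketch below shows one generic way to propagate scores from support images to query images over a nearest-neighbour similarity graph, in the spirit of transductive propagation. The Gaussian kernel, the single propagation step, and the initialization of query scores are assumptions and do not reproduce the paper's propagation formula.

```python
# Hedged sketch of propagating aesthetic scores from support to query images over a
# k-nearest-neighbour similarity graph (one propagation step; not the paper's formula).
import numpy as np

def propagate_scores(support_feats, support_scores, query_feats, k=5):
    feats = np.concatenate([support_feats, query_feats], axis=0)
    d2 = np.square(feats[:, None, :] - feats[None, :, :]).sum(-1)
    sigma2 = np.median(d2) + 1e-12                  # bandwidth set from the data
    w = np.exp(-d2 / sigma2)
    np.fill_diagonal(w, 0.0)
    idx = np.argsort(-w, axis=1)[:, k:]             # indices of non-neighbours
    np.put_along_axis(w, idx, 0.0, axis=1)          # keep only k nearest neighbours
    w = np.maximum(w, w.T)                          # symmetrise the graph
    w /= w.sum(axis=1, keepdims=True) + 1e-12       # row-normalise edge weights
    init = np.full(len(query_feats), support_scores.mean())   # neutral query scores
    scores = w @ np.concatenate([support_scores, init])       # one propagation step
    return scores[len(support_feats):]

sup = np.random.randn(20, 64); sup_y = np.random.uniform(1, 5, 20)
qry = np.random.randn(10, 64)
print(propagate_scores(sup, sup_y, qry))
```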

Multi-Mode Interactive Image Segmentation

  • Zheng Lin
  • Zhao Zhang
  • Ling-Hao Han
  • Shao-Ping Lu

Large-scale pixel-level annotations are scarce for current data-hungry medical image analysis models. For the fast acquisition of annotations, an economical and efficient interactive medical image segmentation method is urgently needed. However, current techniques often fail, as their interaction styles cannot cope with the various inherent ambiguities of medical images, such as irregular shapes and fuzzy boundaries. To address this problem, we propose a multi-mode interactive segmentation framework for medical images, in which diverse interaction modes can be chosen and allowed to cooperate with each other. In our framework, users can encircle the target regions with various initial interaction modes according to structural complexity. Then, based on the initial segmentation, users can jointly utilize region and boundary interactions to refine the mislabeled regions caused by different ambiguities. We evaluate our framework on a wide range of medical images, including X-ray, CT, MRI, ultrasound, endoscopy, and photographs. Extensive experimental results and a user study show that our framework is a reliable choice for image annotation in various real scenes.

Deep-BVQM: A Deep-learning Bitstream-based Video Quality Model

  • Nasim Jamshidi Avanaki
  • Steven Schmidt
  • Thilo Michael
  • Saman Zadtootaghaj
  • Sebastian Möller

With the rapid increase of video streaming content, high-quality video quality metrics, mainly signal-based ones, are emerging, notably VMAF, SSIMPLUS, and AVQM. Besides signal-based video quality metrics, two well-known bitstream-based video quality metrics, named P.1203 and P.1204.3, have been developed within the standardization body ITU-T Study Group 12. Due to their low complexity and low level of access to the bitstream data, these models have gained attention from network providers and service providers. In this paper, we propose a new bitstream-based model named Deep-BVQM, which outperforms the standard models on the tested datasets. While the model comes with slightly higher computational complexity, it offers frame-level quality prediction, which is essential diagnostic information for some video streaming services such as cloud gaming. Deep-BVQM is developed in two layers: first, frame quality is predicted using a lightweight CNN model; next, the latent features of the CNN are used to train an LSTM network to predict the video quality over a short-term duration.

MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes

  • Anton Ratnarajah
  • Zhenyu Tang
  • Rohith Aralikatti
  • Dinesh Manocha

We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.

Quality Assessment of Image Super-Resolution: Balancing Deterministic and Statistical Fidelity

  • Wei Zhou
  • Zhou Wang

There has been a growing interest in developing image super-resolution (SR) algorithms that convert low-resolution (LR) to higher resolution images, but automatically evaluating the visual quality of super-resolved images remains a challenging problem. Here we look at the problem of SR image quality assessment (SR IQA) in a two-dimensional (2D) space of deterministic fidelity (DF) versus statistical fidelity (SF). This allows us to better understand the advantages and disadvantages of existing SR algorithms, which produce images at different clusters in the 2D space of (DF, SF). Specifically, we observe an interesting trend from more traditional SR algorithms that are typically inclined to optimize for DF while losing SF, to more recent generative adversarial network (GAN) based approaches that by contrast exhibit strong advantages in achieving high SF but sometimes appear weak at maintaining DF. Furthermore, we propose an uncertainty weighting scheme based on content-dependent sharpness and texture assessment that merges the two fidelity measures into an overall quality prediction named the Super Resolution Image Fidelity (SRIF) index, which demonstrates superior performance against state-of-the-art IQA models when tested on subject-rated datasets.

No-reference Omnidirectional Image Quality Assessment Based on Joint Network

  • Chaofan Zhang
  • Shiguang Liu

In panoramic multimedia applications, the perception quality of the omnidirectional content often comes from the observer's perception of the viewports and the overall impression after browsing. Starting from this hypothesis, this paper proposes a deep-learning based joint network to model the no-reference quality assessment of omnidirectional images. On the one hand, motivated by different scenarios that lead to different human understandings, a convolutional neural network (CNN) is devised to simultaneously encode the local quality features and the latent perception rules of different viewports, which are more likely to be noticed by the viewers. On the other hand, a recurrent neural network (RNN) is designed to capture the interdependence between viewports from their sequence representation, and then predict the impact of each viewport on the observer's overall perception. Experiments on two popular omnidirectional image quality databases demonstrate that the proposed method outperforms the state-of-the-art omnidirectional image quality metrics.

PassWalk: Spatial Authentication Leveraging Lateral Shift and Gaze on Mobile Headsets

  • Abhishek Kumar
  • Lik-Hang Lee
  • Jagmohan Chauhan
  • Xiang Su
  • Mohammad A. Hoque
  • Susanna Pirttikangas
  • Sasu Tarkoma
  • Pan Hui

Secure and usable user authentication on mobile headsets is a challenging problem. The miniature-sized touchpad on such devices is a hurdle to user interaction and hurts usability. However, the most common authentication methods, i.e., entering passwords via the standard QWERTY virtual keyboard or mid-air input, are highly vulnerable to shoulder surfing attacks. In this paper, we present PassWalk, a keyboard-less authentication system leveraging multi-modal inputs on mobile headsets. PassWalk demonstrates the feasibility of user authentication driven simultaneously by the user's gaze and lateral shifts (i.e., footsteps). The keyboard-less authentication interface in PassWalk enables users to enter, while highly mobile, graphical passwords composed of digital overlays and physical objects. We conducted an evaluation with 22 recruited participants (15 legitimate users and 7 attackers). Our results show that PassWalk provides high security (only 1.1% of observation attacks were successful) with a mean authentication time of 8.028 s, outperforming the commercial QWERTY virtual keyboard method (21.5% successful attacks) and the research prototype LookUnLock (5.5% successful attacks). Additionally, PassWalk entails a significantly smaller workload on the user than current commercial methods.

Adaptive Hypergraph Convolutional Network for No-Reference 360-degree Image Quality Assessment

  • Jun Fu
  • Chen Hou
  • Wei Zhou
  • Jiahua Xu
  • Zhibo Chen

In no-reference 360-degree image quality assessment (NR 360IQA), graph convolutional networks (GCNs), which model interactions between viewports through graphs, have achieved impressive performance. However, prevailing GCN-based NR 360IQA methods suffer from three main limitations. First, they only use high-level features of the distorted image to regress the quality score, while the human visual system scores the image based on hierarchical features. Second, they simplify complex high-order interactions between viewports in a pairwise fashion through graphs. Third, in the graph construction, they only consider the spatial location of the viewport, ignoring its content characteristics. Accordingly, to address these issues, we propose an adaptive hypergraph convolutional network for NR 360IQA, denoted as AHGCN. Specifically, we first design a multi-level viewport descriptor for extracting hierarchical representations from viewports. Then, we model interactions between viewports through hypergraphs, where each hyperedge connects two or more viewports. In the hypergraph construction, we build a location-based hyperedge and a content-based hyperedge for each viewport. Experimental results on two public 360IQA databases demonstrate that our proposed approach has a clear advantage over state-of-the-art full-reference and no-reference IQA models.
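The following sketch builds a node-by-hyperedge incidence matrix with one location-based and one content-based hyperedge per viewport, to illustrate how such hyperedges could be formed. The angular threshold, the neighbour count, and the use of viewing-direction vectors are illustrative assumptions rather than the paper's construction.

```python
# Sketch of a hypergraph incidence matrix over viewports: one location-based and
# one content-based hyperedge per viewport. Thresholds and neighbour counts are
# illustrative assumptions.
import numpy as np

def build_incidence(viewport_feats, viewport_dirs, k_content=3, angle_thresh=np.pi / 4):
    n = len(viewport_feats)
    H = np.zeros((n, 2 * n))                      # nodes x hyperedges
    feats = viewport_feats / np.linalg.norm(viewport_feats, axis=1, keepdims=True)
    dirs = viewport_dirs / np.linalg.norm(viewport_dirs, axis=1, keepdims=True)
    for i in range(n):
        # location-based hyperedge: viewports within an angular threshold of viewport i
        angles = np.arccos(np.clip(dirs @ dirs[i], -1.0, 1.0))
        H[angles <= angle_thresh, i] = 1.0
        # content-based hyperedge: the k most similar viewports in feature space (plus i)
        sims = feats @ feats[i]
        H[np.argsort(-sims)[: k_content + 1], n + i] = 1.0
    return H

feats = np.random.randn(8, 32)   # toy viewport features
dirs = np.random.randn(8, 3)     # toy viewing directions
print(build_incidence(feats, dirs).shape)   # (8, 16)
```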

DeepWSD: Projecting Degradations in Perceptual Space to Wasserstein Distance in Deep Feature Space

  • Xingran Liao
  • Baoliang Chen
  • Hanwei Zhu
  • Shiqi Wang
  • Mingliang Zhou
  • Sam Kwong

Existing deep learning-based full-reference IQA (FR-IQA) models usually predict image quality in a deterministic way by explicitly comparing features, gauging how severely distorted an image is by how far the corresponding feature lies from the space of reference images. Herein, we look at this problem from a different viewpoint and propose to model the quality degradation in perceptual space from a statistical distribution perspective. As such, the quality is measured by the Wasserstein distance in the deep feature domain. More specifically, the 1D Wasserstein distance at each stage of a pre-trained VGG network is measured, based on which the final quality score is computed. The deep Wasserstein distance (DeepWSD), computed on features from neural networks, offers better interpretability of the quality degradation caused by various types of distortions and strong quality prediction capability. Extensive experiments and theoretical analysis show the superiority of the proposed DeepWSD in terms of both quality prediction and optimization. The implementation of our method is publicly available at https://github.com/Buka-Xing/DeepWSD.
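For a single network stage, the 1D Wasserstein distance between reference and distorted features can be computed from sorted activations, as in the hedged sketch below; how the per-stage distances are pooled into the final DeepWSD score is simplified away here.

```python
# Sketch of a 1D Wasserstein distance between reference and distorted deep features
# at one network stage: treat each channel's activations as an empirical 1D
# distribution and average the sorted-value differences. Aggregation across VGG
# stages and the mapping to a final score are not reproduced here.
import torch

def wasserstein_1d(feat_ref: torch.Tensor, feat_dst: torch.Tensor) -> torch.Tensor:
    """Both inputs: (C, H, W) feature maps from the same stage of a pre-trained CNN."""
    c = feat_ref.shape[0]
    ref = feat_ref.reshape(c, -1).sort(dim=1).values
    dst = feat_dst.reshape(c, -1).sort(dim=1).values
    # For equal sample counts, the 1D Wasserstein-1 distance is the mean absolute
    # difference between sorted samples (i.e., between empirical quantile functions).
    return (ref - dst).abs().mean()

ref = torch.randn(64, 32, 32)
dst = ref + 0.3 * torch.randn(64, 32, 32)
print(wasserstein_1d(ref, ref).item(), wasserstein_1d(ref, dst).item())
```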

Angular Gap: Reducing the Uncertainty of Image Difficulty through Model Calibration

  • Bohua Peng
  • Mobarakol Islam
  • Mei Tu

Curriculum learning needs example difficulty to proceed from easy to hard. However, the credibility of image difficulty is rarely investigated, which can seriously affect the effectiveness of curricula. In this work, we propose Angular Gap, a measure of difficulty based on the difference in angular distance between feature embeddings and class-weight embeddings built by hyperspherical learning. To make difficulty estimation more reliable, we introduce class-wise model calibration, as a post-training technique, into the learnt hyperspherical space. This bridges the gap between probabilistic model calibration and the angular distance estimation of hyperspherical learning. We show the superiority of our calibrated Angular Gap over recent difficulty metrics on CIFAR10-H and ImageNetV2. We further propose a curriculum based on Angular Gap for unsupervised domain adaptation that transitions from learning easy samples to mining hard samples. We combine this curriculum with a state-of-the-art self-training method, Cycle Self-Training (CST). The proposed Curricular CST learns robust representations and outperforms recent baselines on Office31 and VisDA 2017.
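One plausible reading of such an angular difficulty score is sketched below: the angle from a sample's feature to its true class weight minus the smallest angle to any other class weight, so that a larger value indicates a harder example. The exact Angular Gap definition and the subsequent class-wise calibration step are not reproduced.

```python
# Hedged sketch of an angular difficulty score (one plausible reading, not the
# paper's exact Angular Gap or its calibration): angle to the true class weight
# minus the smallest angle to any other class weight.
import torch
import torch.nn.functional as F

def angular_gap(features: torch.Tensor, class_weights: torch.Tensor, labels: torch.Tensor):
    """features: (N, D); class_weights: (C, D); labels: (N,)."""
    cos = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).t()
    angles = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))                # (N, C)
    true_angle = angles.gather(1, labels.view(-1, 1)).squeeze(1)
    other = angles.scatter(1, labels.view(-1, 1), float("inf")).min(dim=1).values
    return true_angle - other    # larger gap = harder example

feats = torch.randn(8, 64)
weights = torch.randn(5, 64)
labels = torch.randint(0, 5, (8,))
print(angular_gap(feats, weights, labels))
```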

GCL: Graph Calibration Loss for Trustworthy Graph Neural Network

  • Min Wang
  • Hao Yang
  • Qing Cheng

Despite the great success of Graph Neural Networks (GNNs), their trustworthiness is still under-explored. A very recent study suggests that GNNs are under-confident in their predictions, the opposite of deep neural networks. In this paper, we investigate why this is the case. We find that the "shallow" architecture of GNNs is the central cause. To address this challenge, we propose a novel Graph Calibration Loss (GCL), the first end-to-end calibration method for GNNs, which reshapes the standard cross-entropy loss and assigns larger loss weights to high-confidence examples. Through empirical observation and theoretical justification, we find that GCL's calibration mechanism is to add a minimal-entropy regularizer to the KL-divergence, lowering the entropy of correctly classified samples. To evaluate the effectiveness of GCL, we train several representative GNN models using GCL as the loss function on various citation network datasets, and further apply GCL to a self-training framework. Compared to existing methods, the proposed method achieves state-of-the-art calibration performance on the node classification task and even improves standard classification accuracy in almost all cases.
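A hedged sketch in the spirit of the described mechanism is given below: a cross-entropy term re-weighted towards high-confidence predictions plus an entropy penalty on correctly classified samples. The exponent and weighting coefficient are illustrative assumptions, and this is not the exact GCL formulation.

```python
# Illustrative calibration-oriented loss in the spirit described above (not the
# exact GCL): cross-entropy up-weighted for confident predictions plus an entropy
# penalty on correctly classified samples. gamma and lam are assumptions.
import torch
import torch.nn.functional as F

def confidence_weighted_ce(logits, labels, gamma=1.0, lam=0.1):
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()
    p_true = p.gather(1, labels.view(-1, 1)).squeeze(1)
    ce = -log_p.gather(1, labels.view(-1, 1)).squeeze(1)
    weighted_ce = (p_true.detach() ** gamma * ce).mean()    # up-weight confident examples
    correct = logits.argmax(dim=1) == labels
    entropy = -(p * log_p).sum(dim=1)                       # predictive entropy per node
    ent_term = entropy[correct].mean() if correct.any() else logits.new_zeros(())
    return weighted_ce + lam * ent_term

logits = torch.randn(16, 7, requires_grad=True)
labels = torch.randint(0, 7, (16,))
print(confidence_weighted_ce(logits, labels).item())
```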

Image Quality Assessment: From Mean Opinion Score to Opinion Score Distribution

  • Yixuan Gao
  • Xiongkuo Min
  • Yucheng Zhu
  • Jing Li
  • Xiao-Ping Zhang
  • Guangtao Zhai

Recently, many methods have been proposed to predict the image quality which is generally described by the mean opinion score (MOS) of all subjective ratings given to an image. However, few efforts focus on predicting the opinion score distribution of the image quality ratings. In fact, the opinion score distribution reflecting subjective diversity, uncertainty, etc., can provide more subjective information about the image quality than a single MOS, which is worthy of in-depth study. In this paper, we propose a convolutional neural network based on fuzzy theory to predict the opinion score distribution of image quality. The proposed method consists of three main steps: feature extraction, feature fuzzification and fuzzy transfer. Specifically, we first use the pre-trained VGG16 without fully-connected layers to extract image features. Then, the extracted features are fuzzified by fuzzy theory, which is used to model epistemic uncertainty in the process of feature extraction. Finally, a fuzzy transfer network is used to predict the opinion score distribution of image quality by learning the mapping from epistemic uncertainty to the uncertainty existing in the image quality ratings. In addition, a new loss function is designed based on the subjective uncertainty of the opinion score distribution. Extensive experimental results prove the superior prediction performance of our proposed method.

No-Reference Image Quality Assessment Using Dynamic Complex-Valued Neural Model

  • Zihan Zhou
  • Yong Xu
  • Ruotao Xu
  • Yuhui Quan

Deep convolutional neural networks (CNNs) have become a promising approach to no-reference image quality assessment (NR-IQA). This paper aims at improving the power of CNNs for NR-IQA in two aspects. Firstly, motivated by the deep connection between complex-valued transforms and human visual perception, we introduce complex-valued convolutions and phase-aware activations beyond traditional real-valued CNNs, which improves the accuracy of NR-IQA without bringing noticeable additional computational costs. Secondly, considering the content-awareness of visual quality perception, we include a dynamic filtering module for better extracting content-aware features, which predicts features based on both local content and global semantics. These two improvements lead to a complex-valued content-aware neural NR-IQA model with good generalization. Extensive experiments on both synthetically and authentically distorted data have demonstrated the state-of-the-art performance of the proposed approach.

Hybrid Conditional Deep Inverse Tone Mapping

  • Tong Shao
  • Deming Zhai
  • Junjun Jiang
  • Xianming Liu

Emerging modern displays are capable of rendering ultra-high definition (UHD) media content with high dynamic range (HDR) and wide color gamut (WCG). Although more and more native content of this kind is being produced, the total amount remains severely limited. Considering the massive amount of legacy standard dynamic range (SDR) content that could be exploited, there is an urgent demand for proper conversion techniques. In this paper, we tackle the task of converting media content from SDR to HDR-WCG for consumer displays. We propose a deep learning-based SDR-to-HDR solution, Hybrid Conditional Deep Inverse Tone Mapping (HyCondITM), which is an end-to-end trainable framework including global transform, local adjustment, and detail refinement in a single unified pipeline. We present a hybrid condition network that simultaneously extracts both global and local priors to guide scene-adaptive and spatially-variant manipulations. Experiments show that our method achieves state-of-the-art performance in both quantitative comparisons and visual quality, outperforming previous methods.

Where Are You Looking?: A Large-Scale Dataset of Head and Gaze Behavior for 360-Degree Videos and a Pilot Study

  • Yili Jin
  • Junhua Liu
  • Fangxin Wang
  • Shuguang Cui

360° videos have experienced booming development in recent years. Compared to traditional videos, 360° videos are characterized by uncertain user behaviors, which bring opportunities as well as challenges. Datasets are necessary for researchers and developers to explore new ideas and conduct reproducible analyses for fair comparisons among different solutions. However, existing related datasets mostly focus on users' field of view (FoV), ignoring the more informative eye gaze, not to mention the integrated extraction and analysis of both FoV and eye gaze. Besides, users' behavior patterns are highly related to videos, yet most existing datasets only classify videos subjectively and qualitatively by genre, which lacks quantitative analysis and fails to characterize the intrinsic properties of a video scene. To this end, we first propose a quantitative taxonomy for 360° videos that contains three objective technical metrics. Based on this taxonomy, we collect a dataset containing users' head and gaze behaviors simultaneously, which outperforms existing datasets in dimensional richness, scale, diversity, and sampling frequency. We then conduct a pilot study on users' behaviors and obtain some interesting findings, such as that a user's head direction tends to follow his/her gaze direction after a characteristic time interval. We further conduct a case study of tile-based 360° video streaming on our dataset, demonstrating that existing works can achieve substantial performance improvements by leveraging the provided gaze information. Our dataset is available at https://cuhksz-inml.github.io/head_gaze_dataset/
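
The head-follows-gaze finding suggests a simple analysis one could run on such traces: estimate the delay at which the head trajectory best correlates with the gaze trajectory. The sketch below assumes time-aligned 1-D yaw traces sampled at a fixed rate; it is an illustrative analysis, not the authors' pilot-study code, and all names are placeholders.

```python
import numpy as np

def estimate_lag(gaze_yaw, head_yaw, fs=60.0, max_lag_s=1.0):
    """Estimate how long the head lags behind the gaze (in seconds) via cross-correlation.
    gaze_yaw, head_yaw: 1-D arrays of yaw angles sampled at fs Hz."""
    g = gaze_yaw - gaze_yaw.mean()
    h = head_yaw - head_yaw.mean()
    corr = []
    for lag in range(int(max_lag_s * fs) + 1):
        a = g[: len(g) - lag] if lag else g   # gaze trace, truncated to match length
        b = h[lag:]                           # head trace shifted back by `lag` samples
        corr.append(np.corrcoef(a, b)[0, 1])
    return int(np.argmax(corr)) / fs

# Toy example: the head follows the gaze with a 0.2 s delay.
t = np.arange(0, 10, 1 / 60.0)
gaze = np.sin(2 * np.pi * 0.5 * t)
head = np.sin(2 * np.pi * 0.5 * (t - 0.2))
print(estimate_lag(gaze, head))  # ~0.2
```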

SESSION: Oral Session V: Experience -- Art and Culture

Im2Oil: Stroke-Based Oil Painting Rendering with Linearly Controllable Fineness Via Adaptive Sampling

  • Zhengyan Tong
  • Xiaohang Wang
  • Shengchao Yuan
  • Xuanhong Chen
  • Junjie Wang
  • Xiangzhong Fang

This paper proposes a novel stroke-based rendering (SBR) method that translates images into vivid oil paintings. Previous SBR techniques usually formulate the oil painting problem as pixel-wise approximation. Departing from this route, we treat oil painting creation as an adaptive sampling problem. Firstly, we compute a probability density map based on the texture complexity of the input image. Then we use the Voronoi algorithm to sample a set of pixels as the stroke anchors. Next, we search for and generate an individual oil stroke at each anchor. Finally, we place all the strokes on the canvas to obtain the oil painting. By adjusting the maximum sampling probability hyper-parameter, we can control the fineness of the oil painting in a linear manner. Comparison with existing state-of-the-art oil painting techniques shows that our results have higher fidelity and more realistic textures. A user opinion test demonstrates that people prefer our oil paintings to the results of other methods. More results and the code are available at https://github.com/TZYSJTU/Im2Oil.
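
The adaptive-sampling step can be approximated as follows: build a density map from local texture complexity (gradient magnitude serves as a proxy here) and draw stroke anchors with probability proportional to it. The Voronoi-based refinement and the stroke search are not reproduced, and the function and parameter names are hypothetical.

```python
import numpy as np
from scipy import ndimage

def sample_stroke_anchors(gray, num_anchors=2000, p_max=0.1):
    """Sample anchor pixels with probability proportional to texture complexity.
    gray: 2-D float image in [0, 1]; p_max loosely mimics the maximum sampling
    probability that controls painting fineness (hypothetical simplification)."""
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    density = np.hypot(gx, gy)                          # texture complexity proxy
    density = np.clip(density / (density.max() + 1e-8), 0.0, p_max)
    weights = density.ravel() + 1e-12                   # avoid an all-zero distribution
    prob = weights / weights.sum()
    idx = np.random.choice(prob.size, size=num_anchors, replace=False, p=prob)
    ys, xs = np.unravel_index(idx, gray.shape)
    return np.stack([ys, xs], axis=1)                   # (num_anchors, 2) anchor coordinates

anchors = sample_stroke_anchors(np.random.rand(256, 256))
```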

ReLyMe: Improving Lyric-to-Melody Generation by Incorporating Lyric-Melody Relationships

  • Chen Zhang
  • Luchin Chang
  • Songruoyao Wu
  • Xu Tan
  • Tao Qin
  • Tie-Yan Liu
  • Kejun Zhang

Lyric-to-melody generation, which generates melody according to given lyrics, is one of the most important automatic music composition tasks. With the rapid development of deep learning, previous works address this task with end-to-end neural network models. However, deep learning models cannot well capture the strict but subtle relationships between lyrics and melodies, which compromises the harmony between lyrics and generated melodies. In this paper, we propose ReLyMe, a method that incorporates Relationships between Lyrics and Melodies from music theory to ensure the harmony between lyrics and melodies. Specifically, we first introduce several principles that lyrics and melodies should follow in terms of tone, rhythm, and structure relationships. These principles are then integrated into neural network lyric-to-melody models by adding corresponding constraints during the decoding process to improve the harmony between lyrics and melodies. We use a series of objective and subjective metrics to evaluate the generated melodies. Experiments on both English and Chinese song datasets show the effectiveness of ReLyMe, demonstrating the superiority of incorporating lyric-melody relationships from the music domain into neural lyric-to-melody generation.

SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

  • Zihao Wang
  • Kejun Zhang
  • Yuxing Wang
  • Chen Zhang
  • Qihao Liang
  • Pengfei Yu
  • Yongsheng Feng
  • Wenbo Liu
  • Yikai Wang
  • Yuntao Bao
  • Yiheng Yang

Real-time music accompaniment generation has a wide range of applications in the music industry, such as music education and live performances. However, automatic real-time music accompaniment generation is still understudied and often faces a trade-off between logical latency and exposure bias. In this paper, we propose SongDriver, a real-time music accompaniment generation system without logical latency nor exposure bias. Specifically, SongDriver divides one accompaniment generation task into two phases: 1) The arrangement phase, where a Transformer model first arranges chords for input melodies in real-time, and caches the chords for the next phase instead of playing them out. 2) The prediction phase, where a CRF model generates playable multi-track accompaniments for the coming melodies based on previously cached chords. With this two-phase strategy, SongDriver directly generates the accompaniment for the upcoming melody, achieving zero logical latency. Furthermore, when predicting chords for a timestep, SongDriver refers to the cached chords from the first phase rather than its previous predictions, which avoids the exposure bias problem. Since the input length is often constrained under real-time conditions, another potential problem is the loss of long-term sequential information. To make up for this disadvantage, we extract four musical features from a long-term music piece before the current time step as global information. In the experiment, we train SongDriver on some open-source datasets and an original àiMusic Dataset built from Chinese-style modern pop music sheets. The results show that SongDriver outperforms existing SOTA (state-of-the-art) models on both objective and subjective metrics, meanwhile significantly reducing the physical latency.

CACOLIT: Cross-domain Adaptive Co-learning for Imbalanced Image-to-Image Translation

  • Yijun Wang
  • Tao Liang
  • Jianxin Lin

State-of-the-art unsupervised image-to-image translation (I2I) methods have made great progress on transferring images from a source domain X to a target domain Y. However, training these unsupervised I2I models on an imbalanced target domain (e.g., Y with limited samples) usually causes mode collapse, which has not been well solved in the current literature. In this work, we propose a new Cross-domain Adaptive Co-learning paradigm, CACOLIT, to alleviate the imbalanced unsupervised I2I training problem. Concretely, CACOLIT first constructs a teacher translation model by introducing an auxiliary domain alongside the source domain, as well as two complementary student translation models that form an I2I closed loop. Then, the two student models are learned simultaneously by transferring correspondence knowledge from the teacher model in an interactive way. With extensive experiments on both human face style transfer and animal face translation tasks, we demonstrate that our adaptive co-learning model effectively transfers correspondence knowledge from the teacher model to the student models and generates more diverse and realistic images than existing I2I methods, both qualitatively and quantitatively.

EuglPollock: Rethinking Interspecies Collaboration through Art Making

  • Kyungwon Lee
  • Yu-Kyung Jang
  • Jaewoo Jung
  • Dong Hwan Kim
  • Hyun Jean Lee
  • Seung Ah Lee

Humans are no longer the exclusive creators of art; art can now be produced by non-human actors such as artificial intelligence, machines, or animals. This paper presents EuglPollock, a platform for creating artwork through interactions between humans and algae called Euglena gracilis. Through light-mediated interactions between human users and microorganisms under a microscope, EuglPollock generates various versions of artworks, each of them unique and non-repeatable. This paper proposes a new method to create art while simultaneously raising interest in microorganisms and microbiology through interspecies collaborations.

SESSION: Poster Session V: Experience -- Art and Culture

Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

  • Nisha Huang
  • Fan Tang
  • Weiming Dong
  • Changsheng Xu

Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, thereby limiting the expressiveness of the model and the diversity of generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify the text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the effectiveness of combining the diffusion model with multimodal guidance. Code is available at https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.

AesUST: Towards Aesthetic-Enhanced Universal Style Transfer

  • Zhizhong Wang
  • Zhanjie Zhang
  • Lei Zhao
  • Zhiwen Zuo
  • Ailin Li
  • Wei Xing
  • Dongming Lu

Recent studies have shown remarkable success in universal style transfer which transfers arbitrary visual styles to content images. However, existing approaches suffer from the aesthetic-unrealistic problem that introduces disharmonious patterns and evident artifacts, making the results easy to spot from real paintings. To address this limitation, we propose AesUST, a novel Aesthetic-enhanced Universal Style Transfer approach that can generate aesthetically more realistic and pleasing results for arbitrary styles. Specifically, our approach introduces an aesthetic discriminator to learn the universal human-delightful aesthetic features from a large corpus of artist-created paintings. Then, the aesthetic features are incorporated to enhance the style transfer process via a novel Aesthetic-aware Style-Attention (AesSA) module. Such an AesSA module enables our AesUST to efficiently and flexibly integrate the style patterns according to the global aesthetic channel distribution of the style image and the local semantic spatial distribution of the content image. Moreover, we also develop a new two-stage transfer training strategy with two aesthetic regularizations to train our model more effectively, further improving stylization performance. Extensive experiments and user studies demonstrate that our approach synthesizes aesthetically more harmonious and realistic results than state of the art, greatly narrowing the disparity with real artist-created paintings. Our code is available at https://github.com/EndyWon/AesUST.

Semi-supervised Human Pose Estimation in Art-historical Images

  • Matthias Springstein
  • Stefanie Schneider
  • Christian Althaus
  • Ralph Ewerth

Gesture, as a language of non-verbal communication, has been theoretically established since the 17th century. However, its relevance for the visual arts has been expressed only sporadically. This may be primarily due to the sheer overwhelming amount of data that traditionally had to be processed by hand. With the steady progress of digitization, though, a growing number of historical artifacts have been indexed and made available to the public, creating a need for the automatic retrieval of art-historical motifs with similar body constellations or poses. Since the domain of art differs significantly from existing real-world datasets for human pose estimation due to its style variance, this presents new challenges. In this paper, we propose a novel approach to estimate human poses in art-historical images. In contrast to previous work that attempts to bridge the domain gap with pre-trained models or through style transfer, we suggest semi-supervised learning for both object and keypoint detection. Furthermore, we introduce a novel domain-specific art dataset that includes both bounding box and keypoint annotations of human figures. Our approach achieves significantly better results than methods that use pre-trained models or style transfer.

Understanding and Identifying Artwork Plagiarism with the Wisdom of Designers: A Case Study on Poster Artworks

  • Shenglan Cui
  • Fang Liu
  • Tongqing Zhou
  • Mohan Zhang

The wide sharing and rapid dissemination of digital artworks have aggravated the issue of plagiarism, raising significant concerns for cultural preservation and copyright protection. Yet the modes of plagiarism remain formally uncharted, leaving plagiarism detection to rely on rough duplicate-checking practices. This work is thus devoted to understanding artwork plagiarism, with poster design as the running case, in order to build more dedicated detection techniques. As the first study of its kind, we elaborate on 8 elements that make a poster unique and 6 judgement criteria for plagiarism through an exploratory study with designers. Second, we build a novel poster dataset with plagiarism annotations according to these criteria. Third, we propose models that leverage the combination of the primary elements and plagiarism criteria to find suspect instances in a retrieval process. The models are trained in the context of modern artwork and evaluated on the poster plagiarism dataset. The proposal is shown to outperform the baseline with superior Top-K accuracy (~33%) and retrieval performance (~42%).
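
Since the models find suspect instances in a retrieval process, the reported Top-K accuracy can be computed along the following lines; the similarity matrix and the single-ground-truth assumption are illustrative, not taken from the paper.

```python
import numpy as np

def top_k_accuracy(similarity, gt_index, k=5):
    """similarity: (num_queries, num_gallery) scores; gt_index[i] is the gallery
    index of the artwork that query i is suspected of plagiarizing."""
    topk = np.argsort(-similarity, axis=1)[:, :k]          # indices of the k best matches
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return hits.mean()

sim = np.random.rand(10, 100)              # toy similarity scores
gt = np.random.randint(0, 100, size=10)    # toy ground-truth gallery indices
print(top_k_accuracy(sim, gt, k=5))
```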

REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

  • Quanwei Yang
  • Xinchen Liu
  • Wu Liu
  • Hongtao Xie
  • Xiaoyan Gu
  • Lingyun Yu
  • Yongdong Zhang

Human Video Motion Transfer (HVMT) aims to, given an image of a source person, generate his/her video that imitates the motion of the driving person. Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation based on the flow estimated from the source person image and each driving video frame. However, these methods always generate obvious artifacts due to the dramatic differences in poses, scales, and shifts between the source person and the driving person. To overcome these challenges, this paper presents a novel REgion-to-whole human MOtion Transfer (REMOT) framework based on GANs. To generate realistic motions, the REMOT adopts a progressive generation paradigm: it first generates each body part in the driving pose without flow-based warping, then composites all parts into a complete person of the driving motion. Moreover, to preserve the natural global appearance, we design a Global Alignment Module to align the scale and position of the source person with those of the driving person based on their layouts. Furthermore, we propose a Texture Alignment Module to keep each part of the person aligned according to the similarity of the texture. Finally, through extensive quantitative and qualitative experiments, our REMOT achieves state-of-the-art results on two public benchmarks.

GroupDancer: Music to Multi-People Dance Synthesis with Style Collaboration

  • Zixuan Wang
  • Jia Jia
  • Haozhe Wu
  • Junliang Xing
  • Jinghe Cai
  • Fanbo Meng
  • Guowen Chen
  • Yanfeng Wang

Different people dance in different styles, so when multiple people dance together, the phenomenon of style collaboration occurs: people need to seek common ground while reserving differences during various dancing periods. We therefore introduce a novel Music-driven Group Dance Synthesis task. Compared with the single-person dance synthesis explored by most previous works, modeling the style collaboration phenomenon and choreographing for multiple people are more complicated and challenging. Moreover, the lack of sufficient records of multi-person choreography in prior datasets further aggravates this problem. To address these issues, we construct a richly annotated 3D Multi-Dancer Choreography dataset (MDC) and devise a new metric, SCEU, for style collaboration evaluation. To the best of our knowledge, MDC is the first 3D dance dataset that collects both individual and collaborative music-dance pairs. Based on MDC, we present a novel framework, GroupDancer, consisting of three stages: Dancer Collaboration, Motion Choreography and Motion Transition. The Dancer Collaboration stage determines from the music when and which dancers should collaborate their dancing styles. Afterward, the Motion Choreography stage produces a motion sequence for each dancer. Finally, the Motion Transition stage fills the gaps between the motions to achieve fluent and natural group dance. To make GroupDancer trainable end to end and able to synthesize group dance with style collaboration, we propose mixed training and selective updating strategies. Comprehensive evaluations on the MDC dataset demonstrate that the proposed GroupDancer model can synthesize quite satisfactory group dance results with style collaboration.

CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising

  • Daqian Shi
  • Xiaolei Diao
  • Lida Shi
  • Hao Tang
  • Yang Chi
  • Chuntao Li
  • Hao Xu

Degraded images commonly exist in the general sources of character images, leading to unsatisfactory character recognition results. Existing methods have dedicated efforts to restoring degraded character images. However, the denoising results obtained by these methods do not appear to improve character recognition performance. This is mainly because current methods only focus on pixel-level information and ignore critical features of a character, such as its glyph, resulting in character-glyph damage during the denoising process. In this paper, we introduce a novel generic framework based on glyph fusion and attention mechanisms, i.e., CharFormer, for precisely recovering character images without changing their inherent glyphs. Unlike existing frameworks, CharFormer introduces a parallel target task for capturing additional information and injecting it into the image denoising backbone, which will maintain the consistency of character glyphs during character image denoising. Moreover, we utilize attention-based networks for global-local feature interaction, which will help to deal with blind denoising and enhance denoising performance. We compare CharFormer with state-of-the-art methods on multiple datasets. The experimental results show the superiority of CharFormer quantitatively and qualitatively.

Delving into the Frequency: Temporally Consistent Human Motion Transfer in the Fourier Space

  • Guang Yang
  • Wu Liu
  • Xinchen Liu
  • Xiaoyan Gu
  • Juan Cao
  • Jintao Li

Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos that enable one person to imitate the motion of others. However, current synthetic videos suffer from temporal inconsistency across sequential frames, which significantly degrades video quality yet is far from solved by existing methods in the pixel domain. Recently, some works on DeepFake detection have tried to distinguish natural and synthetic images in the frequency domain because of the frequency insufficiency of image synthesis methods. Nonetheless, no work has studied the temporal inconsistency of synthetic videos from the perspective of the frequency-domain gap between natural and synthetic videos. Therefore, in this paper, we propose to delve into the frequency space for temporally consistent human motion transfer. First of all, we make the first comprehensive analysis of natural and synthetic videos in the frequency domain to reveal the frequency gap in both the spatial dimension of individual frames and the temporal dimension of the video. To close the frequency gap between natural and synthetic videos, we propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of the synthesized videos. FreMOTR explores two novel frequency-based regularization modules: 1) the Frequency-domain Appearance Regularization (FAR) to improve the appearance of the person in individual frames and 2) the Temporal Frequency Regularization (TFR) to guarantee temporal consistency between adjacent frames. Finally, comprehensive experiments demonstrate that FreMOTR not only yields superior performance in temporal consistency metrics but also improves the frame-level visual quality of synthetic videos. In particular, the temporal consistency metrics are improved by nearly 30% over the state-of-the-art model.
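
To make the frequency-domain idea concrete, a simplified stand-in for the two regularizers could penalize the amplitude-spectrum gap between synthetic and real frames (FAR-like) and the frame-to-frame spectral jitter within a clip (TFR-like). The sketch below is a loose approximation of that idea, not the paper's actual losses.

```python
import torch

def frequency_amplitude_loss(synth, real):
    """L1 gap between the 2-D amplitude spectra of synthetic and real frames.
    synth, real: (B, C, H, W) tensors; a simplified FAR-like term."""
    amp_s = torch.fft.fft2(synth, norm="ortho").abs()
    amp_r = torch.fft.fft2(real, norm="ortho").abs()
    return (amp_s - amp_r).abs().mean()

def temporal_frequency_loss(frames):
    """Penalize spectral jitter between adjacent frames; a simplified TFR-like term.
    frames: (B, T, C, H, W)."""
    amp = torch.fft.fft2(frames, norm="ortho").abs()
    return (amp[:, 1:] - amp[:, :-1]).abs().mean()

synth = torch.rand(2, 3, 64, 64)
real = torch.rand(2, 3, 64, 64)
clip = torch.rand(2, 4, 3, 64, 64)
loss = frequency_amplitude_loss(synth, real) + temporal_frequency_loss(clip)
```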

Adaptive Affine Transformation: A Simple and Effective Operation for Spatial Misaligned Image Generation

  • Zhimeng Zhang
  • Yu Ding

Spatial misaligned image generation, i.e., translation between two face/pose images with large spatial deformation, is a challenging problem widely faced in face/pose reenactment tasks. Advanced researchers use dense flow to solve this problem. However, under complex spatial deformation, even with carefully designed networks, intrinsic complexities make it difficult to compute an accurate dense flow, leading to distorted results. Different from those dense flow based methods, we propose a simple but effective operator named AdaAT (Adaptive Affine Transformation) to realize misaligned image generation. AdaAT simulates spatial deformation by computing hundreds of affine transformations, resulting in fewer distortions. Without computing any dense flow, AdaAT directly carries out affine transformations in feature channel spaces. Furthermore, we package several AdaAT operators into one universal AdaAT module that is used for different face/pose generation tasks. To validate the effectiveness of AdaAT, we conduct qualitative and quantitative experiments on four common datasets in the tasks of talking face generation, face reenactment, pose transfer and person image generation, and achieve state-of-the-art results on three of them.
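
The core operation, applying an individual affine transformation to each feature channel, can be sketched with standard grid sampling as below. In the paper the affine parameters are predicted adaptively from the inputs, whereas here they are simply passed in, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def adaptive_affine_transform(feat, theta):
    """Apply one 2x3 affine matrix per feature channel (a minimal AdaAT-style op).
    feat:  (B, C, H, W) feature maps.
    theta: (B, C, 2, 3) affine parameters, assumed to be predicted elsewhere."""
    B, C, H, W = feat.shape
    feat = feat.reshape(B * C, 1, H, W)
    theta = theta.reshape(B * C, 2, 3)
    grid = F.affine_grid(theta, size=(B * C, 1, H, W), align_corners=False)
    warped = F.grid_sample(feat, grid, align_corners=False)
    return warped.reshape(B, C, H, W)

feat = torch.randn(2, 8, 32, 32)
identity = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
theta = identity.repeat(2, 8, 1, 1)                  # identity transform for every channel
out = adaptive_affine_transform(feat, theta)
assert torch.allclose(out, feat, atol=1e-4)          # identity warp reproduces the input
```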

RCRN: Real-world Character Image Restoration Network via Skeleton Extraction

  • Daqian Shi
  • Xiaolei Diao
  • Hao Tang
  • Xiaomin Li
  • Hao Xing
  • Hao Xu

Constructing high-quality character image datasets is challenging because real-world images are often affected by image degradation. There are limitations when applying current image restoration methods to such real-world character images, since (i) the categories of noise in character images are different from those in general images; (ii) real-world character images usually contain more complex image degradation, e.g., mixed noise at different noise levels. To address these problems, we propose a real-world character restoration network (RCRN) to effectively restore degraded character images, where character skeleton information and scale-ensemble feature extraction are utilized to obtain better restoration performance. The proposed method consists of a skeleton extractor (SENet) and a character image restorer (CiRNet). SENet aims to preserve the structural consistency of the character and normalize complex noise. Then, CiRNet reconstructs clean images from degraded character images and their skeletons. Due to the lack of benchmarks for real-world character image restoration, we constructed a dataset containing 1,606 character images with real-world degradation to evaluate the validity of the proposed method. The experimental results demonstrate that RCRN outperforms state-of-the-art methods quantitatively and qualitatively.

Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation

  • Yupei Lin
  • Sen Zhang
  • Tianshui Chen
  • Yongyi Lu
  • Guangping Li
  • Yukai Shi

Unpaired image-to-image translation aims to find a mapping between the source domain and the target domain. To alleviate the lack of supervised labels for the source images, cycle-consistency based methods have been proposed to preserve image structure by assuming a reversible relationship between unpaired images. However, this assumption exploits only limited correspondence between image pairs. Recently, contrastive learning (CL) has been used to further investigate image correspondence in unpaired image translation through patch-based positive/negative learning. Patch-based contrastive routines obtain the positives by self-similarity computation and treat the remaining patches as negatives. This flexible learning paradigm obtains auxiliary contextualized information at a low cost. Since the number of negatives is large, we investigate a natural question: are all negatives necessary for feature contrastive learning? Unlike previous CL approaches that use as many negatives as possible, in this paper we study the negatives from an information-theoretic perspective and introduce a new negative Pruning technology for Unpaired image-to-image Translation (PUT) that sparsifies and ranks the patches. The proposed algorithm is efficient and flexible, and enables the model to learn essential information between corresponding patches stably. By putting quality over quantity, only a few negative patches are required to achieve better results. Lastly, we validate the superiority, stability, and versatility of our model through comparative experiments.
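
The quality-over-quantity idea can be illustrated with an InfoNCE-style patch loss that ranks candidate negatives by similarity to the anchor and keeps only the hardest few; this is a simplified reading of negative pruning with hypothetical tensor shapes, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def pruned_patch_nce(anchor, positive, negatives, k=16, tau=0.07):
    """InfoNCE over a pruned negative set.
    anchor, positive: (B, D) patch features; negatives: (B, N, D) candidates.
    Only the k most similar (hardest) negatives per anchor are kept."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos_logit = (a * p).sum(-1, keepdim=True) / tau                 # (B, 1)
    neg_logits = torch.bmm(n, a.unsqueeze(-1)).squeeze(-1) / tau    # (B, N)
    neg_logits, _ = neg_logits.topk(k, dim=-1)                      # prune: keep hardest k
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    labels = torch.zeros(a.size(0), dtype=torch.long)               # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = pruned_patch_nce(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 256, 128))
```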

Sundial-GAN: A Cascade Generative Adversarial Networks Framework for Deciphering Oracle Bone Inscriptions

  • Xiang Chang
  • Fei Chao
  • Changjing Shang
  • Qiang Shen

Oracle Bone Inscription (OBI) is an early hieroglyphic script from China and one of the most famous ancient writing systems in the world. However, only a small number of OBI characters have been fully deciphered to date. Chinese characters have taken different forms at different historical stages; therefore, it is very difficult to directly translate OBI characters into modern Chinese characters due to the long historic evolutionary process. In this paper, we propose a cascade generative adversarial network (GAN) framework for deciphering OBI characters, named "Sundial-GAN", a cascaded structure that simulates the evolutionary process of Chinese characters from an OBI character to its potential modern Chinese character. We select four representative stages in the evolutionary process of OBI, each of which is implemented by an individual GAN structure based on the characteristics of that evolutionary stage. These structures are cascaded in sequence to accurately simulate the evolutionary process of Chinese characters. For each input OBI character, Sundial-GAN can successfully generate the input's different forms at the four historical stages. Extensive experiments and comparisons demonstrate that the generated characters at each stage have high similarity to real existing characters; therefore, the proposed method can significantly improve the efficiency and accuracy of OBI deciphering for archaeological researchers. Compared to direct image-to-image translation methods, our approach allows for a smoother translation process, a better grasp of details, and more effective avoidance of random mappings in GANs.

Structure-Enhanced Pop Music Generation via Harmony-Aware Learning

  • Xueyao Zhang
  • Jinchao Zhang
  • Yao Qiu
  • Li Wang
  • Jie Zhou

Pop music generation has long been an attractive topic for both musicians and scientists. However, automatically composing pop music with a satisfactory structure is still a challenging issue. In this paper, we propose to leverage harmony-aware learning for structure-enhanced pop music generation. On the one hand, one of the participants of harmony, the chord, represents the harmonic set of multiple notes, which is integrated closely with the spatial structure of music, the texture. On the other hand, the other participant of harmony, the chord progression, usually accompanies the development of the music, which promotes the temporal structure of music, the form. Moreover, when chords evolve into a chord progression, the texture and the form can be bridged naturally by the harmony, which contributes to the joint learning of the two structures. Furthermore, we propose the Harmony-Aware Hierarchical Music Transformer (HAT), which can adaptively exploit the structure from the music and make the musical tokens interact hierarchically to enhance the structure across multi-level musical elements. Experimental results reveal that, compared to existing methods, HAT has a much better understanding of the structure and can also improve the quality of the generated music, especially in form and texture.

Dynamic Weighted Semantic Correspondence for Few-Shot Image Generative Adaptation

  • Xingzhong Hou
  • Boxiao Liu
  • Shuai Zhang
  • Lulin Shi
  • Zite Jiang
  • Haihang You

Few-shot image generative adaptation, which finetunes well-trained generative models on limited examples, is of practical importance. The main challenge is that the few-shot model easily becomes overfitted. This can be attributed to two aspects: the lack of sample diversity for the generator and the failure of fidelity discrimination for the discriminator. In this paper, we introduce two novel methods to address diversity and fidelity, respectively. Concretely, we propose dynamic weighted semantic correspondence to preserve diversity for the generator, which benefits from the richness of samples generated by source models. To prevent discriminator overfitting, we propose a coupled training paradigm across the source and target domains to keep the feature extraction capability of the discriminator backbone. Extensive experiments show that our method significantly outperforms previous methods in both image quality and diversity.

The Beauty of Repetition in Machine Composition Scenarios

  • Zhejing Hu
  • Xiao Ma
  • Yan Liu
  • Gong Chen
  • Yongxu Liu

Repetition, a basic form of artistic creation, appears in most musical works and delivers enthralling aesthetic experiences. However, repetition remains underexplored in terms of automatic music composition. As an initial effort in repetition modelling, this paper focuses on generating motif-level repetitions via domain knowledge-based and example-based learning techniques. A novel repetition transformer (R-Transformer) that combines a Transformer encoder and a repetition-aware learner is trained on a new repetition dataset with 584,329 samples from different categories of motif repetition. The Transformer encoder learns the representation among music notes from the repetition dataset; the novel repetition-aware learner exploits repetitions' unique characteristics based on music theory. Experiments show that, with any given motif, R-Transformer can generate a large number of variable and beautiful repetitions. With ingenious fusion of these high-quality pieces, the musicality and appeal of machine-composed music have been greatly improved.

CariPainter: Sketch Guided Interactive Caricature Generation

  • Xin Huang
  • Dong Liang
  • Hongrui Cai
  • Juyong Zhang
  • Jinyuan Jia

In this paper, we propose CariPainter, the first interactive caricature generation and editing method. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We solve this challenging problem by utilizing semantic segmentation maps as an intermediary domain, removing the influence of photo texture while preserving the person-specific geometry features. Specifically, our proposed method consists of two main components: CariSketchNet and CariMaskGAN. CariSketchNet exaggerates the photo segmentation map to construct CariMask. Then, CariMask is converted into a caricature by CariMaskGAN. In this step, users can freely edit and adjust the geometry of the caricatures. Additionally, we propose a semantic detail pre-processing approach, which considerably increases the detail of the generated images and allows modification of hair strands, wrinkles, and beards. Extensive experimental results show that our method produces higher-quality caricatures and supports easy interactive modification.

Cartoon-Flow: A Flow-Based Generative Adversarial Network for Arbitrary-Style Photo Cartoonization

  • Jieun Lee
  • Hyeonwoo Kim
  • Jonghwa Shim
  • Eenjun Hwang

Photo cartoonization aims to convert photos of real-world scenes into cartoon-style images. Recently, generative adversarial network (GAN)-based methods for photo cartoonization have been proposed to generate pleasing cartoonized images. However, as these methods can transfer only learned cartoon styles to photos, they are limited in general-purpose applications where unlearned styles are often required. To address this limitation, an arbitrary style transfer (AST) method that transfers an arbitrary artistic style into content images can be used. However, conventional AST methods do not perform satisfactorily in cartoonization for two reasons. First, they cannot capture the unique characteristics of cartoons that differ from common artistic styles. Second, they suffer from content leaks in which the semantic structure of the content is distorted. In this paper, to solve these problems, we propose a novel arbitrary-style photo cartoonization method, Cartoon-Flow. More specifically, we construct a new hybrid GAN with an invertible neural flow generator to effectively preserve content information. In addition, we introduce two new losses for cartoonization: (1) an edge-promoting smooth loss to learn the unique characteristics of cartoons with smooth surfaces and clear edges, and (2) a line loss to mimic the line drawing of cartoons. Extensive experiments demonstrate that the proposed method outperforms previous methods both quantitatively and qualitatively.

SESSION: Oral Session VI: Experience -- Multimedia Applications

Span-based Audio-Visual Localization

  • Yiling Wu
  • Xinfeng Zhang
  • Yaowei Wang
  • Qingming Huang

This paper focuses on the audio-visual event localization task, which aims to match both visible and audible components in a video to identify the event of interest. Existing methods largely ignore the continuity of audio-visual events and classify each segment separately: they either classify the event category score of each segment separately or calculate the event-relevance score of each segment separately. However, events in video are often continuous and last several segments. Motivated by this, we propose a span-based framework that considers consecutive segments jointly. The span-based framework handles the audio-visual localization task by predicting the event class and extracting the event span. Specifically, a [CLS] token is applied to collect global information with self-attention mechanisms to predict the event class. Relevance scores and positional embeddings are fed into the span predictor to estimate the start and end boundaries of the event. Multi-modal Mixup is further used to improve the robustness and generalization of the model. Experiments conducted on the AVE dataset demonstrate that the proposed method outperforms state-of-the-art methods.
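
A minimal version of such a span-based head might prepend a [CLS] token to the segment features, classify the event from that token, and predict start/end logits for every segment, as sketched below; the dimensions and layer choices are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SpanLocalizer(nn.Module):
    """Toy span-based head: a [CLS] token for event classification plus
    start/end boundary logits over T audio-visual segment features."""
    def __init__(self, dim=256, num_classes=28, num_layers=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)   # event category from [CLS]
        self.boundary = nn.Linear(dim, 2)               # start / end logits per segment

    def forward(self, seg_feats):                        # seg_feats: (B, T, dim)
        B = seg_feats.size(0)
        x = torch.cat([self.cls_token.expand(B, -1, -1), seg_feats], dim=1)
        x = self.encoder(x)
        event_logits = self.classifier(x[:, 0])                         # (B, num_classes)
        start_logits, end_logits = self.boundary(x[:, 1:]).unbind(-1)   # (B, T) each
        return event_logits, start_logits, end_logits

event_logits, start_logits, end_logits = SpanLocalizer()(torch.randn(2, 10, 256))
```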

PC-Dance: Posture-controllable Music-driven Dance Synthesis

  • Jibin Gao
  • Junfu Pu
  • Honglun Zhang
  • Ying Shan
  • Wei-Shi Zheng

Music-driven dance synthesis is the task of generating high-quality dance according to music given by the user, which has promising entertainment applications. However, most existing methods cannot provide an efficient and effective way for user intervention in dance generation, e.g., posture control. In this work, we propose a powerful framework named PC-Dance to perform adaptive posture-controllable music-driven dance synthesis. Consisting of a music-to-dance alignment embedding network (M2D-Align) and a posture-controllable dance synthesis module (PC-Syn), PC-Dance allows fine-grained control via input anchor poses efficiently, without artist participation. Specifically, to reduce the cost of artist participation while still generating high-quality dance efficiently, a self-supervised rhythm alignment module is designed to further learn the music-to-dance alignment embedding. As for PC-Syn, we introduce an efficient scheme for adaptive motion graph construction (AMGC), which improves the efficiency of graph-based optimization and preserves the diversity of motions. Since there are few related public datasets, we collect an MMD-ARC dataset for music-driven dance synthesis. The experimental results on the MMD-ARC dataset demonstrate the effectiveness of our framework and the feasibility of dance synthesis with adaptive posture control.

Delving Globally into Texture and Structure for Image Inpainting

  • Haipeng Liu
  • Yang Wang
  • Meng Wang
  • Yong Rui

Image inpainting has achieved remarkable progress and inspired abundant methods, where the critical bottleneck is how to fill the high-frequency structure and low-frequency texture information of the masked regions with semantics. To this end, deep models exhibit powerful superiority in capturing them, yet they are constrained to local spatial regions. In this paper, we delve globally into texture and structure information to well capture the semantics for image inpainting. As opposed to existing arts trapped in independent local patches, the texture information of each patch is reconstructed from all other patches across the whole image, to match the coarsely filled information, especially the structure information over the masked regions. Unlike current decoder-only transformers operating at the pixel level for image inpainting, our model adopts a transformer pipeline paired with both an encoder and a decoder. On the one hand, the encoder captures the texture semantic correlations of all patches across the image via a self-attention module. On the other hand, an adaptive patch vocabulary is dynamically established in the decoder for the filled patches over the masked regions. Building on this, a structure-texture matching attention module anchored on the known regions marries the best of these two worlds for progressive inpainting via a probabilistic diffusion process. Our model is orthogonal to the fashionable arts, such as Convolutional Neural Networks (CNNs), Attention and Transformer models, from the perspective of texture and structure information for image inpainting. Extensive experiments over the benchmarks validate its superiority. Our code is available here

Rethinking Open-World Object Detection in Autonomous Driving Scenarios

  • Zeyu Ma
  • Yang Yang
  • Guoqing Wang
  • Xing Xu
  • Heng Tao Shen
  • Mingxing Zhang

Existing object detection models have been demonstrated to successfully discriminate and localize predefined object categories under seen or similar situations. However, open-world object detection, as required by autonomous driving perception systems, refers to recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen object categories. On the other hand, the domain differences across different scenarios make it additionally necessary to take the domain gap into consideration by aligning the sample or label distribution. Aiming to resolve these two challenges simultaneously, we first design a pre-training model that formulates the mappings between visual images and semantic embeddings from extra annotations as guidance to link the seen and unseen object categories in a self-supervised manner. Within this formulation, domain adaptation is then utilized to extract domain-agnostic feature representations and alleviate the misdetection of unseen objects caused by domain appearance changes. As a result, the more realistic and practical open-world object detection problem is addressed and resolved by our novel formulation, which can detect unseen categories from unseen domains without any bounding box annotations, while showing no obvious performance drop in detecting the seen categories. We are the first to formulate a unified model for this open-world task and establish a new state-of-the-art performance for this challenge.

MVLayoutNet: 3D Layout Reconstruction with Multi-view Panoramas

  • Zhihua Hu
  • Bo Duan
  • Yanfeng Zhang
  • Mingwei Sun
  • Jingwei Huang

We present MVLayoutNet, a network for holistic 3D reconstruction from multi-view panoramas. Our core contribution is to seamlessly combine learned monocular layout estimation and multi-view stereo (MVS) for accurate layout reconstruction in both 3D and image space. We jointly train a layout module to produce an initial layout and a novel MVS module to obtain accurate layout geometry. Unlike standard MVSNet, our MVS module takes a newly-proposed layout cost volume, which aggregates multi-view costs at the same depth layer into corresponding layout elements. We additionally provide an attention-based scheme that guides the MVS module to focus on structural regions. Such a design considers both local pixel-level costs and global holistic information for better reconstruction. Experiments show that our method outperforms the state of the art in terms of depth RMSE by 21.7% and 41.2% on the 2D-3D-S [1] and ZInD [4] datasets, respectively. For complex scenes with multiple rooms, our method can be applied to each layout element of a precomputed topology to accurately reconstruct a globally coherent layout geometry.

Wavelet-enhanced Weakly Supervised Local Feature Learning for Face Forgery Detection

  • Jiaming Li
  • Hongtao Xie
  • Lingyun Yu
  • Yongdong Zhang

Face forgery detection is attracting increasing attention due to the security threats caused by forged faces. Recently, local patch-based approaches have achieved promising results owing to their effective attention to local details. However, unignorable problems remain: a) local feature learning requires patch-level labels to circumvent label noise, which is not practical in real-world scenarios; b) the commonly used DCT (FFT) transform loses all spatial information, which makes it difficult to handle local details. To compensate for such limitations, a novel wavelet-enhanced weakly supervised local feature learning framework is proposed in this paper. Specifically, to supervise the learning of local features with only image-level labels, two modules are devised based on the idea of multi-instance learning: a local relation constraint module (LRCM) and a category knowledge-guided local feature aggregation module (CKLFA). LRCM constrains the maximum distance between local features of forged face images to be greater than that of real face images. CKLFA adaptively aggregates local features based on their correlation to a global embedding containing global category information. Combining these two modules, the network is encouraged to learn discriminative local features supervised only by image-level labels. Besides, a multi-level wavelet-powered feature enhancement module is developed to help the network mine local forgery artifacts from the spatio-frequency domain, which is beneficial to learning discriminative local features. Extensive experiments show that our approach outperforms previous state-of-the-art methods when only image-level labels are available and achieves comparable or even better performance than counterparts using patch-level labels.

ADGNet: Attention Discrepancy Guided Deep Neural Network for Blind Image Quality Assessment

  • Xiaoyu Ma
  • Yaqi Wang
  • Chang Liu
  • Suiyu Zhang
  • Dingguo Yu

This work explores how to efficiently incorporate semantic knowledge into blind image quality assessment and proposes an end-to-end attention discrepancy guided deep neural network for perceptual quality assessment. Our method is built on a multi-task learning framework in which two sub-tasks, semantic recognition and image quality prediction, are jointly optimized with a shared feature-extracting branch and independent spatial-attention branches. The discrepancy between semantic-aware attention and quality-aware attention is leveraged to refine the quality predictions. The proposed ADGNet is based on the observation that human visual systems exhibit different mechanisms when viewing images with different amounts of distortion. Such behavior results in variation of the attention discrepancy between the quality branch and the semantic branch, which is therefore employed to enhance the accuracy and generalization ability of our method. We systematically study the major components of our framework, and experimental results on both authentically and synthetically distorted image quality datasets demonstrate the superiority of our model compared to state-of-the-art approaches.
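
One plausible way to quantify the attention discrepancy referred to above is a symmetric KL divergence between the normalized spatial attention maps of the two branches, as in the sketch below; this is an illustrative formulation, not necessarily the one used in ADGNet.

```python
import torch

def attention_discrepancy(quality_attn, semantic_attn, eps=1e-8):
    """Symmetric KL divergence between two spatial attention maps.
    quality_attn, semantic_attn: (B, H, W) non-negative attention responses."""
    q = quality_attn.flatten(1)
    s = semantic_attn.flatten(1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)   # normalize to spatial distributions
    s = s / (s.sum(dim=1, keepdim=True) + eps)
    kl_qs = (q * ((q + eps).log() - (s + eps).log())).sum(dim=1)
    kl_sq = (s * ((s + eps).log() - (q + eps).log())).sum(dim=1)
    return 0.5 * (kl_qs + kl_sq)                 # (B,) discrepancy score per image

scores = attention_discrepancy(torch.rand(4, 14, 14), torch.rand(4, 14, 14))
```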

Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

  • Jingjing Wu
  • Pengyuan Lyu
  • Guangming Lu
  • Chengquan Zhang
  • Kun Yao
  • Wenjie Pei

Typical text spotters follow the two-stage spotting strategy: detect the precise boundary for a text instance first and then perform text recognition within the located text region. While such strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in the potential error propagation from detection to recognition. 2) The RoI cropping which bridges the detection and recognition brings noise from background and leads to information loss when pooling or interpolating from feature maps. In this work we propose the single shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them by the shared positive anchor point. Consequently, our method is able to recognize the text instances correctly even though the precise text boundaries are challenging to detect. Additionally, our method reduces the annotation cost for text detection substantially. Extensive experiments on regular-shaped benchmark and arbitrary-shaped benchmark demonstrate that our SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.

Real-World Blind Super-Resolution via Feature Matching with Implicit High-Resolution Priors

  • Chaofeng Chen
  • Xinyu Shi
  • Yipeng Qin
  • Xiaoming Li
  • Xiaoguang Han
  • Tao Yang
  • Shihui Guo

A key challenge of real-world image super-resolution (SR) is to recover the missing details in low-resolution (LR) images with complex unknown degradations (e.g., downsampling, noise and compression). Most previous works restore such missing details in the image space. To cope with the high diversity of natural images, they either rely on unstable GANs that are difficult to train and prone to artifacts, or resort to explicit references from high-resolution (HR) images that are usually unavailable. In this work, we propose Feature Matching SR (FeMaSR), which restores realistic HR images in a much more compact feature space. Unlike image-space methods, our FeMaSR restores HR images by matching distorted LR image features to their distortion-free HR counterparts in our pretrained HR priors, and decoding the matched features to obtain realistic HR images. Specifically, our HR priors contain a discrete feature codebook and its associated decoder, which are pretrained on HR images with a Vector Quantized Generative Adversarial Network (VQGAN). Notably, we incorporate a novel semantic regularization into VQGAN to improve the quality of reconstructed images. For the feature matching, we first extract LR features with an LR encoder consisting of several Swin Transformer blocks and then follow a simple nearest neighbour strategy to match them with the pretrained codebook. In particular, we equip the LR encoder with residual shortcut connections to the decoder, which is critical to the optimization of the feature matching loss and also helps to compensate for possible feature matching errors. Experimental results show that our approach produces more realistic HR images than previous methods. Code will be made publicly available.
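
The nearest-neighbour feature matching step can be sketched as replacing every LR feature vector with its closest entry in the pretrained HR codebook; the shapes below are placeholders, and the Swin-based encoder and the decoder around this step are omitted.

```python
import torch

def match_to_codebook(lr_feats, codebook):
    """Replace each LR feature vector with its nearest codebook entry (L2 distance).
    lr_feats: (B, C, H, W) encoder output; codebook: (K, C) pretrained HR codebook."""
    B, C, H, W = lr_feats.shape
    flat = lr_feats.permute(0, 2, 3, 1).reshape(-1, C)              # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                             # (B*H*W, K)
    idx = dists.argmin(dim=1)
    matched = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    return matched, idx.reshape(B, H, W)

feats = torch.randn(1, 256, 16, 16)        # placeholder LR features
codebook = torch.randn(1024, 256)          # placeholder HR codebook
matched, indices = match_to_codebook(feats, codebook)
```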

Leveraging GAN Priors for Few-Shot Part Segmentation

  • Mengya Han
  • Heliang Zheng
  • Chaoyue Wang
  • Yong Luo
  • Han Hu
  • Bo Du

Few-shot part segmentation aims to separate different parts of an object given only a few annotated samples. Due to the challenge of limited data, existing works mainly focus on learning classifiers over pre-trained features, failing to learn task-specific features for part segmentation. In this paper, we propose to learn task-specific features in a "pre-training"-"fine-tuning" paradigm. We conduct prompt designing to reduce the gap between the pre-training task (i.e., image generation) and the downstream task (i.e., part segmentation), so that the GAN priors for generation can be leveraged for segmentation. This is achieved by projecting part segmentation maps into the RGB space and interpolating between the RGB segmentation maps and the original images. Specifically, we design a fine-tuning strategy that progressively tunes an image generator into a segmentation generator, where the supervision of the generator varies from images to segmentation maps via interpolation. Moreover, we propose a two-stream architecture, i.e., a segmentation stream to generate task-specific features, and an image stream to provide spatial constraints. The image stream can be regarded as a self-supervised auto-encoder, which enables our model to benefit from large-scale support images. Overall, this work is an attempt to explore the internal relevance between generation tasks and perception tasks through prompt designing. Extensive experiments show that our model can achieve state-of-the-art performance on several part segmentation datasets.

MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning

  • Bo Fang
  • Wenhao Wu
  • Chang Liu
  • Yu Zhou
  • Dongliang He
  • Weipinng Wang

Contrastive self-supervised learning (CSL) has remarkably promoted the progress of visual representation learning. However, existing video CSL methods mainly focus on clip-level temporal semantic consistency. The temporal and spatial semantic correspondence across different granularities, i.e., video, clip, and frame levels, is typically overlooked. To tackle this issue, we propose a self-supervised Macro-to-Micro Semantic Correspondence (MaMiCo) learning framework, pursuing fine-grained spatiotemporal representations from a macro-to-micro perspective. Specifically, MaMiCo constructs a multiple branch architecture of T-MaMiCo and S-MaMiCo on a temporally-nested clip pyramid (video-to-frame). On the pyramid, T-MaMiCo aims at temporal correspondence by simultaneously assimilating semantic invariance representations and retaining appearance dynamics in long temporal ranges. For spatial correspondence, S-MaMiCo perceives subtle motion cues via ameliorating dense CSL for videos where stationary clips are applied for stably dense contrasting reference to alleviate semantic inconsistency caused by "mismatching". Extensive experiments justify that MaMiCo learns rich general video representations and works well on various downstream tasks, e.g., (fine-grained) action recognition, action localization, and video retrieval.

ChebyLighter: Optimal Curve Estimation for Low-light Image Enhancement

  • Jinwang Pan
  • Deming Zhai
  • Yuanchao Bai
  • Junjun Jiang
  • Debin Zhao
  • Xianming Liu

Low-light enhancement aims to recover a high contrast normal light image from a low-light image with bad exposure and low contrast. Inspired by curve adjustment in photo editing software and Chebyshev approximation, this paper presents a novel model for brightening low-light images. The proposed model, ChebyLighter, learns to estimate pixel-wise adjustment curves for a low-light image recurrently to reconstruct an enhanced output. In ChebyLighter, Chebyshev image series are first generated. Then pixel-wise coefficient matrices are estimated with Triple Coefficient Estimation (TCE) modules and the final enhanced image is recurrently reconstructed by Chebyshev Attention Weighted Summation (CAWS). The TCE module is specifically designed based on dual attention mechanism with three necessary inputs. Our method can achieve ideal performance because adjustment curves can be obtained with numerical approximation by our model. With extensive quantitative and qualitative experiments on diverse test images, we demonstrate that the proposed method performs favorably against state-of-the-art low-light image enhancement algorithms.
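
The Chebyshev machinery is easy to sketch: generate the image series with the recurrence T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), then form a pixel-wise coefficient-weighted sum. In ChebyLighter the coefficient matrices are predicted by the TCE modules; here they are random placeholders.

```python
import torch

def chebyshev_image_series(x, order):
    """Chebyshev polynomials T_k applied pixel-wise to an image x scaled to [-1, 1],
    using the recurrence T_{k+1} = 2*x*T_k - T_{k-1}."""
    series = [torch.ones_like(x), x]
    for _ in range(2, order + 1):
        series.append(2 * x * series[-1] - series[-2])
    return series[: order + 1]

def weighted_summation(series, coeffs):
    """Pixel-wise weighted sum of the series; in ChebyLighter the coefficient
    matrices would come from the TCE modules rather than being given."""
    return sum(c * t for c, t in zip(coeffs, series))

low = torch.rand(1, 3, 64, 64) * 2 - 1            # low-light image scaled to [-1, 1]
series = chebyshev_image_series(low, order=4)
coeffs = [torch.rand_like(low) for _ in series]   # placeholder coefficient matrices
enhanced = weighted_summation(series, coeffs)
```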

Bayesian based Re-parameterization for DNN Model Pruning

  • Xiaotong Lu
  • Teng Xi
  • Baopu Li
  • Gang Zhang
  • Weisheng Dong
  • Guangming Shi

Filter pruning, an effective strategy for obtaining efficient compact structures from over-parameterized deep neural networks (DNNs), has attracted a lot of attention. Previous pruning methods select channels for pruning by developing different criteria, yet little attention has been devoted to whether these criteria can represent correlations between channels. Meanwhile, most existing methods generally ignore the parameters being pruned and only perform additional training on the retained network to reduce accuracy loss. In this paper, we present a novel perspective of re-parametric pruning via Bayesian estimation. First, we estimate the probability distribution of different channels based on Bayesian estimation and indicate the importance of the channels by the discrepancy in the distribution before and after channel pruning. Second, to minimize the variation in distribution after pruning, we re-parameterize the pruned network based on the probability distribution to pursue optimal pruning. We evaluate our approach on popular datasets with several typical network architectures, and comprehensive experimental results validate that this method demonstrates better performance compared to state-of-the-art approaches.

ReCoRo: Region-Controllable Robust Light Enhancement with User-Specified Imprecise Masks

  • Dejia Xu
  • Hayk Poghosyan
  • Shant Navasardyan
  • Yifan Jiang
  • Humphrey Shi
  • Zhangyang Wang

Low-light enhancement is an increasingly important function in image editing and visual creation. Most existing enhancing algorithms are trained to enlighten a given image in a globally homogeneous way, and (implicitly) to some predefined extent of brightness. They are neither capable of enhancing only local regions of interest ("where") while keeping the overall visual appearance plausible, nor of producing outputs at a range of different illumination levels ("how much"). Those hurdles significantly limit the prospect of flexible, customizable, or even user-interactive low-light enhancement. To address these gaps, we propose Region-Controllable Robust Light Enhancement (ReCoRo), a novel framework that allows users to directly specify "where" and "how much" they want to enhance in an input low-light image; meanwhile, the model learns to intelligently maintain an overall consistent visual appearance and plausible composition via a discriminator. Moreover, since in practical mobile apps such user specifications often come in imprecise forms (e.g., finger-drawn masks), we propose to bake domain-specific data augmentations into the training of ReCoRo, so that the learned model gains resilience to various roughly-supplied user masks. To the best of our knowledge, ReCoRo is the first of its kind that allows the user to localize the enlightenment region as well as to control the light intensity. Extensive experiments clearly demonstrate that ReCoRo outperforms state-of-the-art methods in terms of qualitative results, quantitative metrics, and versatile controllability. Project repository: https://bit.ly/ReCoRo-lowlight.

Domain-Specific Fusion Of Objective Video Quality Metrics

  • Aaron Chadha
  • Ioannis Katsavounidis
  • Ayan Kumar Bhunia
  • Cosmin Stejerean
  • Mohammad Umar Karim Khan
  • Yiannis Andreopoulos

Video processing algorithms like video upscaling, denoising, and compression are now increasingly optimized for perceptual quality metrics instead of signal distortion. This means that they may score well for metrics like video multi-method assessment fusion (VMAF), but this may be because of metric overfitting. This imposes the need for costly subjective quality assessments that cannot scale to large datasets and large parameter explorations. We propose a methodology that fuses multiple quality metrics based on small-scale subjective testing in order to unlock their use at scale for specific application domains of interest. This is achieved by employing pseudo-random sampling of the resolution, quality range and test video content available, which is initially guided by quality metrics in order to cover the quality range useful to each application. The selected samples then undergo a subjective test, such as ITU-T P.910 absolute categorical rating, with the results of the test postprocessed and used as the means to derive the best combination of multiple objective metrics using support vector regression. We showcase the benefits of this approach in two applications: video encoding with and without perceptual preprocessing, and deep video denoising & upscaling of compressed content. For both applications, the derived fusion of metrics allows for a more robust alignment to mean opinion scores than a perceptually-uninformed combination of the original metrics themselves. The dataset and code are available at https://github.com/isize-tech/VideoQualityFusion.
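
As an illustration of the fusion step, the sketch below fits a support vector regressor that maps a vector of objective metric scores to mean opinion scores, as one might do with scikit-learn; the metric columns, toy scores and hyperparameters are assumptions for illustration, not the released code.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative data: rows are test clips, columns are objective metric scores
# (e.g. VMAF, PSNR, SSIM); y holds mean opinion scores from a small subjective test.
X = np.array([[92.1, 41.3, 0.982],
              [63.5, 33.0, 0.921],
              [48.7, 29.4, 0.880],
              [80.2, 37.8, 0.955]])
y = np.array([4.5, 3.4, 2.6, 4.0])

# Fit a support vector regressor that fuses the metrics into a single
# MOS predictor for the application domain of interest.
fusion = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
fusion.fit(X, y)

print(fusion.predict([[75.0, 36.0, 0.94]]))   # predicted MOS for a new clip
```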

Learning for Motion Deblurring with Hybrid Frames and Events

  • Wen Yang
  • Jinjian Wu
  • Jupo Ma
  • Leida Li
  • Weisheng Dong
  • Guangming Shi

An event camera responds to brightness changes at each pixel independently with microsecond accuracy. Event cameras offer the attractive property of recording high-speed scenes well while ignoring static, non-moving areas, whereas conventional frame cameras acquire the full intensity information of the scene but suffer from motion blur. Therefore, it is desirable to combine the best of the two cameras to reconstruct high-quality intensity frames with no motion blur. The human visual system presents a two-pathway procedure for non-action-based representation and object motion perception, which corresponds well to the hybrid frame and event data. In this paper, inspired by the two-pathway visual system, a novel dual-stream based framework is proposed for motion deblurring (DS-Deblur), which flexibly utilizes the respective advantages of frames and events. A complementary-unique information splitting based feature fusion module is first proposed to adaptively aggregate the frame and event progressively at multiple levels, which is well-grounded in the hierarchical processing of the two-pathway visual system. Then, a recurrent spatio-temporal feature transformation module is designed to exploit relevant information between adjacent frames, in which features of both current and previous frames are transformed in a global-local manner. Extensive experiments on both synthetic and real motion blur datasets demonstrate that our method achieves state-of-the-art performance. Project website: https://github.com/wyang-vis/Motion-Deblurringwith-Hybrid-Frames-and-Events.

Bidirectional Self-Training with Multiple Anisotropic Prototypes for Domain Adaptive Semantic Segmentation

  • Yulei Lu
  • Yawei Luo
  • Li Zhang
  • Zheyang Li
  • Yi Yang
  • Jun Xiao

A thriving trend in domain adaptive segmentation is to generate high-quality pseudo labels for the target domain and retrain the segmentor on them. Under this self-training paradigm, some competitive methods have resorted to latent-space information, which establishes the feature centroids (a.k.a. prototypes) of the semantic classes and determines the pseudo label candidates by their distances from these centroids. In this paper, we argue that the latent space contains more information to be exploited and thus take one step further to capitalize on it. Firstly, instead of merely using the source-domain prototypes to determine the target pseudo labels as most traditional methods do, we bidirectionally produce the target-domain prototypes to degrade those source features which might be too hard or disturbed for the adaptation. Secondly, existing attempts simply model each category as a single and isotropic prototype while ignoring the variance of the feature distribution, which could lead to the confusion of similar categories. To cope with this issue, we propose to represent each category with multiple and anisotropic prototypes via a Gaussian Mixture Model, in order to fit the de facto distribution of the source domain and estimate the likelihood of target samples based on the probability density. We apply our method on the GTA5->Cityscapes and Synthia->Cityscapes tasks and achieve 61.2% and 62.8% respectively in terms of mean IoU, substantially outperforming other competitive self-training methods. Noticeably, in some categories which severely suffer from categorical confusion, such as "truck" and "bus", our method achieves 56.4% and 68.8% respectively, which further demonstrates the effectiveness of our design. The code and model are available at https://github.com/luyvlei/BiSMAPs.
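
The multiple-anisotropic-prototype idea can be sketched with off-the-shelf Gaussian mixtures: fit one full-covariance GMM per class on source features and pseudo-label target features by log-likelihood. The component count, threshold and function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features, labels, n_components=3):
    """Fit one full-covariance GMM per semantic class on source features,
    i.e. multiple anisotropic prototypes instead of a single centroid.
    Assumes each class has enough feature vectors to fit the mixture."""
    gmms = {}
    for c in np.unique(labels):
        gmms[c] = GaussianMixture(n_components=n_components,
                                  covariance_type="full").fit(features[labels == c])
    return gmms

def pseudo_label(gmms, target_features, threshold=-20.0):
    """Assign each target feature to the class with the highest
    log-likelihood; low-confidence features are left unlabeled (-1)."""
    classes = sorted(gmms)
    ll = np.stack([gmms[c].score_samples(target_features) for c in classes], axis=1)
    labels = np.array(classes)[ll.argmax(1)]
    labels[ll.max(1) < threshold] = -1
    return labels
```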

Semi-supervised Crowd Counting via Density Agency

  • Hui Lin
  • Zhiheng Ma
  • Xiaopeng Hong
  • Yaowei Wang
  • Zhou Su

In this paper, we propose a new agency-guided semi-supervised counting approach. First, we build a learnable auxiliary structure, namely the density agency to bring the recognized foreground regional features close to corresponding density sub-classes (agents) and push away background ones. Second, we propose a density-guided contrastive learning loss to consolidate the backbone feature extractor. Third, we build a regression head by using a transformer structure to refine the foreground features further. Finally, an efficient noise depression loss is provided to minimize the negative influence of annotation noises. Extensive experiments on four challenging crowd counting datasets demonstrate that our method achieves superior performance to the state-of-the-art semi-supervised counting methods by a large margin. The code is available at https://github.com/LoraLinH/Semi-supervised-Crowd-Counting-via-Density-A....

AEDNet: Asynchronous Event Denoising with Spatial-Temporal Correlation among Irregular Data

  • Huachen Fang
  • Jinjian Wu
  • Leida Li
  • Junhui Hou
  • Weisheng Dong
  • Guangming Shi

The Dynamic Vision Sensor (DVS) is a compelling neuromorphic camera compared to conventional cameras, but it suffers from much fiercer noise. Due to its irregular format and asynchronous readout, DVS data is usually transformed into a regular tensor (e.g., a 3D voxel grid or an image) for deep learning methods, which corrupts its asynchronous nature. To maintain the asynchronous property, we establish an innovative asynchronous event denoising neural network, named AEDNet, which directly consumes the correlation of the irregular signal in the spatial-temporal range without destroying its original structural properties. Based on the continuity in the temporal domain and the discreteness in the spatial domain, we decompose the DVS signal into two parts, i.e., temporal correlation and spatial affinity, and process these two parts separately. Our spatial feature embedding unit is a unique feature extraction module that extracts features at the event level, which perfectly maintains the spatial-temporal correlation. To test effectiveness, we build a novel dataset named DVSCLEAN containing both simulated and real-world data. Experimental results show that AEDNet achieves state-of-the-art performance.

Learnability Enhancement for Low-light Raw Denoising: Where Paired Real Data Meets Noise Modeling

  • Hansen Feng
  • Lizhi Wang
  • Yuzhi Wang
  • Hua Huang

Low-light raw denoising is an important and valuable task in computational photography where learning-based methods trained with paired real data are mainstream. However, the limited data volume and complicated noise distribution have constituted a learnability bottleneck for paired real data, which limits the denoising performance of learning-based methods. To address this issue, we present a learnability enhancement strategy to reform paired real data according to noise modeling. Our strategy consists of two efficient techniques: shot noise augmentation (SNA) and dark shading correction (DSC). Through noise model decoupling, SNA improves the precision of data mapping by increasing the data volume and DSC reduces the complexity of data mapping by reducing the noise complexity. Extensive results on the public datasets and real imaging scenarios collectively demonstrate the state-of-the-art performance of our method.
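
A loose sketch of the shot-noise-augmentation idea is given below: extra Poisson-distributed photon noise, consistent with an assumed system gain, is added to a raw frame to enlarge the paired data. This only illustrates the principle; the gain, scale factor and function are assumptions and not the paper's exact SNA formulation.

```python
import numpy as np

def shot_noise_augment(raw, gain, extra_photon_scale=0.5, rng=None):
    """Augment a (dark-frame-corrected) raw image with additional shot noise.

    The clean signal is approximated by the raw values divided by the system
    gain (i.e. photo-electrons), extra zero-mean Poisson noise is drawn at a
    reduced scale, and the result is mapped back to digital numbers."""
    rng = np.random.default_rng() if rng is None else rng
    electrons = np.clip(raw / gain, 0, None)
    lam = electrons * extra_photon_scale
    extra = rng.poisson(lam) - lam                 # zero-mean shot-noise sample
    return raw + gain * extra.astype(np.float32)

# toy usage on a synthetic raw patch with an assumed gain
raw = (np.random.rand(32, 32).astype(np.float32) * 1000.0)
noisier = shot_noise_augment(raw, gain=4.0)
```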

Multi-Modal Experience Inspired AI Creation

  • Qian Cao
  • Xu Chen
  • Ruihua Song
  • Hao Jiang
  • Guang Yang
  • Zhao Cao

AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences. More specifically, we study how to generate texts based on sequential multi-modal information. Compared with previous works, this task is much more difficult because the designed model has to understand and adapt the semantics among different modalities well and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments by comparing our model with a series of representative baselines, where our model demonstrates significant improvements on both automatic and human-centered metrics. The code and data are available at: https://github.com/Aman-4-Real/MMTG.

Factorized and Controllable Neural Re-Rendering of Outdoor Scene for Photo Extrapolation

  • Boming Zhao
  • Bangbang Yang
  • Zhenyang Li
  • Zuoyue Li
  • Guofeng Zhang
  • Jiashu Zhao
  • Dawei Yin
  • Zhaopeng Cui
  • Hujun Bao

Expanding an existing tourist photo from a partially captured scene to a full scene is one of the desired experiences for photography applications. Although photo extrapolation has been well studied, it is much more challenging to extrapolate a photo (i.e., selfie) from a narrow field of view to a wider one while maintaining a similar visual style. In this paper, we propose a factorized neural re-rendering model to produce photorealistic novel views from cluttered outdoor Internet photo collections, which enables the applications including controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Specifically, we first develop a novel factorized re-rendering pipeline to handle the ambiguity in the decomposition of geometry, appearance and illumination. We also propose a composited training strategy to tackle the unexpected occlusion in Internet images. Moreover, to enhance photo-realism when extrapolating tourist photographs, we propose a novel realism augmentation process to complement appearance details, which automatically propagates the texture details from a narrow captured photo to the extrapolated neural rendered image. The experiments and photo editing examples on outdoor scenes demonstrate the superior performance of our proposed method in both photo-realism and downstream applications. Code and the supplementary material are available on the project webpage: https://zju3dv.github.io/neural_outdoor_rerender/.

On Generating Identifiable Virtual Faces

  • Zhuowen Yuan
  • Zhengxin You
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang
  • Alex Kot

Face anonymization with generative models has become increasingly prevalent, since such models sanitize private information by generating virtual face images, ensuring both privacy and image utility. Such virtual face images are usually not identifiable after the removal or protection of the original identity. In this paper, we formalize and tackle the problem of generating identifiable virtual face images. Our virtual face images are visually different from the original ones for privacy protection. In addition, they are bound with new virtual identities, which can be directly used for face recognition. We propose an Identifiable Virtual Face Generator (IVFG) to generate the virtual face images. The IVFG projects the latent vectors of the original face images into virtual ones according to a user-specific key, based on which the virtual face images are generated. To make the virtual face images identifiable, we propose a multi-task learning objective as well as a triplet-style training strategy to learn the IVFG. We evaluate the performance of our virtual face images using different face recognizers on different face image datasets, all of which demonstrate the effectiveness of the IVFG in generating identifiable virtual face images.

Keyword Spotting in the Homomorphic Encrypted Domain Using Deep Complex-Valued CNN

  • Peijia Zheng
  • Zhiwei Cai
  • Huicong Zeng
  • Jiwu Huang

In this paper, we propose a non-interactive scheme to achieve end-to-end keyword spotting in the homomorphic encrypted domain using deep learning techniques. We carefully designed a complex-valued convolutional neural network (CNN) structure for the encrypted domain keyword spotting to take full advantage of the limited multiplicative depth. At the same depth, the proposed complex-valued CNN can learn more speech representations than the real-valued CNN, thus achieving higher accuracy in keyword spotting. The complex activation function of the complex-valued CNN is non-arithmetic and cannot be supported by homomorphic encryption. To implement the complex activation function in the encrypted domain without interaction, we design methods to approximate complex activation functions with low-degree polynomials while preserving the keyword spotting performance. Our scheme supports single-instruction multiple-data (SIMD), which reduces the total size of ciphertexts and improves computational efficiency. We conducted extensive experiments to investigate our performance with various metrics, such as accuracy, robustness, and F1-score. The experimental results show that our approach significantly outperforms the state-of-the-art solutions on every metric.
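
The polynomial-approximation step can be illustrated with a least-squares fit over the range where encrypted pre-activations are expected to lie; only additions and multiplications remain, which homomorphic encryption supports natively. For simplicity the sketch fits a real ReLU-like function (the paper targets complex activations), and the degree and input range are assumptions.

```python
import numpy as np

def poly_approx(fn, degree=3, lo=-4.0, hi=4.0, n=2001):
    """Least-squares fit of a low-degree polynomial to an activation over
    the range where encrypted pre-activations are assumed to lie."""
    x = np.linspace(lo, hi, n)
    return np.polynomial.Polynomial.fit(x, fn(x), degree).convert()

# Approximate a ReLU-like activation with a degree-3 polynomial.
relu_like = poly_approx(lambda x: np.maximum(x, 0.0), degree=3)
print(relu_like)                 # coefficients c0 + c1*x + c2*x^2 + c3*x^3
print(relu_like(1.5), max(1.5, 0.0))   # polynomial value vs exact value
```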

Cycle-Interactive Generative Adversarial Network for Robust Unsupervised Low-Light Enhancement

  • Zhangkai Ni
  • Wenhan Yang
  • Hanli Wang
  • Shiqi Wang
  • Lin Ma
  • Sam Kwong

Free from the fundamental limitation of fitting to paired training data, recent unsupervised low-light enhancement methods excel at adjusting the illumination and contrast of images. However, for unsupervised low-light enhancement, the remaining noise suppression issue, caused by the lack of supervision on detailed signals, largely impedes the wide deployment of these methods in real-world applications. Herein, we propose a novel Cycle-Interactive Generative Adversarial Network (CIGAN) for unsupervised low-light image enhancement, which is capable of not only better transferring illumination distributions between low/normal-light images but also manipulating detailed signals between the two domains, e.g., suppressing/synthesizing realistic noise in the cyclic enhancement/degradation process. In particular, the proposed low-light guided transformation feed-forwards the features of low-light images from the generator of the enhancement GAN (eGAN) into the generator of the degradation GAN (dGAN). With the learned information of real low-light images, dGAN can synthesize more realistic diverse illumination and contrast in low-light images. Moreover, the feature randomized perturbation module in dGAN learns to increase the feature randomness to produce diverse feature distributions, encouraging the synthesized low-light images to contain realistic noise. Extensive experiments demonstrate both the superiority of the proposed method and the effectiveness of each module in CIGAN.

Skeleton2Humanoid: Animating Simulated Characters for Physically-plausible Motion In-betweening

  • Yunhao Li
  • Zhenbo Yu
  • Yucheng Zhu
  • Bingbing Ni
  • Guangtao Zhai
  • Wei Shen

Human motion synthesis is a long-standing problem with various applications in digital twins and the Metaverse. However, modern deep learning based motion synthesis approaches barely consider the physical plausibility of synthesized motions and consequently they usually produce unrealistic human motions. In order to solve this problem, we propose a system, "Skeleton2Humanoid", which performs physics-oriented motion correction at test time by regularizing synthesized skeleton motions in a physics simulator. Concretely, our system consists of three sequential stages: (I) test time motion synthesis network adaptation, (II) skeleton to humanoid matching and (III) motion imitation based on reinforcement learning (RL). Stage I introduces a test time adaptation strategy, which improves the physical plausibility of synthesized human skeleton motions by optimizing skeleton joint locations. Stage II performs an analytical inverse kinematics strategy, which converts the optimized human skeleton motions to humanoid robot motions in a physics simulator; the converted humanoid robot motions can then serve as reference motions for the RL policy to imitate. Stage III introduces a curriculum residual force control policy, which drives the humanoid robot to mimic complex converted reference motions in accordance with physical laws. We verify our system on a typical human motion synthesis task, motion in-betweening. Experiments on the challenging LaFAN1 dataset show our system can outperform prior methods significantly in terms of both physical plausibility and accuracy. Code will be released for research purposes at: https://github.com/michaelliyunhao/Skeleton2Humanoid.

Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression

  • Jiahao Li
  • Bin Li
  • Yan Lu

For a neural video codec, it is critical, yet challenging, to design an efficient entropy model which can accurately predict the probability distribution of the quantized latent representation. However, most existing video codecs directly use the ready-made entropy model from image codecs to encode the residual or motion, and do not fully leverage the spatial-temporal characteristics of video. To this end, this paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies. In particular, we introduce the latent prior, which exploits the correlation among the latent representation to squeeze the temporal redundancy. Meanwhile, the dual spatial prior is proposed to reduce the spatial redundancy in a parallel-friendly manner. In addition, our entropy model is also versatile. Besides estimating the probability distribution, our entropy model also generates the quantization step in a spatial-channel-wise manner. This content-adaptive quantization mechanism not only helps our codec achieve smooth rate adjustment in a single model but also improves the final rate-distortion performance through dynamic bit allocation. Experimental results show that, powered by the proposed entropy model, our neural codec can achieve 18.2% bitrate saving on the UVG dataset when compared with H.266 (VTM) using the highest compression ratio configuration. This marks a new milestone in the development of neural video codecs. The codes are at https://github.com/microsoft/DCVC.

Geometric Warping Error Aware CNN for DIBR Oriented View Synthesis

  • Shuai Li
  • Kaixin Wang
  • Yanbo Gao
  • Xun Cai
  • Mao Ye

Depth Image based Rendering (DIBR) oriented view synthesis is an important virtual view generation technique. It warps the reference view images to the target viewpoint based on their depth maps, without requiring many available viewpoints. However, in the 3D warping process, pixels are warped to fractional pixel locations and then rounded (or interpolated) to integer pixels, resulting in geometric warping error and reducing the image quality. This resembles, to some extent, the image super-resolution problem, but with unfixed fractional pixel locations. To address this problem, we propose a geometric warping error aware CNN (GWEA) framework to enhance DIBR oriented view synthesis. First, a deformable convolution based geometric warping error aware alignment (GWEA-DCA) module is developed, taking advantage of the geometric warping error preserved in the DIBR module. The offset learned in the deformable convolution can account for the geometric warping error to facilitate the mapping from the fractional pixels to integer pixels. Moreover, in view that the pixels in the warped images are of different qualities due to the different strengths of warping errors, an attention enhanced view blending (GWEA-AttVB) module is further developed to adaptively fuse the pixels from different warped images. Finally, a partial convolution based hole filling and refinement module fills the remaining holes and improves the quality of the overall image. Experiments show that our model can synthesize higher-quality images than existing methods, and an ablation study is also conducted, validating the effectiveness of each proposed module.

SESSION: Poster Session VI: Experience -- Multimedia Applications

FedMed-ATL: Misaligned Unpaired Cross-Modality Neuroimage Synthesis via Affine Transform Loss

  • Jinbao Wang
  • Guoyang Xie
  • Yawen Huang
  • Yefeng Zheng
  • Yaochu Jin
  • Feng Zheng

The existence of completely aligned and paired multi-modal neuroimaging data has proved its effectiveness in the diagnosis of brain diseases. However, collecting the full set of well-aligned and paired data is impractical, since the practical difficulties may include high cost, long acquisition times, image corruption, and privacy issues. Previously, misaligned unpaired neuroimaging data (termed MUD) have generally been treated as noisy labels. However, such noisy-label-based methods fail when the misaligned data are severely distorted, for example, when the rotation angles differ. In this paper, we propose a novel federated self-supervised learning framework (FedMed) for brain image synthesis. An affine transform loss (ATL) is formulated to make use of severely distorted images without violating hospitals' privacy legislation. We then introduce a new data augmentation procedure for self-supervised training and feed it into three auxiliary heads, namely auxiliary rotation, auxiliary translation, and auxiliary scaling heads. The proposed method demonstrates advanced performance in terms of the quality of the synthesized results under a severely misaligned and unpaired data setting, as well as better stability than other GAN-based algorithms. The proposed method also reduces the demand for deformable registration while encouraging the use of misaligned and unpaired data. Experimental results verify the outstanding performance of our learning paradigm compared to other state-of-the-art approaches.

Towards Blind Watermarking: Combining Invertible and Non-invertible Mechanisms

  • Rui Ma
  • Mengxi Guo
  • Yi Hou
  • Fan Yang
  • Yuan Li
  • Huizhu Jia
  • Xiaodong Xie

Blind watermarking provides powerful evidence for copyright protection, image authentication, and tampering identification. However, it remains a challenge to design a watermarking model with high imperceptibility and robustness against strong noise attacks. To resolve this issue, we present a framework Combining the Invertible and Non-invertible (CIN) mechanisms. The CIN is composed of an invertible part to achieve high imperceptibility and a non-invertible part to strengthen the robustness against strong noise attacks. For the invertible part, we develop a diffusion and extraction module (DEM) and a fusion and split module (FSM) to embed and extract watermarks symmetrically in an invertible way. For the non-invertible part, we introduce a non-invertible attention-based module (NIAM) and a noise-specific selection module (NSM) to solve the asymmetric extraction under a strong noise attack. Extensive experiments demonstrate that our framework significantly outperforms the current state-of-the-art methods in terms of imperceptibility and robustness. Our framework achieves an average of 99.99% accuracy and 67.66 dB PSNR under noise-free conditions, and 96.64% and 39.28 dB under combined strong noise attacks. The code will be available at https://github.com/RM1110/CIN.

Improving Transferability for Domain Adaptive Detection Transformers

  • Kaixiong Gong
  • Shuang Li
  • Shugang Li
  • Rui Zhang
  • Chi Harold Liu
  • Qiang Chen

DETR-style detectors stand out in in-domain scenarios, but their behavior under domain shift is under-explored. This paper aims to build a simple but effective baseline with a DETR-style detector in domain shift settings based on two findings. For one, mitigating the domain shift on the backbone and decoder output features yields favorable results. For another, advanced domain alignment methods in both parts further enhance the performance. Thus, we propose the Object-Aware Alignment (OAA) module and the Optimal Transport based Alignment (OTA) module to achieve comprehensive domain alignment on the outputs of the backbone and the detector. The OAA module aligns the foreground regions identified by pseudo-labels in the backbone outputs, leading to domain-invariant base features. The OTA module utilizes the sliced Wasserstein distance to maximize the retention of location information while minimizing the domain gap in the decoder outputs. We incorporate these findings and the alignment modules into our adaptation method, which serves as a benchmark for DETR-style detectors in domain shift settings. Experiments on various domain adaptive scenarios validate the effectiveness of our method.
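
A minimal sketch of the sliced Wasserstein distance between two feature sets (e.g., source vs. target decoder outputs) is shown below; the projection count, quantile grid and function signature are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sliced_wasserstein(feat_a, feat_b, n_projections=64, rng=None):
    """Approximate the sliced Wasserstein distance between two feature sets
    of shape (N, D) and (M, D): project onto random unit directions, sort
    the 1-D projections, and average the gap between matched quantiles."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = feat_a.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    pa = np.sort(feat_a @ theta.T, axis=0)        # (N, n_projections)
    pb = np.sort(feat_b @ theta.T, axis=0)        # (M, n_projections)
    q = np.linspace(0.0, 1.0, 128)                # common quantile grid
    qa = np.quantile(pa, q, axis=0)
    qb = np.quantile(pb, q, axis=0)
    return np.abs(qa - qb).mean()

# toy usage with random source/target features
print(sliced_wasserstein(np.random.randn(200, 256), np.random.randn(180, 256) + 0.5))
```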

Support for Teaching Mathematics of the Blind by Sighted Tutors Through Multisensual Access to Formulas with Braille Converters and Speech

  • Dariusz Mikulowski

Nowadays, teaching various subjects at school is successfully supported by information and remote technologies such as Google Class, Moodle and others. Nevertheless, students with special needs such as the visually impaired (BVI) face considerable barriers to using such remote technologies, especially when learning mathematics or physics. The main problem is that BVI students use different tools and techniques than their sighted peers, e.g., a different way of working with mathematical expressions or the inability to edit graphics. Traditional methods such as the Brailler, figure models or cubarithms are still used. Another challenge is that different countries use entirely different systems for presenting formulas, the so-called Braille mathematical notations. To overcome these barriers, we propose universal tools to assist sighted teachers and BVI students in remote mathematics training using a multimodal form of editing mathematical formulas. It combines three simultaneous presentations of a math formula: a graphical form for the teacher, intelligent reading through speech synthesis, and Braille mathematical notation for the BVI student. This is made possible by intelligent converters between formats such as MathML, intelligent text and Braille, and by dedicated editors that allow students and teachers to create math documents.

Geometry Aligned Variational Transformer for Image-conditioned Layout Generation

  • Yunning Cao
  • Ye Ma
  • Min Zhou
  • Chuanbin Liu
  • Hongtao Xie
  • Tiezheng Ge
  • Yuning Jiang

Layout generation is a novel task in computer vision, which combines the challenges of both object localization and aesthetic appraisal, and is widely used in advertisement, poster and slide design. An accurate and pleasant layout should consider both the intra-domain relationship within layout elements and the inter-domain relationship between layout elements and the image. However, most previous methods simply focus on image-content-agnostic layout generation, without leveraging the complex visual information from the image. To this end, we explore a novel paradigm entitled image-conditioned layout generation, which aims to add text overlays to an image in a semantically coherent manner. Specifically, we propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image. First, a self-attention mechanism is adopted to model the contextual relationship within layout elements, while a cross-attention mechanism is used to fuse the visual information of conditional images. Subsequently, we take them as building blocks of a conditional variational autoencoder (CVAE), which demonstrates appealing diversity. Second, in order to alleviate the gap between the layout element domain and the visual domain, we design a Geometry Alignment module, in which the geometric information of the image is aligned with the layout representation. In addition, we construct a large-scale advertisement poster layout design dataset with delicate layout and saliency map annotations. Experimental results show that our model can adaptively generate layouts in the non-intrusive area of the image, resulting in a harmonious layout design.

PVSeRF: Joint Pixel-, Voxel- and Surface-Aligned Radiance Field for Single-Image Novel View Synthesis

  • Xianggang Yu
  • Jiapeng Tang
  • Yipeng Qin
  • Chenghong Li
  • Xiaoguang Han
  • Linchao Bao
  • Shuguang Cui

We present PVSeRF, a learning framework that reconstructs neural radiance fields from single-view RGB images, for novel view synthesis. Previous solutions, such as pixelNeRF, rely only on pixel-aligned features and suffer from feature ambiguity issues. As a result, they struggle with the disentanglement of geometry and appearance, leading to implausible geometries and blurry results. To address this challenge, we propose to incorporate explicit geometry reasoning and combine it with pixel-aligned features for radiance field prediction. Specifically, in addition to pixel-aligned features, we further constrain the radiance field learning to be conditioned on i) voxel-aligned features learned from a coarse volumetric grid and ii) fine surface-aligned features extracted from a regressed point cloud. We show that the introduction of such geometry-aware features helps to achieve a better disentanglement between appearance and geometry, i.e. recovering more accurate geometries and synthesizing higher quality images of novel views. Extensive experiments against state-of-the-art methods on ShapeNet benchmarks demonstrate the superiority of our approach for single-image novel view synthesis.

Cross-Modality High-Frequency Transformer for MR Image Super-Resolution

  • Chaowei Fang
  • Dingwen Zhang
  • Liang Wang
  • Yulun Zhang
  • Lechao Cheng
  • Junwei Han

Improving the resolution of magnetic resonance (MR) image data is critical to computer-aided diagnosis and brain function analysis. Higher resolution helps to capture more detailed content, but typically induces a lower signal-to-noise ratio and longer scanning times. Consequently, MR image super-resolution has recently become a topic of wide interest. Existing works establish extensive deep models with conventional architectures based on convolutional neural networks (CNNs). In this work, to further advance this research field, we make an early effort to build a Transformer-based MR image super-resolution framework, with careful designs for exploring valuable domain prior knowledge. Specifically, we consider two-fold domain priors including the high-frequency structure prior and the inter-modality context prior, and establish a novel Transformer architecture, called Cross-modality high-frequency Transformer (Cohf-T), to introduce such priors into super-resolving the low-resolution (LR) MR images. Experiments on two datasets indicate that Cohf-T achieves new state-of-the-art performance.

Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image Generation.

  • Xintian Wu
  • Hanbin Zhao
  • Liangli Zheng
  • Shouhong Ding
  • Xi Li

As a challenging task, text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions. Existing methods mainly extract the text information from only one sentence to represent an image, and this text representation greatly affects the quality of the generated image. However, directly utilizing the limited information in one sentence misses some key attribute descriptions, which are crucial for describing an image accurately. To alleviate this problem, we propose an effective text representation method with the complement of attribute information. First, we construct an attribute memory to jointly control the text-to-image generation with the sentence input. Second, we explore two update mechanisms, sample-aware and sample-joint mechanisms, to dynamically optimize a generalized attribute memory. Furthermore, we design an attribute-sentence-joint conditional generator learning scheme to align the feature embeddings among multiple representations, which promotes the cross-modal network training. Experimental results show that the proposed method obtains substantial performance improvements on both the CUB (FID from 14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.

Efficient Multiple Kernel Clustering via Spectral Perturbation

  • Chang Tang
  • Zhenglai Li
  • Weiqing Yan
  • Guanghui Yue
  • Wei Zhang

Clustering is a fundamental task in the machine learning and data mining community. Among existing clustering methods, multiple kernel clustering (MKC) has been widely investigated due to its effectiveness in capturing non-linear relationships among samples. However, most of the existing MKC methods bear intensive computational complexity in learning an optimal kernel and seeking the final clustering partition. In this paper, based on the spectral perturbation theory, we propose an efficient MKC method that reduces the computational complexity from O(n³) to O(nk² + k³), with n and k denoting the number of data samples and the number of clusters, respectively. The proposed method recovers the optimal clustering partition from base partitions by maximizing the eigen gaps to approximate the perturbation errors. An equivalent optimization objective function is introduced to obtain base partitions. Furthermore, a kernel weighting scheme is embedded to capture the diversity among multiple kernels. Finally, the optimal partition, base partitions, and kernel weights are jointly learned in a unified framework. An efficient alternate iterative optimization algorithm is designed to solve the resultant optimization problem. Experimental results on various benchmark datasets demonstrate the superiority of the proposed method when compared to other state-of-the-art ones in terms of both clustering efficacy and efficiency.
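
The eigen-gap quantity at the heart of the spectral-perturbation argument can be illustrated as follows; the toy affinity matrix and the gap definition (difference between the k-th and (k+1)-th largest eigenvalues) are a simplified stand-in for the full optimization over base partitions and kernel weights.

```python
import numpy as np

def kth_eigen_gap(affinity, k):
    """Gap between the k-th and (k+1)-th largest eigenvalues of a symmetric
    affinity matrix; by spectral perturbation arguments, a large gap
    indicates a stable k-cluster structure."""
    w = np.linalg.eigvalsh(affinity)[::-1]     # eigenvalues in descending order
    return w[k - 1] - w[k]

# toy usage on a block-structured affinity with k = 2 clusters
A = np.block([[np.ones((5, 5)), 0.05 * np.ones((5, 5))],
              [0.05 * np.ones((5, 5)), np.ones((5, 5))]])
print(kth_eigen_gap(A, k=2))
```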

DOMFN: A Divergence-Orientated Multi-Modal Fusion Network for Resume Assessment

  • Yang Yang
  • Jingshuai Zhang
  • Fan Gao
  • Xiaoru Gao
  • Hengshu Zhu

In talent management, resume assessment aims to analyze the quality of a job seeker's resume, which can assist recruiters in discovering suitable candidates and, in return, help job seekers improve resume quality. Recent machine learning based methods on large-scale public resume datasets have provided the opportunity for automatic assessment that reduces manual costs. However, most existing approaches are still content-dominated and ignore other valuable information. Inspired by practical resume evaluations that consider both the content and layout, we construct multiple modalities from resumes but face a new challenge: sometimes the performance of multi-modal fusion is even worse than the best uni-modality. In this paper, we experimentally find that this phenomenon is due to cross-modal divergence. Therefore, we need to consider when it is appropriate to perform multi-modal fusion. To address this problem, we design an instance-aware fusion method, i.e., the Divergence-Orientated Multi-Modal Fusion Network (DOMFN), which can adaptively fuse the uni-modal predictions and the multi-modal prediction based on cross-modal divergence. Specifically, DOMFN computes a functional penalty score to measure the divergence of cross-modal predictions. Then, the learned divergence can be used to decide whether to conduct multi-modal fusion and can be adopted into an amended loss for reliable training. Consequently, DOMFN rejects the multi-modal prediction when the cross-modal divergence is too large, avoiding overall performance degradation and thus achieving better performance than uni-modalities. In experiments, qualitative comparison with baselines on a real-world dataset demonstrates the superiority and explainability of the proposed DOMFN; e.g., we find that multi-modal fusion has positive effects when assessing resumes for UI Designer and Enterprise Service positions, whereas it adversely affects the assessment of Technology and Product Operation positions.
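
The divergence-gated decision can be sketched as below, where a symmetric KL divergence between uni-modal predictions stands in for the paper's functional penalty score; the threshold and the max-confidence fallback rule are assumptions made for illustration.

```python
import numpy as np

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two categorical prediction vectors."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def gated_prediction(p_content, p_layout, p_fused, tau=0.5):
    """If the cross-modal divergence is too large, fall back to the more
    confident uni-modal prediction instead of the multi-modal fusion."""
    if sym_kl(p_content, p_layout) > tau:
        return p_content if p_content.max() >= p_layout.max() else p_layout
    return p_fused

# toy usage with softmax-style prediction vectors
print(gated_prediction(np.array([0.8, 0.2]), np.array([0.1, 0.9]), np.array([0.5, 0.5])))
```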

Generative Steganography Network

  • Ping Wei
  • Sheng Li
  • Xinpeng Zhang
  • Ge Luo
  • Zhenxing Qian
  • Qing Zhou

Steganography usually modifies cover media to embed secret data. A new steganographic approach called generative steganography (GS) has emerged recently, in which stego images (images containing secret data) are generated from secret data directly without cover media. However, existing GS schemes are often criticized for their poor performances. In this paper, we propose an advanced generative steganography network (GSN) that can generate realistic stego images without using cover images. We firstly introduce the mutual information mechanism in GS, which helps to achieve high secret extraction accuracy. Our model contains four sub-networks, i.e., an image generator (G), a discriminator (D), a steganalyzer (S), and a data extractor (E). D and S act as two adversarial discriminators to ensure the visual quality and security of generated stego images. E is to extract the hidden secret from generated stego images. The generator G is flexibly constructed to synthesize either cover or stego images with different inputs. It facilitates covert communication by concealing the function of generating stego images in a normal generator. A module named secret block is designed to hide secret data in the feature maps during image generation, with which high hiding capacity and image fidelity are achieved. In addition, a novel hierarchical gradient decay (HGD) skill is developed to resist steganalysis detection. Experiments demonstrate the superiority of our work over existing methods.

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

  • Haiping Wang
  • Yuan Liu
  • Zhen Dong
  • Wenping Wang

In this paper, we propose a novel local descriptor-based framework, called You Only Hypothesize Once (YOHO), for the registration of two unaligned point clouds. In contrast to most existing local descriptors, which rely on a fragile local reference frame to gain rotation invariance, the proposed descriptor achieves rotation invariance through recent techniques of group equivariant feature learning, which brings more robustness to point density and noise. Meanwhile, the descriptor in YOHO also has a rotation-equivariant part, which enables us to estimate the registration from just one correspondence hypothesis. Such a property reduces the search space of feasible transformations, thus greatly improving both the accuracy and the efficiency of YOHO. Extensive experiments show that YOHO achieves superior performance with far fewer RANSAC iterations on four widely-used datasets: the 3DMatch/3DLoMatch datasets, the ETH dataset and the WHU-TLS dataset. More details are shown on our project page: https://hpwang-whu.github.io/YOHO/.

Disentangled Representation Learning for Multimodal Emotion Recognition

  • Dingkang Yang
  • Shuai Huang
  • Haopeng Kuang
  • Yangtao Du
  • Lihua Zhang

Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between different modalities or design sophisticated fusion strategies. However, the serious problem is that the distribution gap and information redundancy often exist across heterogeneous modalities, resulting in learned multimodal representations that may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns the common and private feature representations for each modality. Specifically, we design the common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace aims to explore the commonality among different modalities and reduce the distribution gap sufficiently. The modality-specific subspaces attempt to enhance the diversity and capture the unique characteristics of each modality. After that, a modality discriminator is introduced to guide the parameter learning of the common and private encoders in an adversarial manner. We achieve the modality consistency and disparity constraints by designing tailored losses for the above subspaces. Furthermore, we present a cross-modal attention fusion module to learn adaptive weights for obtaining effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that the FDMER outperforms the state-of-the-art methods on two multimodal emotion recognition benchmarks. Moreover, we further verify the effectiveness of our model via experiments on the multimodal humor detection task.

Relative Alignment Network for Source-Free Multimodal Video Domain Adaptation

  • Yi Huang
  • Xiaoshan Yang
  • Ji Zhang
  • Changsheng Xu

Video domain adaptation aims to transfer knowledge from labeled source videos to unlabeled target videos. Existing video domain adaptation methods require full access to the source videos to reduce the domain gap between the source and target videos, which is impractical in real scenarios where the source videos are unavailable due to transmission efficiency or privacy concerns. To address this problem, in this paper, we propose to solve a source-free domain adaptation task for videos where only a pre-trained source model and unlabeled target videos are available for learning a multimodal video classification model. Existing source-free domain adaptation methods cannot be directly applied to this task, since videos always suffer from domain discrepancy along both the multimodal and temporal aspects, which brings difficulties in domain adaptation especially when the source data are unavailable. In this paper, we propose a Multimodal and Temporal Relative Alignment Network (MTRAN) to deal with the above challenges. To explicitly imitate the domain shifts contained in the multimodal information and the temporal dynamics of the source and target videos, we divide the target videos into two splits according to the self-entropy values of the classification results. The low-entropy videos are deemed source-like while the high-entropy videos are deemed target-like. Then, we adopt a self-entropy-guided MixUp strategy to generate synthetic and hypothetical samples at the instance level based on source-like and target-like videos, and push each synthetic sample to be similar to the corresponding hypothetical sample, which is slightly closer to the source-like videos than the synthetic sample, via multimodal and temporal relative alignment schemes. We evaluate the proposed model on four public video datasets. The results show that our model outperforms existing state-of-the-art methods.
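
A rough sketch of the self-entropy split and MixUp step is given below; the median split, Beta-distributed mixing and function names are assumptions, and the actual MTRAN operates on multimodal and temporal features rather than on raw logits alone.

```python
import torch
import torch.nn.functional as F

def entropy_split(logits):
    """Split target clips by the self-entropy of the source model's predictions:
    the low-entropy half is treated as source-like, the rest as target-like."""
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1)
    order = ent.argsort()
    half = len(order) // 2
    return order[:half], order[half:]            # source-like idx, target-like idx

def mixup_features(src_like, tgt_like, alpha=0.4):
    """Create synthetic samples between source-like and target-like features
    with a Beta-distributed mixing coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    n = min(len(src_like), len(tgt_like))
    return lam * src_like[:n] + (1 - lam) * tgt_like[:n], lam

# toy usage on random logits and features
logits, feats = torch.randn(16, 8), torch.randn(16, 128)
src_idx, tgt_idx = entropy_split(logits)
mixed, lam = mixup_features(feats[src_idx], feats[tgt_idx])
```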

PRO-Face: A Generic Framework for Privacy-preserving Recognizable Obfuscation of Face Images

  • Lin Yuan
  • Linguo Liu
  • Xiao Pu
  • Zhao Li
  • Hongbo Li
  • Xinbo Gao

A number of applications (e.g., video surveillance and authentication) rely on automated face recognition to guarantee the functioning of secure services, and meanwhile have to take into account the privacy of individuals exposed to camera systems. This is the so-called privacy-utility trade-off. However, most existing approaches to facial privacy protection focus on removing identifiable visual information from images, leaving the protected face unrecognizable to machines, which sacrifices utility for privacy. To tackle the privacy-utility challenge, we propose a novel, generic, effective, yet lightweight framework for Privacy-preserving Recognizable Obfuscation of Face images (named PRO-Face). The framework allows one to first process a face image using any preferred obfuscation, such as image blurring, pixelation or face morphing. It then leverages a Siamese network to fuse the original image with its obfuscated form, generating a final protected image that is visually similar to the obfuscated one from a human perspective (for privacy) but is still recognized as the original identity by machines (for utility). The framework supports various obfuscations for facial anonymization. Face recognition can be performed accurately not only across anonymized images but also between plain and anonymized ones, using only pre-trained recognizers. These properties reflect the "generic" merit of the proposed framework. In-depth objective and subjective evaluations demonstrate the effectiveness of the proposed framework in both privacy protection and utility preservation under distinct scenarios. Our source code, models and supplementary materials are made publicly available.

Skeleton-based Action Recognition via Adaptive Cross-Form Learning

  • Xuanhan Wang
  • Yan Dai
  • Lianli Gao
  • Jingkuan Song

Skeleton-based action recognition aims to project skeleton sequences to action categories, where skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods that focus on exploring single-form skeletons via Graph Convolutional Networks (GCNs), existing methods tend to improve GCNs by leveraging multi-form skeletons due to their complementary cues. However, these methods (either adapting structure of GCNs or model ensemble) require the co-existence of all skeleton forms during both training and inference stages, while a typical situation in real life is the existence of only partial forms for inference. To tackle this, we present Adaptive Cross-Form Learning (ACFL), which empowers well-designed GCNs to generate complementary representation from single-form skeletons without changing model capacity. Specifically, each GCN model in ACFL not only learns action representation from the single-form skeletons, but also adaptively mimics useful representations derived from other forms of skeletons. In this way, each GCN can learn how to strengthen what has been learned, thus exploiting model potential and facilitating action recognition as well. Extensive experiments conducted on three challenging benchmarks, i.e., NTU-RGB+D 120, NTU-RGB+D 60 and UAV-Human, demonstrate the effectiveness and generalizability of our method. Specifically, the ACFL significantly improves various GCN models (i.e., CTR-GCN, MS-G3D, and Shift-GCN), achieving a new record for skeleton-based action recognition.

Sample Weighted Multiple Kernel K-means via Min-Max optimization

  • Yi Zhang
  • Weixuan Liang
  • Xinwang Liu
  • Sisi Dai
  • Siwei Wang
  • Liyang Xu
  • En Zhu

A representative multiple kernel clustering (MKC) algorithm, termed simple multiple kernel k-means (SMKKM), was recently proposed to optimally mine useful information from a set of pre-specified kernels to improve clustering performance. Different from the existing min-min learning framework, it adopts a novel min-max optimization manner, which has attracted considerable attention in the community. Despite its encouraging success, we observe that SMKKM only focuses on the combination coefficients among kernels and ignores the relative importance of different samples. As a result, it does not sufficiently consider the different contributions of each sample to clustering, and thus cannot effectively obtain the "ideal" similarity structure, leading to unsatisfactory performance. To address this issue, this paper proposes a novel sample weighted multiple kernel k-means via min-max optimization (SWMKKM), which uses the accumulated relationships between each sample and the others to represent the sample weights. Such a weighting criterion helps the clustering algorithm pay more attention to samples with more positive effects on clustering and avoids unreliably overestimating samples of poor quality. Based on SMKKM, we adopt a reduced gradient algorithm with proven convergence to solve the resultant optimization problem. Comprehensive experiments on multiple benchmark datasets demonstrate that our proposed SWMKKM dramatically improves upon state-of-the-art MKC algorithms, verifying the effectiveness of the proposed sample weighting criterion.

MIntRec: A New Dataset for Multimodal Intent Recognition

  • Hanlei Zhang
  • Hua Xu
  • Xin Wang
  • Qianrui Zhou
  • Shaojie Zhao
  • Jiayan Teng

Multimodal intent recognition is a significant task for understanding human language in real-world multimodal scenes. Most existing intent recognition methods have limitations in leveraging the multimodal information due to the restrictions of the benchmark datasets with only text information. This paper introduces a novel dataset for multimodal intent recognition (MIntRec) to address this issue. It formulates coarse-grained and fine-grained intent taxonomies based on the data collected from the TV series Superstore. The dataset consists of 2,224 high-quality samples with text, video, and audio modalities and has multimodal annotations among twenty intent categories. Furthermore, we provide annotated bounding boxes of speakers in each video segment and achieve an automatic process for speaker annotation. MIntRec is helpful for researchers to mine relationships between different modalities to enhance the capability of intent recognition. We extract features from each modality and model cross-modal interactions by adapting three powerful multimodal fusion methods to build baselines. Extensive experiments show that employing the non-verbal modalities achieves substantial improvements compared with the text-only modality, demonstrating the effectiveness of using multimodal information for intent recognition. The gap between the best-performing methods and humans indicates the challenge and importance of this task for the community. The full dataset and codes are available for use at https://github.com/thuiar/MIntRec.

Adaptive Transformer-Based Conditioned Variational Autoencoder for Incomplete Social Event Classification

  • Zhangming Li
  • Shengsheng Qian
  • Jie Cao
  • Quan Fang
  • Changsheng Xu

With the rapid development of the Internet and the expanding scale of social media, incomplete social event classification has increasingly become a challenging task. The key to incomplete social event classification is accurately leveraging image-level and text-level information. However, most existing approaches suffer from the following limitations: (1) Most generative models use the available features to generate the features of the incomplete modality for social event classification while ignoring the rich semantic label information. (2) The majority of existing multi-modal methods simply concatenate the coarse-grained image features and text features of the event to obtain the multi-modal features for classifying social events, which fails to filter out irrelevant multi-modal features and limits their modeling capabilities. To tackle these challenges, in this paper, we propose an Adaptive Transformer-Based Conditioned Variational Autoencoder Network (AT-CVAE) for incomplete social event classification. In the AT-CVAE, we propose a novel Transformer-based conditioned variational autoencoder to jointly model the textual information, visual information and label information in a unified deep model, which can generate more discriminative latent features and enhance the performance of incomplete social event classification. Furthermore, a Mixture-of-Experts mechanism is utilized to dynamically acquire the weight of each modality's information, which better filters out irrelevant multi-modal information and captures the vitally important information. Extensive experiments are conducted on two public event datasets, demonstrating the superior performance of our AT-CVAE method.

Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

  • Dingkang Yang
  • Haopeng Kuang
  • Shuai Huang
  • Lihua Zhang

Understanding human behaviors and intents from videos is a challenging task. Video flows usually involve time-series data from different modalities, such as natural language, facial gestures, and acoustic information. Due to the variable receiving frequency for sequences from each modality, the collected multimodal streams are usually unaligned. For multimodal fusion of asynchronous sequences, the existing methods focus on projecting multiple modalities into a common latent space and learning the hybrid representations, which neglects the diversity of each modality and the commonality across different modalities. Motivated by this observation, we propose a Multimodal Fusion approach for learning modality-Specific and modality-Agnostic representations (MFSA) to refine multimodal representations and leverage the complementarity across different modalities. Specifically, a predictive self-attention module is used to capture reliable contextual dependencies and enhance the unique features over the modality-specific spaces. Meanwhile, we propose a hierarchical cross-modal attention module to explore the correlations between cross-modal elements over the modality-agnostic space. In this case, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, the modality-specific and -agnostic multimodal representations are used together for downstream tasks. Comprehensive experiments on three multimodal datasets clearly demonstrate the superiority of our approach.

DoF-NeRF: Depth-of-Field Meets Neural Radiance Fields

  • Zijin Wu
  • Xingyi Li
  • Juewen Peng
  • Hao Lu
  • Zhiguo Cao
  • Weicai Zhong

Neural Radiance Field (NeRF) and its variants have exhibited great success in representing 3D scenes and synthesizing photo-realistic novel views. However, they are generally based on the pinhole camera model and assume all-in-focus inputs. This limits their applicability, as images captured in the real world often have finite depth-of-field (DoF). To mitigate this issue, we introduce DoF-NeRF, a novel neural rendering approach that can deal with shallow DoF inputs and can simulate the DoF effect. In particular, it extends NeRF to simulate the aperture of the lens following the principles of geometric optics. Such a physical grounding allows DoF-NeRF to operate on views with different focus configurations. Benefiting from explicit aperture modeling, DoF-NeRF also enables direct manipulation of the DoF effect by adjusting virtual aperture and focus parameters. It is plug-and-play and can be inserted into NeRF-based frameworks. Experiments on synthetic and real-world datasets show that DoF-NeRF not only performs comparably with NeRF in the all-in-focus setting, but can also synthesize all-in-focus novel views conditioned on shallow DoF inputs. An interesting application of DoF-NeRF to DoF rendering is also demonstrated. The source code will be made available at: https://github.com/zijinwuzijin/DoF-NeRF.

RKformer: Runge-Kutta Transformer with Random-Connection Attention for Infrared Small Target Detection

  • Mingjin Zhang
  • Haichen Bai
  • Jing Zhang
  • Rui Zhang
  • Chaoyue Wang
  • Jie Guo
  • Xinbo Gao

Infrared small target detection (IRSTD) refers to segmenting small targets from infrared images, which is of great significance in practical applications. However, due to the small scale of targets as well as noise and clutter in the background, current deep neural network-based methods struggle to extract features with discriminative semantics while preserving fine details. In this paper, we address this problem by proposing a novel RKformer model with an encoder-decoder structure, where four specifically designed Runge-Kutta transformer (RKT) blocks are stacked sequentially in the encoder. Technically, it has three key designs. First, we adopt a parallel encoder block (PEB) of the transformer and convolution to exploit their respective advantages in long-range dependency modeling and locality modeling for extracting semantics and preserving details. Second, we propose a novel random-connection attention (RCA) block, which has a reservoir structure to learn sparse attention via random connections during training. RCA encourages the target to attend to sparse relevant positions instead of all the large-area background pixels, resulting in more informative attention scores. It has fewer parameters and computations than the original self-attention in the transformer while performing better. Third, inspired by neural ordinary differential equations (ODEs), we stack two PEBs with several residual connections as the basic encoder block to implement the Runge-Kutta method for solving ODEs, which can effectively enhance features and suppress noise. Experiments on the public NUAA-SIRST dataset and IRSTD-1k dataset demonstrate the superiority of the RKformer over state-of-the-art methods.
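
To make the ODE-inspired design concrete, here is a minimal, hedged sketch of how two sub-blocks can be composed with a Heun-style second-order Runge-Kutta update; the ConvPEB stand-in, channel count, and unit step size are assumptions, and the real RKT block additionally uses parallel transformer-convolution branches and random-connection attention, which are not reproduced here.

```python
import torch
import torch.nn as nn

class ConvPEB(nn.Module):
    """Stand-in for the paper's parallel encoder block (PEB): a single conv
    layer here, whereas the real block mixes transformer and convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.conv(x))

class RK2Block(nn.Module):
    """Heun-style second-order Runge-Kutta composition of two sub-blocks,
    viewing the residual branch as the derivative of an ODE with unit step:
        k1 = F1(x),  k2 = F2(x + k1),  out = x + (k1 + k2) / 2
    """
    def __init__(self, channels):
        super().__init__()
        self.f1 = ConvPEB(channels)
        self.f2 = ConvPEB(channels)

    def forward(self, x):
        k1 = self.f1(x)
        k2 = self.f2(x + k1)
        return x + 0.5 * (k1 + k2)

feat = torch.randn(2, 32, 64, 64)        # (batch, channels, height, width)
print(RK2Block(32)(feat).shape)          # torch.Size([2, 32, 64, 64])
```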

Self-Supervised Human Pose based Multi-Camera Video Synchronization

  • Liqiang Yin
  • Ruize Han
  • Wei Feng
  • Song Wang

Multi-view video collaborative analysis is an important task and has many applications in the multimedia community. However, it always requires the given multiple videos to be temporally synchronized. Existing methods commonly synchronize the videos via wired communication, which may hinder practical application in the real world, especially for moving cameras. In this paper, we focus on human-centric video analysis and propose a self-supervised framework for automatic multi-camera video synchronization. Specifically, we develop SeSyn-Net with the 2D human pose as input for feature embedding and design a series of self-supervised losses to effectively extract the view-invariant but time-discriminative representation for video synchronization. We also build two new datasets for performance evaluation. Extensive experimental results verify the effectiveness of our method, which achieves superior performance compared to both classical and state-of-the-art methods.

Energy-Based Domain Generalization for Face Anti-Spoofing

  • Zhekai Du
  • Jingjing Li
  • Lin Zuo
  • Lei Zhu
  • Ke Lu

With various unforeseeable face presentation attacks (PA) springing up, face anti-spoofing (FAS) urgently needs to generalize to unseen scenarios. Research on generalizable FAS has lately attracted growing attention. Existing methods cast FAS as a vanilla binary classification problem and address it with a standard discriminative classifier p(y|x) under a domain generalization framework. However, discriminative models are unreliable for samples far away from the training distribution. In this paper, we resort to an energy-based model (EBM) to tackle FAS from a generative perspective. Our motivation is to model the joint density p(x,y), which allows us to compute not only p(y|x) but also p(x). Due to the intractability of direct modeling, we use EBMs as an alternative for probabilistic estimation. With energy-based training, real faces are encouraged to obtain low free energy associated with the marginal probability p(x) of real faces, and all samples with high free energy are regarded as fake faces, thus rejecting any kind of PA outside the distribution of real faces. To learn to generalize to unseen domains, we generate diverse and novel populations in feature space under the guidance of the energy model. Our model is updated in a meta-learning scheme, where the original source samples are utilized for meta-training and the generated ones for meta-testing. We validate our method on four widely used FAS datasets. Comprehensive experimental results demonstrate the effectiveness of our method compared with state-of-the-art methods.
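
As a hedged illustration of the energy-based view (not the paper's full training procedure), the free energy of an input can be read off a classifier's logits as the negative log-sum-exp, and a threshold on it acts as a reject rule for out-of-distribution presentation attacks; the tiny backbone, threshold value, and input shape below are assumptions for the example.

```python
import torch
import torch.nn as nn

def free_energy(logits):
    """Free energy of an input when a classifier is read as an energy-based
    model of p(x, y): E(x) = -log sum_y exp(f_y(x)). Low energy corresponds
    to high (unnormalized) marginal density p(x); high energy suggests the
    sample lies outside the modeled distribution of real faces."""
    return -torch.logsumexp(logits, dim=1)

# Toy usage with a hypothetical two-logit backbone.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
x = torch.randn(4, 3, 32, 32)
energy = free_energy(backbone(x))     # shape (4,)
tau = 0.0                             # threshold fit on real faces (assumed)
is_attack = energy > tau              # flag high-energy samples as PAs
print(energy, is_attack)
```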

Revisiting Stochastic Learning for Generalizable Person Re-identification

  • Jiajian Zhao
  • Yifan Zhao
  • Xiaowu Chen
  • Jia Li

Generalizable person re-identification aims to achieve a good generalization capability on target domains without accessing target data. Existing methods focus on suppressing domain-specific information or simulating unseen environments with meta-learning strategies, which could damage the ability to capture fine-grained visual patterns or lead to overfitting issues through the repetitive training of episodes. In this paper, we revisit stochastic behaviors from two different perspectives: 1) Stochastic splitting-sliding sampler. It splits domain sources into approximately equal sample-size subsets and selects several subsets from various sources with a sliding window, forcing the model to step out of local minima under stochastic sources. 2) Variance-varying gradient dropout. Gradients in parts of the network are also selected by a sliding window and multiplied by binary masks generated from a Bernoulli distribution, giving the gradients varying variance and preventing the model from getting stuck in local minima. By applying these two proposed stochastic behaviors, the model achieves better generalization performance on unseen target domains without any additional computation costs or auxiliary modules. Extensive experiments demonstrate that our proposed model is effective and outperforms state-of-the-art methods on public domain generalizable person Re-ID benchmarks.
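
A minimal, hedged sketch of the gradient-dropout idea is shown below: after the backward pass, gradients of a chosen window of parameters are multiplied by Bernoulli masks (rescaled so the expected gradient is unchanged). The keep probability, the window choice, and the toy model are assumptions; the paper's sliding-window schedule is not reproduced.

```python
import torch
import torch.nn as nn

def gradient_dropout_(params, keep_prob=0.8):
    """Multiply the gradients of the given parameters in place by
    Bernoulli(keep_prob) masks, rescaled by 1/keep_prob so the gradient is
    unbiased; the injected variance perturbs the descent direction."""
    for p in params:
        if p.grad is not None:
            mask = torch.bernoulli(torch.full_like(p.grad, keep_prob))
            p.grad.mul_(mask / keep_prob)

# Toy usage: apply the dropout to one selected layer after backward().
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
opt.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
gradient_dropout_(model[2].parameters(), keep_prob=0.7)  # chosen window
opt.step()
```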

D2Animator: Dual Distillation of StyleGAN For High-Resolution Face Animation

  • Zhuo Chen
  • Chaoyue Wang
  • Haimei Zhao
  • Bo Yuan
  • Xiu Li

The style-based generator architectures (e.g. StyleGAN v1, v2) largely promote the controllability and explainability of Generative Adversarial Networks (GANs). Many researchers have applied the pretrained style-based generators to image manipulation and video editing by exploring the correlation between linear interpolation in the latent space and semantic transformation in the synthesized image manifold. However, most previous studies focused on manipulating separate discrete attributes, which is insufficient to animate a still image to generate videos with complex and diverse poses and expressions. In this work, we devise a dual distillation strategy (D2Animator) for generating animated high-resolution face videos conditioned on identities and poses from different images. Specifically, we first introduce a Clustering-based Distiller (CluDistiller) to distill diverse interpolation directions in the latent space, and synthesize identity-consistent faces with various poses and expressions, such as blinking, frowning, looking up/down, etc. Then we propose an Augmentation-based Distiller (AugDistiller) that learns to encode arbitrary face deformation into a combination of interpolation directions via training on augmentation samples synthesized by CluDistiller. Through assembling the two distillation methods, D2Animator can generate high-resolution face animation videos without training on video sequences. Extensive experiments on self-driving, cross-identity and sequence-driving tasks demonstrate the superiority of the proposed D2Animator over existing StyleGAN manipulation and face animation methods in both generation quality and animation fidelity.

Adaptive Hierarchical Pooling for Weakly-supervised Sound Event Detection

  • Lijian Gao
  • Ling Zhou
  • Qirong Mao
  • Ming Dong

In Weakly-supervised Sound Event Detection (WSED), the ground truth of training data contains the presence or absence of each sound event only at the clip level (i.e., no frame-level annotations). Recently, WSED has been formulated under the multi-instance learning framework, and a critical component within this formulation is the design of the temporal pooling function. In this paper, we propose an adaptive hierarchical pooling method (HiPool) for WSED, which combines the advantages of max pooling in audio tagging and weighted average pooling in audio localization through a novel hierarchical structure, and learns event-wise optimal pooling functions through continuous relaxation-based joint optimization. Extensive experiments on benchmark datasets show that HiPool outperforms current pooling methods and greatly improves the performance of WSED. HiPool also has great generality: it is ready to be plugged into any WSED model.
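
To illustrate the design space between the two pooling extremes (not HiPool's actual hierarchical structure), the hedged sketch below uses a softmax-weighted temporal pooling with a learnable, event-wise temperature: a near-zero temperature recovers average pooling, while a large one approaches max pooling. The shapes and the per-event parameterization are assumptions for this example.

```python
import torch

def adaptive_temporal_pool(frame_probs, beta):
    """Softmax-weighted pooling of frame-level event probabilities
    (batch x frames x events) into clip-level probabilities. beta -> 0 gives
    average pooling; large beta approaches max pooling; a learnable,
    event-wise beta lets each event pick its own operating point."""
    weights = torch.softmax(beta * frame_probs, dim=1)   # weights over frames
    return (weights * frame_probs).sum(dim=1)            # batch x events

frame_probs = torch.rand(2, 100, 10)          # 2 clips, 100 frames, 10 events
beta = torch.nn.Parameter(torch.ones(10))     # one temperature per event
clip_probs = adaptive_temporal_pool(frame_probs, beta)
print(clip_probs.shape)                       # torch.Size([2, 10])
```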

Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation

  • Juze Zhang
  • Jingya Wang
  • Ye Shi
  • Fei Gao
  • Lan Xu
  • Jingyi Yu

Inter-person occlusion and depth ambiguity make estimating the camera-centric 3D poses of multiple persons from monocular images a challenging problem. Typical top-down frameworks suffer from high computational redundancy with an additional detection stage. By contrast, bottom-up methods enjoy low computational costs as they are less affected by the number of humans. However, most existing bottom-up methods treat camera-centric 3D human pose estimation as two unrelated subtasks: 2.5D pose estimation and camera-centric depth estimation. In this paper, we propose a unified model that leverages the mutual benefits of both subtasks. Within the framework, a robust structured 2.5D pose estimation is designed to recognize inter-person occlusion based on depth relationships. Additionally, we develop an end-to-end geometry-aware depth reasoning method that exploits the mutual benefits of both 2.5D pose and camera-centric root depths. This method first uses 2.5D pose and geometry information to infer camera-centric root depths in a forward pass, and then exploits the root depths to further improve representation learning of 2.5D pose estimation in a backward pass. Further, we design an adaptive fusion scheme that leverages both visual perception and body geometry to alleviate inherent depth ambiguity issues. Extensive experiments demonstrate the superiority of our proposed model over a wide range of bottom-up methods. Our accuracy is even competitive with top-down counterparts. Notably, our model runs much faster than existing bottom-up and top-down methods.

Learning Generalizable Latent Representations for Novel Degradations in Super-Resolution

  • Fengjun Li
  • Xin Feng
  • Fanglin Chen
  • Guangming Lu
  • Wenjie Pei

Typical methods for blind image super-resolution (SR) focus on dealing with unknown degradations by directly estimating them or learning degradation representations in a latent space. A potential limitation of these methods is that they assume the unknown degradations can be simulated by combining various handcrafted degradations (e.g., bicubic downsampling), which is not necessarily true. Real-world degradations can lie beyond what handcrafted degradations can simulate; we refer to these as novel degradations. In this work, we propose to learn a latent representation space for degradations that can be generalized from handcrafted (base) degradations to novel degradations. Furthermore, we perform variational inference to match the posterior of degradations in the latent representation space with a prior distribution (e.g., a Gaussian distribution). Consequently, we are able to sample more high-quality representations for a novel degradation to augment the training data for the SR model. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness and advantages of our method for blind super-resolution with novel degradations.
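
A minimal, hedged sketch of the variational-inference step is given below: a toy degradation encoder outputs a Gaussian posterior, a latent representation is drawn with the reparameterization trick, and a KL term pulls the posterior toward N(0, I) so that new representations can later be sampled from the prior to augment training. The architecture, latent dimension, and names are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Toy encoder mapping a low-resolution patch to a Gaussian posterior
    over a latent degradation representation."""
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_mu = nn.Linear(32, dim)
        self.to_logvar = nn.Linear(32, dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparam.
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return z, kl

enc = DegradationEncoder()
z, kl = enc(torch.randn(4, 3, 32, 32))
novel_z = torch.randn(4, 64)   # sample from the prior to mimic novel degradations
print(z.shape, kl.item(), novel_z.shape)
```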

Rethinking the Vulnerability of DNN Watermarking: Are Watermarks Robust against Naturalness-aware Perturbations?

  • Run Wang
  • Haoxuan Li
  • Lingzhou Mu
  • Jixing Ren
  • Shangwei Guo
  • Li Liu
  • Liming Fang
  • Jing Chen
  • Lina Wang

Training Deep Neural Networks (DNN) is a time-consuming process and requires a large amount of training data, which motivates studies on protecting the intellectual property (IP) of DNN models with various watermarking techniques. Unfortunately, in recent years, adversaries have been exploiting the vulnerabilities of the employed watermarking techniques to remove the embedded watermarks. In this paper, we investigate and introduce a novel watermark removal attack, called AdvNP, against all four existing types of DNN watermarking schemes via input preprocessing by injecting Adversarial Naturalness-aware Perturbations. In contrast to prior studies, our proposed method is the first that generalizes well to all four existing watermarking schemes without involving any model modification, which preserves the fidelity of the target model. We conduct experiments against four state-of-the-art (SOTA) watermarking schemes on two real tasks (i.e., image classification on ImageNet and face recognition on CelebA) across multiple DNN models. Overall, our proposed AdvNP significantly invalidates the watermarks of the four watermarking schemes on the two real-world datasets, i.e., a 60.9% average attack success rate and up to 97% in the worst case. Moreover, our AdvNP survives image denoising techniques well and outperforms the baseline in both fidelity preservation and watermark removal. Furthermore, we introduce two defense methods to enhance the robustness of DNN watermarking against our AdvNP. Our experimental results pose real threats to the existing watermarking schemes and call for more practical and robust watermarking techniques to protect the copyright of pre-trained DNN models. The source code and models are available at https://github.com/GitKJ123/AdvNP.

In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

  • Xiao Pan
  • Peike Li
  • Zongxin Yang
  • Huiling Zhou
  • Chang Zhou
  • Hongxia Yang
  • Jingren Zhou
  • Yi Yang

In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes either at the image level or at the pixel level. Image-level optimization (e.g., on the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal since the pixel-level features are optimized only implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of training data and is not robust to object deformation. To complementarily perform these two levels of optimization in a unified framework, we propose In-aNd-Out (INO) generative learning from a purely generative perspective with the help of the naturally designed class tokens and patch tokens in the Vision Transformer (ViT). Specifically, for image-level optimization, we force the out-view imagination from local to global views on class tokens, which helps capture high-level semantics, and we name it out-generative learning. As for pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure, and we term it in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous state-of-the-art methods by significant margins.

Everything is There in Latent Space: Attribute Editing and Attribute Style Manipulation by StyleGAN Latent Space Exploration

  • Rishubh Parihar
  • Ankit Dhiman
  • Tejan Karmali
  • Venkatesh R

Unconstrained image generation with high realism is now possible using recent Generative Adversarial Networks (GANs). However, it is quite challenging to generate images with a given set of attributes. Recent methods use style-based GAN models to perform image editing by leveraging the semantic hierarchy present in the layers of the generator. We present Few-shot Latent-based Attribute Manipulation and Editing (FLAME), a simple yet effective framework to perform highly controlled image editing by latent space manipulation. Specifically, we estimate linear directions in the latent space (of a pre-trained StyleGAN) that control semantic attributes in the generated image. In contrast to previous methods that either rely on large-scale attribute-labeled datasets or attribute classifiers, FLAME uses minimal supervision of a few curated image pairs to estimate disentangled edit directions. FLAME can perform both individual and sequential edits with high precision on a diverse set of images while preserving identity. Further, we propose a novel task of Attribute Style Manipulation to generate diverse styles for attributes such as eyeglasses and hair. We first encode a set of synthetic images of the same identity but with different attribute styles in the latent space to estimate an attribute style manifold. Sampling a new latent from this manifold results in a new attribute style in the generated image. We propose a novel sampling method to sample latents from the manifold, enabling us to generate a diverse set of attribute styles beyond the styles present in the training set. FLAME can generate diverse attribute styles in a disentangled manner. We illustrate the superior performance of FLAME against previous image editing methods through extensive qualitative and quantitative comparisons. FLAME generalizes well on out-of-distribution images from the art domain as well as on other datasets such as cars and churches.

An Image-to-video Model for Real-Time Video Enhancement

  • Dongyu She
  • Kun Xu

Recent years have witnessed the increasing popularity of learning-based methods for enhancing the color and tone of images. Although these methods achieve satisfying performance on static images, it is non-trivial to extend such image-to-image methods to handle videos. A straightforward extension would easily lead to computational inefficiency or distracting flickering effects. In this paper, we propose a novel image-to-video model enforcing temporal stability for real-time video enhancement, which is trained using only static images. Specifically, we first propose a lightweight image enhancer via learnable flexible 2-dimensional lookup tables (F2D LUTs), which can consider scenario information adaptively. To impose temporal consistency, we further propose to infer motion fields via a virtual camera motion engine, which can be utilized to stabilize the image-to-video model with a temporal consistency loss. Experimental results show that our image-to-video model not only achieves state-of-the-art performance on the image enhancement task, but also performs favorably against baselines on the video enhancement task. Our source code is available at https://github.com/shedy-pub/I2VEnhance.

Learning an Inference-accelerated Network from a Pre-trained Model with Frequency-enhanced Feature Distillation

  • Xuesong Niu
  • Jili Gu
  • Guoxin Zhang
  • Pengfei Wan
  • Zhongyuan Wang

Convolution neural networks (CNNs) have achieved great success in various computer vision tasks, but they still suffer from heavy computation costs, which mainly result from the substantial redundancy of the feature maps. To reduce this redundancy, we propose a simple but effective frequency-enhanced feature distillation strategy to train an inference-accelerated network from a pre-trained model. Traditionally, a CNN can be regarded as a hierarchical structure, which generates low-level, middle-level and high-level feature maps from different convolution layers. To accelerate the inference time of CNNs, in this paper, we propose to resize the low-level and middle-level feature maps to smaller scales to reduce the spatial computation costs of CNNs. A frequency-enhanced feature distillation training strategy with a pre-trained model is then used to help the inference-accelerated network maintain the core information after resizing the feature maps. To be specific, the original pre-trained network and the inference-accelerated network with resized feature maps are regarded as the teacher network and student network, respectively. Considering that the low-frequency components of the feature maps contribute the most to the final classification, we transform the feature maps of different levels into a frequency-enhanced feature space, which highlights the low-frequency features for both the teacher and student networks. The frequency-enhanced features are used to transfer the knowledge from the teacher network to the student network. At the same time, knowledge for the final classification, i.e., the classification feature and predicted probabilities, is also used for distillation. Experiments on multiple databases based on various network structure types, e.g., ResNet, Res2Net, MobileNetV2, and ConvNeXt, have shown that with the proposed frequency-enhanced feature distillation training strategy, our method can obtain an inference-accelerated network with comparable performance and much lower computation cost.
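
The exact frequency transform used by the authors is not detailed in the abstract, so the following hedged sketch assumes a 2D DCT and keeps only the low-frequency block of teacher and student feature maps before applying an MSE distillation loss; the shapes, the keep size, and the square-map assumption are all illustrative.

```python
import math
import torch
import torch.nn.functional as F

def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n."""
    k = torch.arange(n).unsqueeze(1).float()
    i = torch.arange(n).unsqueeze(0).float()
    m = torch.cos(math.pi * (i + 0.5) * k / n) * math.sqrt(2.0 / n)
    m[0, :] = 1.0 / math.sqrt(n)
    return m

def low_freq_distill_loss(f_student, f_teacher, keep=8):
    """Project square feature maps onto the 2D DCT basis, keep the top-left
    (low-frequency) keep x keep coefficients, and penalize their MSE.
    Spatial sizes may differ, so each map uses its own basis."""
    def low_freq(feat):
        n = feat.shape[-1]                    # assumes H == W
        d = dct_matrix(n).to(feat)
        coeff = d @ feat @ d.t()              # 2D DCT over the last two dims
        return coeff[..., :keep, :keep]
    return F.mse_loss(low_freq(f_student), low_freq(f_teacher))

f_teacher = torch.randn(2, 64, 32, 32)        # teacher feature map
f_student = torch.randn(2, 64, 16, 16)        # resized student feature map
print(low_freq_distill_loss(f_student, f_teacher, keep=8).item())
```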

Exploring Feature Compensation and Cross-level Correlation for Infrared Small Target Detection

  • Mingjin Zhang
  • Ke Yue
  • Jing Zhang
  • Yunsong Li
  • Xinbo Gao

Single frame infrared small target (SIRST) detection is useful for many practical applications, such as maritime rescue. However, SIRST detection is challenging due to the low contrast between small targets and the noisy background in infrared images. To address this challenge, we propose a novel FC3-Net that explores feature compensation and cross-level correlation for SIRST detection. Specifically, FC3-Net consists of a Fine-detail guided Multi-level Feature Compensation (F-MFC) module and a Cross-level Feature Correlation (CFC) module. The F-MFC module aims to compensate for the loss of detail information caused by the downsampling layers in convolutional neural networks (CNN) by aggregating features from multiple adjacent levels, so that the detail features of small targets can be propagated to the deeper layers of the network. Besides, to suppress the adverse impact of background noise, the CFC module constructs an energy filtering kernel based on the higher-level features with less background noise to filter out the noise in the middle-level features, and fuses them with the low-level ones to learn a strong target representation. Putting them together in an encoder-decoder structure, our FC3-Net can produce an accurate target mask with fine shape and details. Experimental results on the public NUAA-SIRST and IRSTD-1k datasets demonstrate that the proposed FC3-Net outperforms state-of-the-art methods in terms of both pixel-level and object-level metrics. The code will be released at https://github.com/IPIC-Lab/SIRST-Detection-FC3-Net.

Pixel Exclusion: Uncertainty-aware Boundary Discovery for Active Cross-Domain Semantic Segmentation

  • Fuming You
  • Jingjing Li
  • Zhi Chen
  • Lei Zhu

Unsupervised Domain Adaptation (UDA) has been shown to alleviate the heavy annotation burden for semantic segmentation. Recently, numerous self-training approaches have been proposed to address the challenging cross-domain semantic segmentation problem. However, there still exist two open issues: (1) The generated pseudo-labels are inevitably noisy without external supervision. (2) There is a performance gap between UDA models and the fully-supervised model. In this paper, we propose to investigate Active Learning (AL), which selects a small portion of unlabeled pixels (or images) to be annotated and leads to an impressive performance gain. Specifically, we propose a novel Uncertainty-aware Boundary Discovery (UBD) strategy that selects uncertain pixels in boundary areas that contain rich contextual information. Technically, we first select the pixels with top entropy values, and then re-select the pixels that are exclusive to their neighbors. We leverage the Kullback-Leibler divergence between one pixel's softmax prediction and its neighbors' to measure its "exclusivity". Extensive experiments show that our approach outperforms previous methods with both pixel-level and image-level label acquisition protocols.
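
A hedged sketch of the two-stage selection described above is given below: pixels with the highest prediction entropy are kept first, and among them those with the largest mean KL divergence to their 4-neighbours are re-selected. The entropy fraction, annotation budget, and 4-neighbourhood are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def select_uncertain_boundary_pixels(probs, top_entropy_frac=0.05, budget=1000):
    """probs: (C, H, W) softmax prediction. Stage 1 keeps the highest-entropy
    pixels; stage 2 re-selects, among them, those most 'exclusive' to their
    4-neighbourhood, measured by the mean KL(p_pixel || p_neighbour).
    Returns a boolean (H, W) mask of pixels to annotate."""
    c, h, w = probs.shape
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(0)           # (H, W)

    logp = probs.clamp_min(1e-8).log()
    kl = torch.zeros(h, w)
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        q = torch.roll(probs, shifts=(dy, dx), dims=(1, 2)).clamp_min(1e-8)
        kl += (probs * (logp - q.log())).sum(0) / 4.0

    k1 = max(1, int(top_entropy_frac * h * w))
    cand = entropy.flatten().topk(k1).indices                         # stage 1
    k2 = min(budget, k1)
    chosen = cand[kl.flatten()[cand].topk(k2).indices]                # stage 2
    mask = torch.zeros(h * w, dtype=torch.bool)
    mask[chosen] = True
    return mask.view(h, w)

logits = torch.randn(19, 64, 128)                   # toy 19-class prediction
mask = select_uncertain_boundary_pixels(F.softmax(logits, dim=0))
print(mask.sum().item(), "pixels selected")
```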

Deep Flexible Structure Preserving Image Smoothing

  • Mingjia Li
  • Yuanbin Fu
  • Xinhui Li
  • Xiaojie Guo

Structure preserving image smoothing is fundamental to numerous multimedia, computer vision, and graphics tasks. This paper develops a deep network designed for flexible control, structure preservation in smoothing, and efficiency. Following the principle of divide-and-rule, we decouple the original problem into two specific functionalities, i.e., controllable guidance prediction and image smoothing conditioned on the predicted guidance. Concretely, for flexibly adjusting the strength of smoothness, we customize a two-branch module equipped with a sluice mechanism, which enables altering the strength during inference over a fixed range from 0 (full smoothing) to 1 (no smoothing). Moreover, we build a UNet-in-UNet structure with carefully designed loss terms to seek visually pleasant smoothing results without paired data involved in training. As a consequence, our method can produce promising smoothing results with structures well preserved at arbitrary levels through a compact model with 0.6M parameters, making it attractive for practical use. Quantitative and qualitative experiments are provided to reveal the efficacy of our design and demonstrate its superiority over other competitors. The code can be found at https://github.com/lime-j/DeepFSPIS.

Defending Physical Adversarial Attack on Object Detection via Adversarial Patch-Feature Energy

  • Taeheon Kim
  • Youngjoon Yu
  • Yong Man Ro

Object detection plays an important role in security-critical systems such as autonomous vehicles but has been shown to be vulnerable to adversarial patch attacks. Existing defense methods are restricted to localized noise patches and work by removing noisy regions in the input image. However, adversarial patches have developed into natural-looking patterns that evade existing defenses. To address this issue, we propose a defense method based on a novel concept, "Adversarial Patch-Feature Energy" (APE), which exploits common deep feature characteristics of an adversarial patch. Our proposed defense consists of APE-masking and APE-refinement, which can be employed to defend against any adversarial patch in the literature. Extensive experiments demonstrate that APE-based defense achieves impressive robustness against adversarial patches both in the digital space and in the physical world.

Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content

  • Shankhanil Mitra
  • Rajiv Soundararajan

Completely blind video quality assessment (VQA) refers to a class of quality assessment methods that do not use any reference videos, human opinion scores or training videos from the target database to learn a quality model. The design of this class of methods is particularly important since it can allow for superior generalization in performance across various datasets. We consider the design of completely blind VQA for user generated content. While several deep feature extraction methods have been considered in supervised and weakly supervised settings, such approaches have not been studied in the context of completely blind VQA. We bridge this gap by presenting a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. In particular, we capture the common information between frame differences and frames by treating them as a pair of views and similarly obtain the shared representations between frame differences and optical flow. The resulting features are then compared with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores. Code will be made available at https://github.com/Shankhanil006/VISION.
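
As a hedged illustration of the multiview contrastive idea (not the authors' exact objective), the sketch below computes a symmetric InfoNCE loss between embeddings of frames and of the corresponding frame differences, so the encoders retain only the information shared by the two views; the embedding dimension, batch size, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two batches of view embeddings: matched
    pairs (same clip) are pulled together, all other pairs in the batch act
    as negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

frame_feat = torch.randn(16, 128)      # hypothetical frame embeddings
diff_feat = torch.randn(16, 128)       # embeddings of frame differences
print(info_nce(frame_feat, diff_feat).item())
```

The same construction applies to the frame-difference/optical-flow pair mentioned above.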

Compound Batch Normalization for Long-tailed Image Classification

  • Lechao Cheng
  • Chaowei Fang
  • Dingwen Zhang
  • Guanbin Li
  • Gang Huang

Significant progress has been made in learning image classification neural networks under long-tail data distributions using robust training algorithms such as data re-sampling, re-weighting, and margin adjustment. Those methods, however, ignore the impact of data imbalance on feature normalization. The dominance of majority classes (head classes) in estimating statistics and affine parameters causes internal covariate shifts within less-frequent categories to be overlooked. To alleviate this challenge, we propose a compound batch normalization method based on a Gaussian mixture. It can model the feature space more comprehensively and reduce the dominance of head classes. In addition, a moving average-based expectation maximization (EM) algorithm is employed to estimate the statistical parameters of the multiple Gaussian distributions. However, the EM algorithm is sensitive to initialization and can easily become stuck in local minima where the multiple Gaussian components continue to focus on majority classes. To tackle this issue, we develop a dual-path learning framework that employs class-aware split feature normalization to diversify the estimated Gaussian distributions, allowing the Gaussian components to fit the training samples of less-frequent classes more comprehensively. Extensive experiments on commonly used datasets demonstrate that the proposed method outperforms existing methods on long-tailed image classification.
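
A rough, hedged sketch of what a mixture-based normalization forward pass could look like is shown below: each sample is assigned soft responsibilities over K Gaussian components and is normalized by the responsibility-weighted component statistics. It omits the affine parameters, the moving-average EM updates, and the dual-path training described above; all shapes and initial values are illustrative.

```python
import torch

def compound_batch_norm(x, means, logvars, log_weights, eps=1e-5):
    """Normalize features with a mixture of K Gaussians instead of a single
    mean/variance. x: (N, D); means, logvars: (K, D); log_weights: (K,)."""
    diff = x.unsqueeze(1) - means.unsqueeze(0)                     # (N, K, D)
    log_prob = -0.5 * ((diff ** 2) / logvars.exp().unsqueeze(0)
                       + logvars.unsqueeze(0)).sum(-1)             # (N, K)
    resp = torch.softmax(log_prob + log_weights, dim=1)            # E-step
    mu = resp @ means                                              # (N, D)
    var = resp @ logvars.exp()
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(8, 16)
means = torch.randn(3, 16) * 0.1          # K = 3 components
logvars = torch.zeros(3, 16)
log_w = torch.log(torch.full((3,), 1.0 / 3))
print(compound_batch_norm(x, means, logvars, log_w).shape)   # (8, 16)
```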

Alleviating Style Sensitivity then Adapting: Source-free Domain Adaptation for Medical Image Segmentation

  • Yalan Ye
  • Ziqi Liu
  • Yangwuyong Zhang
  • Jingjing Li
  • Hengtao Shen

Recently, source-free domain adaptation (SFDA) has attracted extensive attention in medical image segmentation due to its ability to transfer knowledge without accessing source data. However, existing SFDA methods suffer from severe performance degradation since the style of the target data shifts from the source. Although traditional unsupervised domain adaptation (UDA) methods are capable of addressing the style-shift issue using data from both domains, they fail to extract the source style due to the lack of source data in source-free scenarios. In this paper, we propose a novel style-insensitive source-free domain adaptation framework (SI-SFDA) for medical image segmentation to reduce the impact of style shifts. The proposed framework first pretrains a generalized source model and then adapts the source model in a source-data-free manner. For the former, a cross-patch style generalization (CPSG) mechanism is introduced to reduce the style sensitivity of the source model via a self-training paradigm with a Transformer structure. For the latter, an adaptive confidence regularization (ACR) loss with a dynamic scaling strategy is developed to further reduce the classification confusion caused by style shifts. The proposed ACR loss is model-independent, so it can be used with other methods to improve segmentation performance. Extensive experiments are conducted on five public medical image benchmarks; the promising performance on organ and fundus segmentation tasks demonstrates the effectiveness of our framework.

Multimedia Event Extraction From News With a Unified Contrastive Learning Framework

  • Jian Liu
  • Yufeng Chen
  • Jinan Xu

Extracting events from news has seen many benefits in downstream applications. Today's event extraction (EE) systems, however, usually focus on a single modality --- either text or images --- and such methods suffer from incomplete information because a news document is typically presented in a multimedia format. In this paper, we propose a new method for multimedia EE that bridges the textual and visual modalities with a unified contrastive learning framework. Our central idea is to create a shared space for texts and images in order to make their representations more similar. This is accomplished by training on text-image pairs in general, and we demonstrate that it is possible to use this framework to boost learning for one modality by exploiting the complementary information of the other modality. On the benchmark dataset, our approach establishes a new state-of-the-art performance and shows a 3 percent improvement in F1. Furthermore, we demonstrate that it can achieve cutting-edge performance for visual EE even in a zero-shot scenario with no annotated data in the visual modality.

DomainPlus: Cross Transform Domain Learning towards High Dynamic Range Imaging

  • Bolun Zheng
  • Xiaokai Pan
  • Hua Zhang
  • Xiaofei Zhou
  • Gregory Slabaugh
  • Chenggang Yan
  • Shanxin Yuan

High dynamic range (HDR) imaging by combining multiple low dynamic range (LDR) images of different exposures provides a promising way to produce high quality photographs. However, the misalignment between the input images leads to ghosting artifacts in the reconstructed HDR image. In this paper, we propose a cross-transform domain neural network for efficient HDR imaging. Our approach consists of two modules: a merging module and a restoration module. For the merging module, we propose a Multiscale Attention with Fronted Fusion (MAFF) mechanism to achieve coarse-to-fine spatial fusion. For the restoration module, we propose fronted Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT)-based learnable bandpass filters to formulate a cross-transform domain learning block, dubbed DomainPlus Block (DPB) for effective ghosting removal. Our ablation study and comprehensive experiments show that DomainPlus outperforms the existing state-of-the-art on several datasets.

Tracking Game: Self-adaptative Agent based Multi-object Tracking

  • Shuai Wang
  • Da Yang
  • Yubin Wu
  • Yang Liu
  • Hao Sheng

Multi-object tracking (MOT) has become a hot task in multimedia analysis. It not only locates objects but also maintains their unique identities. However, previous methods encounter tracking failures in complex scenes, since they lose most of the unique attributes of each target. In this paper, we formulate the MOT problem as a Tracking Game and propose a Self-adaptative Agent Tracker (SAT) framework to solve it. The roles in the Tracking Game are divided into two classes: the agent player and the game organizer. The organizer controls the game and optimizes the agents' actions from a global perspective. The agent encodes the attributes of targets and selects actions dynamically. For these purposes, we design the State Transition Net to update the agent state and the Action Decision Net to implement a flexible tracking strategy for each agent. Finally, we present an organizer-agent coordination tracking algorithm to leverage both global and individual information. The experiments show that the proposed SAT achieves state-of-the-art performance on both the MOT17 and MOT20 benchmarks.

Self-Supervised Text Erasing with Controllable Image Synthesis

  • Gangwei Jiang
  • Shiyao Wang
  • Tiezheng Ge
  • Yuning Jiang
  • Ying Wei
  • Defu Lian

Recent efforts on text erasing have shown promising results. However, existing methods require rich yet costly label annotations to obtain robust models, which limits their use for practical applications. To this end, we study an unsupervised scenario by proposing a novel Self-supervised Text Erasing (STE) framework that jointly learns to synthesize training images with erasure ground-truth and accurately erase texts in the real world. We first design a style-aware image synthesis function to generate synthetic images with diverse styled texts based on two synthetic mechanisms. To bridge the text style gap between the synthetic and real-world data, a policy network is constructed to control the synthetic mechanisms by picking style parameters with the guidance of two specifically designed rewards. The synthetic training images with ground-truth are then fed to train a coarse-to-fine erasing network. To produce better erasing outputs, a triplet erasure loss is designed to enforce the refinement stage to recover background textures. Moreover, we provide a new dataset (called PosterErase), which contains 60K high-resolution posters and is more challenging for the erasing task. The proposed method has been extensively evaluated with both PosterErase and the widely-used SCUT-Enstext dataset. Notably, on PosterErase, our method achieves 5.07 in terms of FID, with a relative improvement of 20.9% over existing supervised baselines.

Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold

  • Zijie Wang
  • Aichun Zhu
  • Jingyi Xue
  • Xili Wan
  • Chao Liu
  • Tian Wang
  • Yifeng Li

The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches contrive to learn a latent common manifold mapping paradigm following a cross-modal distribution consensus prediction (CDCP) manner. When mapping features from the distribution of one modality into the common manifold, the feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus so as to embed and align the multi-modal features in a constructed cross-modal common manifold all depends on the experience of the model itself, instead of the actual situation. With such methods, it is inevitable that the multi-modal data cannot be well aligned in the common manifold, which finally leads to sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C3M) for text-based person retrieval. The core idea of our method, as a Chinese saying goes, is 'san si er hou xing', namely, to Look Before yoU Leap (LBUL). The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers the distribution characteristics of both the visual and textual modalities before embedding data from one modality into C3M to achieve a more solid cross-modal distribution consensus, and hence a superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves state-of-the-art performance.

The More, The Better? Active Silencing of Non-Positive Transfer for Efficient Multi-Domain Few-Shot Classification

  • Xingxing Zhang
  • Zhizhe Liu
  • Weikai Yang
  • Liyuan Wang
  • Jun Zhu

Few-shot classification refers to recognizing several novel classes given only a few labeled samples. Many recent methods try to gain an adaptation benefit by learning prior knowledge from more base training domains, a.k.a. multi-domain few-shot classification. However, with extensive empirical evidence, we find that more is not always better: current models do not necessarily benefit from pre-training on more base classes and domains, since the pre-trained knowledge might be non-positive for a downstream task. In this work, we hypothesize that such redundant pre-training can be avoided without compromising the downstream performance. Inspired by the selective activating/silencing mechanism in the biological memory system, which enables the brain to learn a new concept from a few experiences both quickly and accurately, we propose to actively silence those redundant base classes and domains for efficient multi-domain few-shot classification. Then, a novel data-driven approach named Active Silencing with hierarchical Subset Selection (AS3) is developed to address two problems: 1) finding a subset of base classes that adequately represents novel classes for efficient positive transfer; and 2) finding a subset of base learners (i.e., domains) with confident, accurate predictions in a new domain. Both problems are formulated as distance-based sparse subset selection. We extensively evaluate AS3 on the recent META-DATASET benchmark as well as MNIST, CIFAR10, and CIFAR100, where AS3 achieves over 100% acceleration while maintaining or even improving accuracy. Our code and Appendix are available at https://github.com/indussky8/AS3.

Hierarchical Few-Shot Object Detection: Problem, Benchmark and Method

  • Lu Zhang
  • Yang Wang
  • Jiaogen Zhou
  • Chenbo Zhang
  • Yinglu Zhang
  • Jihong Guan
  • Yatao Bian
  • Shuigeng Zhou

Few-shot object detection (FSOD) aims to detect objects from only a few examples. However, existing FSOD methods do not consider the hierarchical fine-grained category structures of objects that exist widely in real life. For example, animals are taxonomically classified into orders, families, genera, species, etc. In this paper, we propose and solve a new problem called hierarchical few-shot object detection (Hi-FSOD), which aims to detect objects with hierarchical categories in the FSOD paradigm. To this end, on the one hand, we build the first large-scale and high-quality Hi-FSOD benchmark dataset HiFSOD-Bird, which contains 176,350 wild-bird images falling into 1,432 categories. All the categories are organized into a 4-level taxonomy, consisting of 32 orders, 132 families, 572 genera and 1,432 species. On the other hand, we propose the first Hi-FSOD method, HiCLPL, where a hierarchical contrastive learning approach is developed to constrain the feature space so that the feature distribution of objects is consistent with the hierarchical taxonomy and the model's generalization power is strengthened. Meanwhile, a probabilistic loss is designed to enable the child nodes to correct the classification errors of their parent nodes in the taxonomy. Extensive experiments on the benchmark dataset HiFSOD-Bird show that our method HiCLPL outperforms existing FSOD methods.

Few-shot X-ray Prohibited Item Detection: A Benchmark and Weak-feature Enhancement Network

  • Renshuai Tao
  • Tianbo Wang
  • Ziyang Wu
  • Cong Liu
  • Aishan Liu
  • Xianglong Liu

X-ray prohibited item detection in security inspection plays an important role in protecting public safety. It is a typical few-shot object detection (FSOD) task because some categories of prohibited items, e.g., pistols, are highly scarce due to their low-frequency appearance, a fact that has been ignored by recent X-ray detection works. In contrast to most FSOD studies that rely on rich feature correlations from natural scenarios, the more practical X-ray security inspection usually faces the dilemma that only weak features are learnable due to heavy occlusion, color fading, etc., which causes a severe performance drop when traditional FSOD methods are adopted. However, professional X-ray FSOD evaluation benchmarks and effective models for this scenario have rarely been studied in recent years. Therefore, in this paper, we propose the first X-ray FSOD dataset for the typical industrial X-ray security inspection scenario, consisting of 12,333 images and 41,704 instances from 20 categories, which could benchmark and promote FSOD studies in such more challenging scenarios. Further, we propose the Weak-feature Enhancement Network (WEN) containing two core modules, i.e., Prototype Perception (PR) and Feature Reconciliation (FR), where PR first generates a prototype library by aggregating and extracting the basis feature from critical regions around instances, to generate the basis information for each category; FR then adaptively adjusts the impact intensity of the corresponding prototype and forces the model to precisely enhance the weak features of specific objects through the basis information. This mechanism is also effective in traditional FSOD tasks. Extensive experiments on the X-ray FSOD and Pascal VOC datasets demonstrate that WEN outperforms other baselines in both X-ray and common scenarios.

High-Fidelity Variable-Rate Image Compression via Invertible Activation Transformation

  • Shilv Cai
  • Zhijun Zhang
  • Liqun Chen
  • Luxin Yan
  • Sheng Zhong
  • Xu Zou

Learning-based methods have greatly advanced the field of image compression. Meanwhile, variational autoencoder (VAE)-based variable-rate approaches have recently gained much attention, as they avoid using a set of different networks for various compression rates. Despite the remarkable performance that has been achieved, these approaches are readily corrupted once multiple compression/decompression operations are executed, causing image quality to drop dramatically and strong artifacts to appear. Thus, we tackle the issue of high-fidelity fine variable-rate image compression and propose the Invertible Activation Transformation (IAT) module. We implement the IAT in a mathematically invertible manner on a single-rate Invertible Neural Network (INN)-based model, and the quality level (QLevel) is fed into the IAT to generate scaling and bias tensors. IAT and QLevel together give the image compression model the ability of fine variable-rate control while better maintaining image fidelity. Extensive experiments demonstrate that a single-rate image compression model equipped with our IAT module can achieve variable-rate control without any compromise. Our IAT-embedded model also obtains comparable rate-distortion performance with recent learning-based image compression methods. Furthermore, our method outperforms the state-of-the-art variable-rate image compression method by a large margin, especially after multiple re-encodings.
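
A minimal, hedged sketch of a QLevel-conditioned invertible transform is shown below: a small conditioning network maps the quality level to channel-wise scale and bias, applied as y = x * exp(s) + b with the exact inverse x = (y - b) * exp(-s). The conditioning network, channel count, and QLevel encoding are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class InvertibleActivation(nn.Module):
    """Quality-level-conditioned channel-wise affine transform with an exact
    inverse, so it can be inserted into an invertible compression model
    without breaking invertibility."""
    def __init__(self, channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * channels))

    def _params(self, qlevel, batch):
        q = torch.full((batch, 1), float(qlevel))
        s, b = self.net(q).chunk(2, dim=1)
        return s[..., None, None], b[..., None, None]   # (B, C, 1, 1)

    def forward(self, x, qlevel):
        s, b = self._params(qlevel, x.shape[0])
        return x * torch.exp(s) + b

    def inverse(self, y, qlevel):
        s, b = self._params(qlevel, y.shape[0])
        return (y - b) * torch.exp(-s)

iat = InvertibleActivation(channels=8)
x = torch.randn(2, 8, 16, 16)
y = iat(x, qlevel=0.5)
print(torch.allclose(iat.inverse(y, qlevel=0.5), x, atol=1e-5))  # True
```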

Cycle Encoding of a StyleGAN Encoder for Improved Reconstruction and Editability

  • Xudong Mao
  • Liujuan Cao
  • Aurele Tohokantche Gnanha
  • Zhenguo Yang
  • Qing Li
  • Rongrong Ji

GAN inversion aims to invert an input image into the latent space of a pre-trained GAN. Despite the recent advances in GAN inversion, there remain challenges in mitigating the tradeoff between distortion and editability, i.e., reconstructing the input image accurately and editing the inverted image with a small visual quality drop. The recently proposed pivotal tuning model makes significant progress towards reconstruction and editability by using a two-step approach that first inverts the input image into a latent code, called the pivot code, and then alters the generator so that the input image can be accurately mapped into the pivot code. Here, we show that both reconstruction and editability can be improved by a proper design of the pivot code. We present a simple yet effective method, named cycle encoding, for a high-quality pivot code. The key idea of our method is to progressively train an encoder in varying spaces according to a cycle scheme: W->W+->W. This training methodology preserves the properties of both the W and W+ spaces, i.e., the high editability of W and the low distortion of W+. To further decrease the distortion, we also propose to refine the pivot code with an optimization-based method, where a regularization term is introduced to reduce the degradation in editability. Qualitative and quantitative comparisons to several state-of-the-art methods demonstrate the superiority of our approach.

Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging

  • Yeqi Bai
  • Tao Ma
  • Lipo Wang
  • Zhenjie Zhang

While deep learning technologies are now capable of generating realistic images that confuse humans, research efforts are turning to the synthesis of images for more concrete and application-specific purposes. Facial image generation based on vocal characteristics from speech is one such important yet challenging task. It is the key enabler of influential use cases of image generation, especially for business in public security and entertainment. Existing solutions to the speech2face problem render limited image quality and fail to preserve facial similarity due to the lack of a quality dataset for training and of appropriate integration of vocal features. In this paper, we investigate these key technical challenges and propose Speech Fusion to Face, or SF2F in short, attempting to address the issues of facial image quality and the poor connection between the vocal feature domain and modern image generation models. By adopting new strategies for data modeling and training, we demonstrate a dramatic performance boost over the state-of-the-art solution, doubling the recall of individual identity and lifting the quality score from 15 to 19 based on the mutual information score with the VGGFace classifier.

Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

  • Wei Li
  • Tianzhao Yang
  • Xiao Wu
  • Xian-Jun Du
  • Jian-Jun Qiao

Learning spatial and temporal relations among people plays an important role in recognizing group activity. Recently, transformer-based methods have become popular solutions due to the self-attention mechanism. However, the person-level features are fed directly into the self-attention module without any refinement. Moreover, group activity in a clip often involves unbalanced spatio-temporal interactions, where only a few persons with special actions are critical to identifying different activities. It is difficult to learn the spatio-temporal interactions without elaborately modeling the action dependencies among all people. In this paper, a novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture the interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies. Specifically, ASTFormer starts by assigning all persons in each frame to latent actions, while an action-centric aggregation strategy is performed by weighting the sum of residuals for each latent action under the supervision of global action information. Then, a dual-branch transformer is proposed to refine the inter- and intra-frame action-level features, where two encoders with the self-attention mechanism are employed to select important tokens. Next, a semantic action graph is explicitly devised to model the dynamic action-wise dependencies. Finally, our model is capable of boosting group activity recognition by fusing these important cues, while only requiring video-level action labels. Extensive experiments on two popular benchmarks (Volleyball and Collective Activity) demonstrate the superior performance of our method in comparison with state-of-the-art methods using only raw RGB frames as input.

A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

  • Yangyang Guo
  • Liqiang Nie
  • Yongkang Wong
  • Yibing Liu
  • Zhiyong Cheng
  • Mohan Kankanhalli

Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Despite its significance, this paper identifies several leading factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat the knowledge as a complement to the coarsely trained VQA model. Despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, the multi-modal implicit knowledge for knowledge-based VQA still remains largely unexplored. This work presents a unified end-to-end retriever-reader framework for knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning. As for the noise problem encountered by the retrieval operation on explicit knowledge, we design a novel scheme to create pseudo labels for effective knowledge supervision. This scheme is able not only to provide guidance for knowledge retrieval, but also to drop instances that are potentially error-prone for question answering. To validate the effectiveness of the proposed method, we conduct extensive experiments on the benchmark dataset. The experimental results reveal that our method outperforms existing baselines by a noticeable margin. Beyond the reported numbers, this paper further offers several insights on knowledge utilization for future research with some empirical findings.

PIA: Parallel Architecture with Illumination Allocator for Joint Enhancement and Detection in Low-Light

  • Tengyu Ma
  • Long Ma
  • Xin Fan
  • Zhongxuan Luo
  • Risheng Liu

Visual perception in low-light conditions (e.g., nighttime) plays an important role in various multimedia-related applications (e.g., autonomous driving). Enhancement (which provides a visual-friendly appearance) and detection (which detects instances of objects) in low light are two fundamental and crucial visual perception tasks. In this paper, we address how to simultaneously realize low-light enhancement and detection from two aspects. First, we define a parallel architecture to satisfy the demands of both tasks. Within it, a decomposition-type warm start acting at the entrance of the parallel architecture is developed to mitigate, to some extent, the adverse effects brought by low-light scenes. Second, a novel illumination allocator is designed by encoding the key illumination component (the inherent difference between normal light and low light) to extract hierarchical features that assist enhancement and detection. Further, we provide a substantive discussion of our proposed method: we solve enhancement in a coarse-to-fine manner and handle detection in a decomposed-to-integrated fashion. Finally, multidimensional analytical and evaluation experiments are performed to demonstrate our effectiveness and superiority. The code is available at https://github.com/tengyu1998/PIA

Robust Actor Recognition in Entertainment Multimedia at Scale

  • Abhinav Aggarwal
  • Yash Pandya
  • Lokesh A. Ravindranathan
  • Laxmi S. Ahire
  • Manivel Sethu
  • Kaustav Nandy

Actor identification and localization in movies and TV series seasons can enable deeper engagement with the content. Manual actor identification and tagging at every time instance in a video is error-prone, as it is a highly repetitive, decision-intensive and time-consuming task. The goal of this paper is to accurately label as many faces as possible in the video with actor names. We solve this problem using a multi-step clustering process followed by a selection of face instances that are: (a) representative of their member clusters and (b) aesthetically pleasing for visual identification. These face instances can be matched with actor names by automated or manual techniques to complete actor tagging. This solution is further optimized for seasons with repeating cast members, which constitute the majority of entertainment multimedia content. In such titles, the face labels from the previous episodes are efficiently used to pre-label faces in the subsequent episode. We guarantee the same level of accuracy even after scaling the solution to TV series seasons. This novel solution works in a completely realistic setup where the input to the solution is just the raw video. This is the first known work that has proved its robustness on more than 5000 TV episodes and movies across different genres, languages and runtimes, with actors of diverse ethnicity, race, gender identity, age, etc. The proposed solution establishes a new state-of-the-art for cluster purity in both movies and TV series seasons by achieving near-perfect cluster homogeneity.

MF-Net: A Novel Few-shot Stylized Multilingual Font Generation Method

  • Yufan Zhang
  • Junkai Man
  • Peng Sun

Creating a complete stylized font library that helps the audience perceive information from text often requires years of study and proficiency in the use of many professional tools. Accordingly, automatic stylized font generation in a deep learning-based fashion is a desirable but challenging task that has attracted a lot of attention in recent years. This paper revisits the state-of-the-art methods for stylized font generation and presents a taxonomy of deep learning-based stylized font generation. Despite the notable performance of existing models, stylized multilingual font generation, the task of applying a specific font style to diverse characters in multiple languages, has never been reported to be addressed. An efficient and economical method for stylized multilingual font generation is essential in numerous application scenarios that require communication with international audiences. We propose a solution for few-shot multilingual stylized font generation via a fast feed-forward network, the Multilingual Font Generation Network (MF-Net), which can transfer previously unseen font styles from a few samples to characters from previously unseen languages. Following the Generative Adversarial Network (GAN) framework, MF-Net adopts two separate encoders in the generator to decouple a font image's content and style information. We adopt an attention module in the style encoder to extract both shallow and deep style features. Moreover, we design a novel language complexity-aware skip connection to adaptively adjust the structural information to be preserved. With an effective loss function to improve the visual quality of the generated font images, we show the effectiveness of the proposed MF-Net based on quantitative and subjective visual evaluation, and compare it with existing models in the scenario of stylized multilingual font generation. The source code is available at https://github.com/iamyufan/MF-Net.

Feature and Semantic Views Consensus Hashing for Image Set Classification

  • Yuan Sun
  • Dezhong Peng
  • Haixiao Huang
  • Zhenwen Ren

Image set classification (ISC) has always been an active topic, primarily because an image set can provide more comprehensive information to describe a subject. However, the existing ISC methods face two problems: (1) the high computational cost prohibits these methods from being applied to medium- or large-scale applications; (2) the consensus information between the feature and semantic representations of an image set is largely ignored. To overcome these issues, in this paper, we propose a novel ISC method, termed feature and semantic views consensus hashing (FSVCH). Specifically, a kernelized bipartite graph is constructed to capture the nonlinear structure of the data, and then two-view (i.e., feature and semantic) consensus hashing learning (TCHL) is proposed to obtain shared hidden consensus information. Meanwhile, for robust out-of-sample prediction, we further propose TCHL-guided optimal hash function inversion (TGHI) to learn a high-quality general hash function. Afterwards, hashing rotation (HR) is employed to obtain a closer approximation to the real-valued hash solution. A large number of experiments show that FSVCH remarkably outperforms comparison methods on three benchmark datasets in terms of running time and classification performance. Experimental results also indicate that FSVCH scales to medium- or large-scale ISC tasks.

Evidential Reasoning for Video Anomaly Detection

  • Che Sun
  • Yunde Jia
  • Yuwei Wu

Video anomaly detection aims to discriminate events that deviate from normal patterns in a video. Modeling the decision boundaries of anomalies is challenging due to the uncertainty in the probability of deviating from normal patterns. In this paper, we propose a deep evidential reasoning method that explicitly learns this uncertainty to model the boundaries. Our method encodes various visual cues as evidence representing potential deviations, assigns beliefs to the predicted probability of deviating from normal patterns based on the evidence, and estimates the uncertainty from the remaining belief mass to model the boundaries. To do this, we build a deep evidential reasoning network that encodes evidence vectors and estimates uncertainty by learning evidence distributions and deriving beliefs from these distributions. We introduce an unsupervised strategy to train our network by minimizing an energy function of a deep Gaussian mixture model (GMM). Experimental results show that our uncertainty score is beneficial for modeling the boundaries of video anomalies on three benchmark datasets.
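The belief-and-uncertainty computation follows the general evidential (Dirichlet) recipe: non-negative evidence is turned into beliefs, and the unassigned mass serves as the uncertainty. The sketch below shows that generic computation only; it does not reproduce the authors' network or GMM-based training.

    # Generic evidential belief/uncertainty computation (Dirichlet formulation); illustrative only.
    import torch
    import torch.nn.functional as F

    def evidential_outputs(logits):
        """logits: (batch, K) raw network outputs for K hypotheses."""
        evidence = F.softplus(logits)            # non-negative evidence per hypothesis
        alpha = evidence + 1.0                   # Dirichlet concentration parameters
        strength = alpha.sum(dim=1, keepdim=True)
        belief = evidence / strength             # mass assigned to each hypothesis
        uncertainty = logits.size(1) / strength  # remaining (unassigned) mass, in (0, 1]
        prob = alpha / strength                  # expected probability under the Dirichlet
        return belief, uncertainty, prob

    logits = torch.randn(4, 2)                   # e.g., normal vs. deviating
    b, u, p = evidential_outputs(logits)
    print(u.squeeze(1))                          # higher values indicate more ambiguous events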

Gaze- and Spacing-flow Unveil Intentions: Hidden Follower Discovery

  • Danni Xu
  • Ruimin Hu
  • Zheng Wang
  • Linbo Luo
  • Dengshi Li
  • Wenjun Zeng

We raise a new and challenging multimedia application in video surveillance systems, i.e., Hidden Follower Discovery (HFD). In contrast to common abnormal behaviors, which are ongoing activities, hidden following is a preparatory action rather than an activity in progress. Hidden following behavior has no salient features, making it hard to discover. Fortunately, from a socio-cognitive perspective, we found and verified the phenomenon that the gaze-flow and spacing-flow patterns of hidden followers differ from those of normal followers. To promote HFD research, we construct two pioneering datasets and devise an HFD baseline network based on the recognition of both gaze-flow and spacing-flow patterns from surveillance videos. Extensive experiments demonstrate their effectiveness.

Semi-supervised Learning for Multi-label Video Action Detection

  • Hongcheng Zhang
  • Xu Zhao
  • Dongqi Wang

Semi-supervised multi-label video action detection aims to locate all persons and recognize their multiple action labels by leveraging both labeled and unlabeled videos. Compared to the single-label scenario, semi-supervised learning in multi-label video action detection is more challenging due to two significant issues: the generation of multiple pseudo labels and the class-imbalanced data distribution. In this paper, we propose an effective semi-supervised learning method to tackle these challenges. Firstly, to make full use of the informative unlabeled data for better training, we design an effective multiple pseudo labeling strategy by setting a dynamic learnable threshold for each class. Secondly, to handle the long-tailed distribution of each class, we propose an unlabeled class balancing strategy. We select training samples according to the multiple pseudo labels generated during the training iteration, instead of the usual data re-sampling that requires label information before training. Balanced re-weighting is then leveraged to mitigate the class imbalance caused by multi-label co-occurrence. Extensive experiments conducted on two challenging benchmarks, AVA and UCF101-24, demonstrate the effectiveness of our proposed designs. By using the unlabeled data effectively, our method achieves state-of-the-art performance in video action detection on both the AVA and UCF101-24 datasets. Besides, it still achieves competitive performance compared with fully-supervised methods when using limited annotations on the AVA dataset.
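The pseudo-labeling idea can be sketched as per-class thresholds that move during training; the update rule below (a running mean of each class's confidence) is an assumption for illustration, not the learnable mechanism proposed in the paper.

    # Sketch of per-class dynamic thresholds for multi-label pseudo labels; illustrative only.
    import torch

    class DynamicThresholds:
        def __init__(self, num_classes, init=0.5, momentum=0.9):
            self.tau = torch.full((num_classes,), init)   # one threshold per action class
            self.momentum = momentum

        def update(self, probs):
            # Move each class threshold toward the mean confidence observed for that class.
            self.tau = self.momentum * self.tau + (1 - self.momentum) * probs.mean(dim=0)

        def pseudo_labels(self, probs):
            # A sample can receive several action labels at once (multi-label setting).
            return (probs >= self.tau).float()

    probs = torch.sigmoid(torch.randn(8, 80))             # unlabeled-batch predictions, 80 classes
    thr = DynamicThresholds(num_classes=80)
    thr.update(probs)
    labels = thr.pseudo_labels(probs)                     # (8, 80) binary pseudo labels
    print(labels.sum(dim=1))                              # number of pseudo labels per sample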

Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification

  • Bo Zhang
  • Jiakang Yuan
  • Baopu Li
  • Tao Chen
  • Jiayuan Fan
  • Botian Shi

Few-shot fine-grained learning aims to classify a query image into one of a set of support categories with fine-grained differences. Although learning different objects' local differences via Deep Neural Networks has achieved success, how to exploit the query-support cross-image object semantic relations in Transformer-based architecture remains under-explored in the few-shot fine-grained scenario. In this work, we propose a Transformer-based double-helix model, namely HelixFormer, to achieve the cross-image object semantic relation mining in a bidirectional and symmetrical manner. The HelixFormer consists of two steps: 1) Relation Mining Process (RMP) across different branches, and 2) Representation Enhancement Process (REP) within each individual branch. By the designed RMP, each branch can extract fine-grained object-level Cross-image Semantic Relation Maps (CSRMs) using information from the other branch, ensuring better cross-image interaction in semantically related local object regions. Further, with the aid of CSRMs, the developed REP can strengthen the extracted features for those discovered semantically-related local regions in each branch, boosting the model's ability to distinguish subtle feature differences of fine-grained objects. Extensive experiments conducted on five public fine-grained benchmarks demonstrate that HelixFormer can effectively enhance the cross-image object semantic relation matching for recognizing fine-grained objects, achieving much better performance over most state-of-the-art methods under 1-shot and 5-shot scenarios.
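The bidirectional relation mining between the query and support branches can be approximated with symmetric cross-attention over patch tokens, as in the minimal sketch below; the dimensions and the residual update stand in for the paper's CSRM construction and are assumptions.

    # Sketch of bidirectional (symmetric) cross-attention between two branches; illustrative only.
    import torch
    import torch.nn as nn

    class SymmetricCrossAttention(nn.Module):
        def __init__(self, dim=384, heads=6):
            super().__init__()
            self.q2s = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.s2q = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, query_tokens, support_tokens):
            # Each branch attends to the other, so semantically related regions are matched both ways.
            q_rel, _ = self.q2s(query_tokens, support_tokens, support_tokens)
            s_rel, _ = self.s2q(support_tokens, query_tokens, query_tokens)
            # The cross-image relation features are then used to enhance each branch.
            return query_tokens + q_rel, support_tokens + s_rel

    q = torch.randn(4, 196, 384)      # query-image patch tokens
    s = torch.randn(4, 196, 384)      # support-image patch tokens
    q_out, s_out = SymmetricCrossAttention()(q, s)
    print(q_out.shape, s_out.shape)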

Progressive Spatial-temporal Collaborative Network for Video Frame Interpolation

  • Mengshun Hu
  • Kui Jiang
  • Liang Liao
  • Zhixiang Nie
  • Jing Xiao
  • Zheng Wang

Most video frame interpolation (VFI) algorithms infer the intermediate frame from adjacent frames through cascaded motion estimation and content refinement. However, the intrinsic correlations between motion and content are barely investigated, commonly producing interpolated results with inconsistent and blurry contents. We first identify a simple yet essential piece of domain knowledge: content and motion characteristics of the same objects should be homogeneous to a certain degree, and we formulate this consistency into the loss function for model optimization. Based on this, we propose to learn the collaborative representation between motions and contents, and construct a novel progressive spatial-temporal collaborative network (Prost-Net) for video frame interpolation. Specifically, we develop a content-guided motion module (CGMM) and a motion-guided content module (MGCM) for individual content and motion representation. In particular, the predicted motion in CGMM is used to guide the fusion and distillation of contents for intermediate frame interpolation, and vice versa. Furthermore, by adopting this collaborative strategy in a multi-scale framework, our Prost-Net progressively optimizes motions and contents in a coarse-to-fine manner, making it robust to various challenging scenarios (occlusion and large motions) in VFI. Extensive experiments on the benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods.

Best of Both Worlds: See and Understand Clearly in the Dark

  • Xinwei Xue
  • Jia He
  • Long Ma
  • Yi Wang
  • Xin Fan
  • Risheng Liu

Recently, with the development of intelligent technology, the perception of low-light scenes has been gaining widespread attention. However, existing techniques usually focus on only one task (e.g., enhancement) and lose sight of the others (e.g., detection), making it difficult to perform all of them well at the same time. To overcome this limitation, we propose a new method that can handle visual quality enhancement and semantic-related tasks (e.g., detection, segmentation) simultaneously in a unified framework. Specifically, we build a cascaded architecture to meet the task requirements. To better exploit the entanglement between the two tasks and achieve mutual guidance, we develop a new contrastive-alternative learning strategy for learning the model parameters, which largely improves the representational capacity of the cascaded architecture. Notably, the contrastive learning mechanism establishes communication between the two objective tasks in essence, which actually extends the capability of contrastive learning to some extent. Finally, extensive experiments are performed to fully validate the advantages of our method over other state-of-the-art works in enhancement, detection, and segmentation. A series of analytical evaluations are also conducted to further reveal the effectiveness of our method. The code is available at https://github.com/k914/contrastive-alternative-learning.

Meta Clustering Learning for Large-scale Unsupervised Person Re-identification

  • Xin Jin
  • Tianyu He
  • Xu Shen
  • Tongliang Liu
  • Xinchao Wang
  • Jianqiang Huang
  • Zhibo Chen
  • Xian-Sheng Hua

Unsupervised Person Re-identification (U-ReID) with pseudo labeling has recently reached performance competitive with fully-supervised ReID methods by relying on modern clustering algorithms. However, such a clustering-based scheme becomes computationally prohibitive for large-scale datasets, making it infeasible for real-world applications. How to efficiently leverage endless unlabeled data with limited computing resources for better U-ReID remains under-explored. In this paper, we make the first attempt at large-scale U-ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL). MCL pseudo-labels only a subset of the entire unlabeled data via clustering to save computation in the first-phase training. After that, the learned cluster centroids, termed meta-prototypes in our MCL, are regarded as a proxy annotator to softly annotate the remaining unlabeled data for further polishing the model. To alleviate the potential noisy-labeling issue in the polishing phase, we enforce two well-designed loss constraints to promote intra-identity consistency and strong inter-identity correlation. On multiple widely-used U-ReID benchmarks, our method significantly saves computational cost while achieving comparable or even better performance than prior works.
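The proxy-annotation step can be illustrated as prototype-based soft labeling: unlabeled embeddings are softly assigned to the meta-prototypes learned in the first phase. The cosine similarity and temperature below are assumptions, not the paper's exact formulation.

    # Sketch of proxy annotation with cluster centroids (meta-prototypes); illustrative only.
    import torch
    import torch.nn.functional as F

    def soft_annotate(features, prototypes, temperature=0.05):
        """features: (N, d) embeddings of unlabeled images; prototypes: (K, d) cluster centroids."""
        feats = F.normalize(features, dim=1)
        protos = F.normalize(prototypes, dim=1)
        sim = feats @ protos.t()                       # cosine similarity to each pseudo identity
        return F.softmax(sim / temperature, dim=1)     # soft pseudo labels over K identities

    features = torch.randn(1000, 256)
    prototypes = torch.randn(500, 256)                 # centroids from the first-phase clustering
    soft_labels = soft_annotate(features, prototypes)
    print(soft_labels.shape, soft_labels.sum(dim=1)[:3])   # each row sums to 1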

Adjustable Memory-efficient Image Super-resolution via Individual Kernel Sparsity

  • Xiaotong Luo
  • Mingliang Dai
  • Yulun Zhang
  • Yuan Xie
  • Ding Liu
  • Yanyun Qu
  • Yun Fu
  • Junping Zhang

Though single image super-resolution (SR) has witnessed incredible progress, the increasing model complexity impairs its application in memory-limited devices. To solve this problem, prior works have aimed to reduce the number of model parameters by exploiting sparsity, usually enforcing a group sparsity constraint at the filter level, which is therefore not arbitrarily adjustable to satisfy customized memory requirements. In this paper, we propose an individual kernel sparsity (IKS) method for memory-efficient and sparsity-adjustable image SR to aid deep network deployment in memory-limited devices. IKS imposes sparsity at the weight level, implicitly allocating the user-defined target sparsity to each individual kernel. To induce the kernel sparsity, a soft thresholding operation is used as a gating constraint for filtering out trivial weights. To achieve adjustable sparsity, a dynamic threshold learning algorithm is proposed, in which the threshold is updated through joint training with the network weights and is adaptively decayed under the guidance of the desired sparsity. This work essentially provides a dynamic parameter reassignment scheme with a given resource budget for an off-the-shelf SR model. Extensive experimental results demonstrate that IKS imparts considerable sparsity with negligible effect on SR quality. The code is available at: https://github.com/RaccoonDML/IKS.
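The soft-thresholding gate named in the abstract has a compact form: shrink each weight's magnitude by a learnable threshold and zero out whatever falls below it. The sketch below illustrates this on a single convolution; the threshold parameterization and the omission of the sparsity-guided decay schedule are simplifications, not the paper's exact algorithm.

    # Sketch of a learnable soft-thresholding gate for weight-level sparsity; illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftThresholdConv(nn.Module):
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
            self.raw_t = nn.Parameter(torch.tensor(-4.0))    # threshold learned with the weights

        def forward(self, x):
            t = F.softplus(self.raw_t)                        # keep the threshold non-negative
            # Soft thresholding: shrink magnitudes by t and zero out anything below it.
            w = torch.sign(self.weight) * F.relu(self.weight.abs() - t)
            return F.conv2d(x, w, padding=1)

        def sparsity(self):
            t = F.softplus(self.raw_t)
            return (self.weight.abs() <= t).float().mean().item()

    layer = SoftThresholdConv(64, 64)
    y = layer(torch.randn(1, 64, 32, 32))
    print(y.shape, f"current sparsity: {layer.sparsity():.2%}")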

GT-MUST: Gated Try-on by Learning the Mannequin-Specific Transformation

  • Ning Wang
  • Jing Zhang
  • Lefei Zhang
  • Dacheng Tao

Given the mannequin (i.e., reference person) and the target garment, the virtual try-on (VTON) task aims at dressing the mannequin in the provided garment automatically, and has attracted increasing attention in recent years. Previous works usually conduct the garment deformation under the guidance of "shape". However, such "shape-only transformation" ignores local structures and results in unnatural distortions. To address this issue, we propose a Gated Try-on method by learning the ManneqUin-Specific Transformation (GT-MUST). Technically, we implement GT-MUST as a three-stage deep neural model. First, GT-MUST learns the "mannequin-specific transformation" with a "take-off" mechanism, which recovers the warped clothes on the mannequin to their original in-shop state. Then, the learned "mannequin-specific transformation" is inverted and utilized to help generate the mannequin-specific warped state for a target garment. Finally, a special gate is employed to better combine the mannequin-specific warped garment with the mannequin. GT-MUST benefits from learning to solve the much easier "take-off" task to obtain mannequin-specific information rather than the common "try-on" task, since flat in-shop garments usually have less variation in shape than those clothed on the body. Experiments on the fashion dataset demonstrate that GT-MUST outperforms state-of-the-art virtual try-on methods. The code is available at https://github.com/wangning-001/GT-MUST.

PC2-PU: Patch Correlation and Point Correlation for Effective Point Cloud Upsampling

  • Chen Long
  • WenXiao Zhang
  • Ruihui Li
  • Hao Wang
  • Zhen Dong
  • Bisheng Yang

Point cloud upsampling aims to densify a sparse point set acquired from 3D sensors, providing a denser representation of the underlying surface. Existing methods divide the input points into small patches and upsample each patch separately, ignoring the global spatial consistency between patches. In this paper, we present a novel method, PC2-PU, which explores patch-to-patch and point-to-point correlations for more effective and robust point cloud upsampling. Specifically, our network has two appealing designs: (i) We take adjacent patches as supplementary inputs to compensate for the structure information lost within a single patch, and introduce a Patch Correlation Module to capture the differences and similarities between patches. (ii) After augmenting each patch's geometry, we further introduce a Point Correlation Module to reveal the relationships among points inside each patch and maintain local spatial consistency. Extensive experiments on both synthetic and real scanned datasets demonstrate that our method surpasses previous upsampling methods, particularly on noisy inputs. The code and data are at: https://github.com/chenlongwhu/PC2-PU.git.

Self-Supervised Multi-view Stereo via Adjacent Geometry Guided Volume Completion

  • Luoyuan Xu
  • Tao Guan
  • Yuesong Wang
  • Yawei Luo
  • Zhuo Chen
  • Wenkai Liu
  • Wei Yang

Existing self-supervised multi-view stereo (MVS) approaches largely rely on photometric consistency for geometry inference, and hence suffer from low-texture or non-Lambertian appearances. In this paper, we observe that adjacent geometry shares a certain commonality that can help to infer the correct geometry of challenging or low-confidence regions. Yet exploiting such a property in an unsupervised MVS approach remains challenging due to the lack of training data and the necessity of ensuring consistency between views. To address these issues, we propose a novel geometry inference training scheme that selectively masks regions with rich textures, where geometry can be well recovered and used as a supervisory signal, and then guides a deliberately designed cost volume completion network to learn how to recover the geometry of the masked regions. During inference, we instead mask the low-confidence regions and use the cost volume completion network for geometry correction. To deal with the different depth hypotheses of the cost volume pyramid, we design a three-branch volume inference structure for the completion network. Further, by considering planes as a special kind of geometry, we first identify planar regions from pseudo labels and then correct low-confidence pixels with high-confidence labels through plane normal consistency. Extensive experiments on DTU and Tanks & Temples demonstrate the effectiveness of the proposed framework and its state-of-the-art performance.

AtHom: Two Divergent Attentions Stimulated By Homomorphic Training in Text-to-Image Synthesis

  • Zhenbo Shi
  • Zhi Chen
  • Zhenbo Xu
  • Wei Yang
  • Liusheng Huang

Image generation from text is a challenging and ill-posed task. Images generated by previous methods usually have low semantic consistency with the texts, and the achieved resolution is limited. To generate semantically consistent high-resolution images, we propose a novel method named AtHom, in which two attention modules are developed to extract relationships from both the independent modalities and the unified modality. The first is a novel Independent Modality Attention Module (IAM), which identifies semantically important areas in generated images and extracts the informative context in texts. The second is a new module named Unified Semantic Space Attention Module (UAM), which captures the relationships between the extracted text context and essential areas in generated images. In particular, to bring the semantic features of texts and images closer in a unified semantic space, AtHom incorporates a homomorphic training mode that exploits an extra discriminator to distinguish between the two different modalities. Extensive experiments show that our AtHom surpasses previous methods by large margins.

One-step Low-Rank Representation for Clustering

  • Zhiqiang Fu
  • Yao Zhao
  • Dongxia Chang
  • Yiming Wang
  • Jie Wen
  • Xingxing Zhang
  • Guodong Guo

Existing low-rank representation-based methods adopt a two-step framework, which must employ an extra clustering method to obtain labels after representation learning. In this paper, a novel one-step representation-based method, i.e., One-step Low-Rank Representation (OLRR), is proposed to capture multi-subspace structures for clustering. OLRR integrates the low-rank representation model and clustering into a unified framework. Thus it can jointly learn the low-rank subspace structure embedded in the database and obtain the clustering results. In particular, by approximating the representation matrix with two identical clustering indicator matrices, OLRR can directly show the probability of samples belonging to each cluster. Further, a probability penalty is introduced to ensure that samples with smaller distances are more inclined to be in the same cluster, thus enhancing the discrimination of the clustering indicator matrix and resulting in more favorable clustering performance. Moreover, to enhance robustness against noise, OLRR uses the probability to guide denoising and then performs representation learning and clustering in a recovered clean space. Extensive experiments demonstrate the robustness and effectiveness of OLRR. Our code is publicly available at: https://github.com/fuzhiqiang1230/OLRR.

Customizing GAN Using Few-shot Sketches

  • Syed Muhammad Israr
  • Feng Zhao

Generative adversarial networks (GANs) have demonstrated remarkable success in image synthesis applications, but their performance deteriorates under limited data regimes. The fundamental challenge is that it is extremely difficult to synthesize photo-realistic and highly diversified images while capturing meaningful attributes of the targets under minimal supervision. Previous methods either fine-tune or rewrite the model weights to adapt to few-shot datasets. However, this either overfits or requires access to the large-scale data on which the models were originally trained. To tackle the problem, we propose a framework that repurposes existing pre-trained generative models using only a few sketch samples (e.g., fewer than 30). Unlike previous works, we transfer the sample diversity and quality without accessing the source data by using inter-domain distance consistency. By employing cross-domain adversarial learning, we encourage the model output to closely resemble the input sketches in both shape and pose. Extensive experiments show that our method significantly outperforms existing approaches in terms of sample quality and diversity. The qualitative and quantitative results on various standard datasets also demonstrate its efficacy. On the most popularly used dataset, Gabled church, we achieve a Fréchet inception distance (FID) score of 15.63.

Video Coding using Learned Latent GAN Compression

  • Mustafa Shukor
  • Bharath Bhushan Damodaran
  • Xu Yao
  • Pierre Hellier

We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flows model, where an entropy model can be optimized for image coding. In addition, we propose a new perceptual loss that is more efficient than other counterparts. Finally, an entropy model for video inter coding with residual is also learned in the previously constructed latent representation. Our method (SGANC) is simple, faster to train, and achieves better results for image and video coding compared to state-of-the-art codecs such as VTM, AV1, and recent deep learning techniques. In particular, it drastically minimizes perceptual distortion at low bit rates.

Action-conditioned On-demand Motion Generation

  • Qiujing Lu
  • Yipeng Zhang
  • Mingjian Lu
  • Vwani Roychowdhury

We propose a novel framework, On-Demand MOtion Generation (ODMO), for generating realistic and diverse long-term 3D human motion sequences conditioned only on action types, with an additional capability of customization. ODMO shows improvements over SOTA approaches on all traditional motion evaluation metrics when evaluated on three public datasets (HumanAct12, UESTC, and MoCap). Furthermore, we provide both qualitative evaluations and quantitative metrics demonstrating several first-known customization capabilities afforded by our framework, including mode discovery, interpolation, and trajectory customization. These capabilities significantly widen the spectrum of potential applications of such motion generation models. The novel on-demand generative capabilities are enabled by innovations in both the encoder and decoder architectures: (i) Encoder: Utilizing contrastive learning in a low-dimensional latent space to create a hierarchical embedding of motion sequences, where not only do the codes of different action types form different groups, but within an action type, codes of similar inherent patterns (motion styles) cluster together, making them readily discoverable; (ii) Decoder: Using a hierarchical decoding strategy where the motion trajectory is reconstructed first and then used to reconstruct the whole motion sequence. Such an architecture enables effective trajectory control. Our code is released on the GitHub page: https://github.com/roychowdhuryresearch/ODMO

Universal Domain Adaptive Object Detector

  • Wenxu Shi
  • Lei Zhang
  • Weijie Chen
  • Shiliang Pu

Universal domain adaptive object detection (UniDAOD) is more challenging than domain adaptive object detection (DAOD), since the label space of the source domain may not be the same as that of the target and the scale of objects in universal scenarios can vary dramatically (i.e., category shift and scale shift). To this end, we propose US-DAF, namely Universal Scale-Aware Domain Adaptive Faster RCNN with Multi-Label Learning, to reduce the negative transfer effect during training while maximizing transferability as well as discriminability in both domains under a variety of scales. Specifically, our method is implemented with two modules: 1) We facilitate the feature alignment of common classes and suppress the interference of private classes by designing a Filter Mechanism module to overcome the negative transfer caused by category shift. 2) We fill the blank of scale-aware adaptation in object detection by introducing a new Multi-Label Scale-Aware Adapter to perform individual alignment between corresponding scales of the two domains. Experiments show that US-DAF achieves state-of-the-art results on three scenarios (i.e., Open-Set, Partial-Set, and Closed-Set) and, in particular, yields 7.1% and 5.9% relative improvements on the benchmark datasets Clipart1k and Watercolor.

PIMoG: An Effective Screen-shooting Noise-Layer Simulation for Deep-Learning-Based Watermarking Network

  • Han Fang
  • Zhaoyang Jia
  • Zehua Ma
  • Ee-Chien Chang
  • Weiming Zhang

With the omnipresence of camera phones and digital displays, capturing digitally displayed images with a camera phone has become widely practiced. In the context of watermarking, this brings forth the issue of screen-shooting robustness. The key to acquiring screen-shooting robustness is designing a good noise layer that can represent screen-shooting distortions in a deep-learning-based watermarking framework. However, it is very difficult to quantitatively formulate the screen-shooting distortion since the screen-shooting process is too complex. In order to design an effective noise layer for screen-shooting robustness, we propose a new insight in this paper: it is not necessary to quantitatively simulate the overall procedure in the screen-shooting noise layer; including only the most influential distortions is enough to generate an effective noise layer with strong robustness. To verify this insight, we propose a screen-shooting noise layer dubbed PIMoG. Specifically, we summarize the most influential distortions of the screen-shooting process into three parts (perspective distortion, illumination distortion and moiré distortion) and simulate them in a differentiable way. For the remaining distortions, we utilize Gaussian noise to approximate their main part. As a result, the whole network can be trained end-to-end with such a noise layer. Extensive experiments illustrate the superior performance of the proposed PIMoG noise layer. In addition to the noise layer design, we also propose a gradient mask-guided image loss and an edge mask-guided image loss to further improve the robustness and invisibility of the whole network, respectively. Based on the proposed losses and the PIMoG noise layer, the whole framework outperforms the SOTA watermarking method by at least 5% in extraction accuracy and achieves more than 97% accuracy under different screen-shooting conditions.

MONOPOLY: Financial Prediction from MONetary POLicY Conference Videos Using Multimodal Cues

  • Puneet Mathur
  • Atula Neerkaje
  • Malika Chhibber
  • Ramit Sawhney
  • Fuming Guo
  • Franck Dernoncourt
  • Sanghamitra Dutta
  • Dinesh Manocha

Risk prediction and price movement classification are essential tasks in financial markets. Monetary policy calls (MPC) provide important insights into the actions taken by a country's central bank on economic goals related to inflation, employment, prices, and interest rates. Analyzing visual, vocal, and textual cues from MPC calls can help analysts and policymakers evaluate the economic risks and make sound investment decisions. To aid the analysis of MPC calls, we curate the Monopoly dataset, a collection of public conference call videos along with their corresponding audio recordings and text transcripts released by six international banks between 2009 and 2022. Our dataset is the first attempt to explore the benefits of visual cues in addition to audio and textual signals for financial prediction tasks. We introduce MPCNet, a competitive baseline architecture that takes advantage of the cross-modal transformer blocks and modality-specific attention fusion to forecast the financial risk and price movement associated with the MPC calls. Empirical results prove that the task is challenging, with the proposed architecture performing 5-18% better than strong Transformer-based baselines. We release the MPC dataset and benchmark models to motivate future research in this new challenging domain.
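The cross-modal fusion idea can be sketched as text tokens attending to the audio and video streams, followed by a modality-specific weighted pooling; the dimensions, the gating, and the single regression head below are assumptions, not the released MPCNet benchmark model.

    # Sketch of cross-modal attention fusion over text/audio/video streams; illustrative only.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.text_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.text_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Linear(3 * dim, 3)
            self.head = nn.Linear(dim, 1)            # e.g., a risk (volatility) regression head

        def forward(self, text, audio, video):
            t_a, _ = self.text_from_audio(text, audio, audio)    # text attends to vocal cues
            t_v, _ = self.text_from_video(text, video, video)    # text attends to visual cues
            pooled = [x.mean(dim=1) for x in (text, t_a, t_v)]
            w = torch.softmax(self.gate(torch.cat(pooled, dim=-1)), dim=-1)  # modality weights
            fused = sum(w[:, i:i + 1] * pooled[i] for i in range(3))
            return self.head(fused)

    text = torch.randn(2, 50, 128); audio = torch.randn(2, 200, 128); video = torch.randn(2, 80, 128)
    print(CrossModalFusion()(text, audio, video).shape)   # torch.Size([2, 1])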

Structure-Inferred Bi-level Model for Underwater Image Enhancement

  • Pan Mu
  • Haotian Qian
  • Cong Bai

Very recently, with the development of underwater robots, underwater image enhancement has attracted growing interest in the computer vision community. However, owing to light being scattered and absorbed while traveling through water, underwater captured images often suffer from color cast and low visibility. Existing methods depend on specific prior knowledge and training data to enhance underwater images in the absence of structure information, which results in poor and unnatural performance. To this end, we propose a Structure-Inferred Bi-level Model (SIBM) that incorporates different modalities of knowledge (i.e., the semantic domain, gradient domain, and pixel domain) to hierarchically enhance underwater images. In particular, by introducing a semantic mask, we individually optimize the foreground branch, which avoids unnecessary interference arising from the background region. We design a gradient-based high-frequency branch to exploit gradient-space guidance for preserving texture structures. Moreover, we construct a pixel-based branch that is fed with semantic and gradient information to enhance underwater images. To exploit the different modalities, we introduce a hyper-parameter optimization scheme to fuse the above domain information. Experimental results illustrate that the developed method not only outperforms previous methods in quantitative scores but also generalizes well to real-world underwater datasets. Source code is available at https://github.com/IntegralCoCo/SIBM.

Composite Photograph Harmonization with Complete Background Cues

  • Yazhou Xing
  • Yu Li
  • Xintao Wang
  • Ye Zhu
  • Qifeng Chen

Compositing portrait photographs or videos to novel backgrounds is an important application in computational photography. Seamless blending along boundaries and globally harmonic colors are two desired properties of the photo-realistic composition of foregrounds and new backgrounds. Existing works are dedicated to either foreground alpha matte generation or after-blending harmonization, leading to sub-optimal background replacement when putting foregrounds and backgrounds together. In this work, we unify the two objectives in a single framework to obtain realistic portrait image composites. Specifically, we investigate the usage of a target background and find that a complete background plays a vital role in both seamlessly blending and harmonization. We develop a network to learn the composition process given an imperfect alpha matte with appearance features extracted from the complete background to adjust color distribution. Our dedicated usage of a complete background enables realistic portrait image composition and also temporally stable results on videos. Extensive quantitative and qualitative experiments on both synthetic and real-world data demonstrate that our method achieves state-of-the-art performance.

Self-supervised Multi-view Stereo via Inter and Intra Network Pseudo Depth

  • Ke Qiu
  • Yawen Lai
  • Shiyi Liu
  • Ronggang Wang

Recent self-supervised learning-based multi-view stereo (MVS) approaches have shown promising results. However, previous methods primarily utilize view synthesis as the replacement for costly ground-truth depth data to guide network learning, and still leave a performance gap to recent supervised methods. In this paper, we propose a self-supervised dual-network MVS framework with inter- and intra-network pseudo depth labels for more powerful supervision guidance. Specifically, the inter-network pseudo depth labels are estimated by an unsupervised network, filtered by multi-view geometry consistency, iteratively updated by a pseudo-depth-supervised network, and finally refined by our efficient geometry-priority sampling strategy. We also dynamically generate multi-scale intra-network pseudo labels inside our cascaded unsupervised network during training to provide additional reliable supervision. Experimental results on the DTU and Tanks & Temples datasets demonstrate that our proposed method achieves state-of-the-art performance among unsupervised methods and even reaches comparable performance and generalization ability to supervised counterparts.

Delegate-based Utility Preserving Synthesis for Pedestrian Image Anonymization

  • Zhenzhong Kuang
  • Longbin Teng
  • Zhou Yu
  • Jun Yu
  • Jianping Fan
  • Mingliang Xu

The rapidly growing use of pedestrian images has raised wide concern about visual privacy protection, because personal information is at risk of disclosure. Anonymization by identity obfuscation is regarded as an effective solution. Most recent methods focus on the face, but this is not enough when the human body itself carries a great deal of identifiable information. This paper presents a new delegate-based utility preserving synthesis (DUPS) approach for pedestrian image anonymization. This is challenging because one may expect the anonymized image to remain useful in various computer vision tasks. We model DUPS as an adaptive translation process from source to target. To provide comprehensive identity protection, we first perform anonymous delegate sampling based on image-level differential privacy. To synthesize anonymous images, we then introduce an adaptive translation network and optimize it with a multi-task loss function. Our approach is theoretically sound and can generate diverse results while preserving data utility. Experiments on multiple datasets show that DUPS not only achieves superior anonymization performance against deep pedestrian recognizers, but also obtains a better tradeoff between privacy protection and utility preservation compared with state-of-the-art methods.

Video Instance Lane Detection via Deep Temporal and Geometry Consistency Constraints

  • Mingqian Wang
  • Yujun Zhang
  • Wei Feng
  • Lei Zhu
  • Song Wang

Video instance lane detection is one of the most important tasks in autonomous driving. Due to the very sparse regions and weak context in lane annotations, accurately detecting instance-level lanes in real-world traffic scenarios is challenging, especially for scenes with occlusion, bad weather, or dim and dazzling lights. Current methods mainly address this problem by integrating features of adjacent video frames to simply encourage temporal constancy for image-level lane detectors. However, most of them ignore the lane shape constraint across adjacent frames and the geometry consistency of individual lanes, thereby harming the performance of video instance lane detection. In this paper, we propose TGC-Net, which leverages temporal and geometry consistency constraints for reliable video instance lane detection. Specifically, we devise a temporal recurrent feature-shift aggregation module (T-RESA) to learn spatio-temporal lane features along the horizontal, vertical, and temporal directions of the feature tensor. We further impose a temporal consistency constraint by encouraging spatial distribution consistency among the lane features of adjacent frames. Besides, we devise two effective geometry constraints to ensure the integrity and continuity of lane predictions by leveraging a pairwise point affinity loss and vanishing-point-guided geometric context, respectively. Extensive experiments on a public benchmark dataset show that our TGC-Net quantitatively and qualitatively outperforms state-of-the-art video instance lane detectors and video object segmentation competitors. Our code and results have been released at https://github.com/wmq12345/TGC-Net.

Learning Visible Surface Area Estimation for Irregular Objects

  • Xu Liu
  • Jianing Li
  • Xianqi Zhang
  • Jingyuan Sun
  • Xiaopeng Fan
  • Yonghong Tian

Visible surface area estimation for irregular objects, one of the most fundamental and challenging topics in mathematics, supports a wide range of applications. Existing techniques usually estimate the visible surface area via mathematical modeling from 3D point clouds. However, 3D scanners are expensive, and the corresponding evaluation methods are complex. In this paper, we propose a novel problem setting, deep learning for visible surface area estimation, which is the first attempt to estimate the visible surface area of irregular objects from monocular images. Technically, we first build a novel visible surface area estimation dataset including 9099 real annotations. Then, we design a learning-based architecture to predict the visible surface area, including two core modules (i.e., the classification module and the area-bins module). The classification module predicts the interval of the visible surface area distribution and assists network training for more accurate visible surface area estimation. Meanwhile, the area-bins module, built on a transformer encoder, is proposed to distinguish the differences in visible surface area between irregular objects of the same category. The experimental results demonstrate that our approach can effectively estimate the visible surface area of irregular objects with various categories and sizes. We hope that this work will attract further research into this newly identified, yet crucial research direction. Our source code and data are available at https://github.com/liuxu0303/VSAnet.

Blind Robust Video Watermarking Based on Adaptive Region Selection and Channel Reference

  • Qinwei Chang
  • Leichao Huang
  • Shaoteng Liu
  • Hualuo Liu
  • Tianshu Yang
  • Yexin Wang

Digital watermarking technology has a wide range of applications in video distribution and copyright protection due to its excellent invisibility and convenient traceability. This paper proposes a robust blind watermarking algorithm using adaptive region selection and channel reference. By designing a combinatorial selection algorithm that uses texture information and feature points, the method automatically selects stable blocks that avoid being destroyed during video encoding and complex attacks. In addition, considering human insensitivity to some specific color components, a channel-referenced watermark embedding method is designed to reduce the impact on video quality. Moreover, unlike other methods that embed the watermark only at low frequencies, our method tends to modify low-frequency coefficients close to the mid frequencies, further ensuring stable retention of the watermark information during the video encoding process. Experimental results show that the proposed method achieves excellent video quality and high robustness against geometric attacks, compression, transcoding and camcorder recording attacks.

Disparity-based Stereo Image Compression with Aligned Cross-View Priors

  • Yongqi Zhai
  • Luyang Tang
  • Yi Ma
  • Rui Peng
  • Ronggang Wang

With the wide application of stereo images in various fields, research on stereo image compression (SIC) has attracted extensive attention from academia and industry. The core of SIC is to fully explore the mutual information between the left and right images and to reduce redundancy between views as much as possible. In this paper, we propose DispSIC, an end-to-end trainable deep neural network, in which we jointly train a stereo matching model to assist the image compression task. Based on the stereo matching results (i.e., disparity), the right image can be easily warped to the left view, so that only the residuals between the left and right views need to be encoded for the left image. A three-branch auto-encoder architecture is adopted in DispSIC, which encodes the right image, the disparity map and the residuals respectively. During training, the whole network learns how to adaptively allocate bitrates to these three parts, achieving better rate-distortion performance at the cost of only a low bitrate for the disparity map. Moreover, we propose a conditional entropy model with aligned cross-view priors for SIC, which takes the warped latents of the right image as priors to improve the accuracy of the probability estimation for the left image. Experimental results demonstrate that our proposed method achieves superior performance compared to other existing SIC methods on the KITTI and InStereo2K datasets, both quantitatively and qualitatively.
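The warping step at the heart of this pipeline is standard horizontal disparity warping, sketched below with grid_sample; the toy constant disparity and the plain residual are for illustration, not the trained DispSIC network.

    # Sketch of disparity-based warping for stereo residual coding; illustrative only.
    import torch
    import torch.nn.functional as F

    def warp_right_to_left(right, disparity):
        """right: (B, C, H, W) image; disparity: (B, 1, H, W) horizontal shift in pixels."""
        b, _, h, w = right.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        xs = xs[None].float().expand(b, -1, -1) - disparity[:, 0]     # shift sampling positions
        ys = ys[None].float().expand(b, -1, -1)
        grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
        return F.grid_sample(right, grid, align_corners=True)

    left = torch.rand(1, 3, 64, 96)
    right = torch.rand(1, 3, 64, 96)
    disp = torch.full((1, 1, 64, 96), 4.0)       # toy constant disparity
    warped = warp_right_to_left(right, disp)
    residual = left - warped                     # only this residual would be coded for the left view
    print(residual.abs().mean())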

Label-Efficient Domain Generalization via Collaborative Exploration and Generalization

  • Junkun Yuan
  • Xu Ma
  • Defang Chen
  • Kun Kuang
  • Fei Wu
  • Lanfen Lin

Considerable progress has been made in domain generalization (DG), which aims to learn a generalizable model from multiple well-annotated source domains for unknown target domains. However, it can be prohibitively expensive to obtain sufficient annotation for source datasets in many real scenarios. To escape the dilemma between domain generalization and annotation costs, in this paper, we introduce a novel task named label-efficient domain generalization (LEDG) to enable model generalization with label-limited source domains. To address this challenging task, we propose a novel framework called Collaborative Exploration and Generalization (CEG), which jointly optimizes active exploration and semi-supervised generalization. Specifically, in active exploration, to explore class and domain discriminability while avoiding information divergence and redundancy, we query the labels of the samples with the highest overall ranking of class uncertainty, domain representativeness, and information diversity. In semi-supervised generalization, we design MixUp-based intra- and inter-domain knowledge augmentation to expand domain knowledge and generalize domain invariance. We unify active exploration and semi-supervised generalization in a collaborative way and promote mutual enhancement between them, boosting model generalization with limited annotation. Extensive experiments show that CEG yields superior generalization performance. In particular, CEG can use only a 5% data annotation budget to achieve results competitive with previous DG methods trained on fully labeled data on the PACS dataset.
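The active-exploration query can be sketched as ranking unlabeled samples by a combined score; below, predictive entropy stands in for class uncertainty and distance to already-selected samples stands in for diversity, while domain representativeness is omitted, so this is only a simplified illustration of the selection idea.

    # Sketch of combining uncertainty and diversity scores to pick annotation queries; illustrative only.
    import torch
    import torch.nn.functional as F

    def query_indices(probs, features, budget, uncertainty_weight=1.0, diversity_weight=1.0):
        """probs: (N, C) class probabilities; features: (N, d) embeddings."""
        uncertainty = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # predictive entropy
        feats = F.normalize(features, dim=1)
        selected = []
        for _ in range(budget):
            if selected:
                # Diversity: distance to the closest already-selected sample.
                sim = feats @ feats[selected].t()
                diversity = 1 - sim.max(dim=1).values
            else:
                diversity = torch.ones(len(feats))
            score = uncertainty_weight * uncertainty + diversity_weight * diversity
            score[selected] = -float("inf")                               # never pick twice
            selected.append(int(score.argmax()))
        return selected

    probs = torch.softmax(torch.randn(200, 7), dim=1)
    features = torch.randn(200, 128)
    print(query_indices(probs, features, budget=10))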

Progressive Unsupervised Learning of Local Descriptors

  • Wufan Wang
  • Lei Zhang
  • Hua Huang

Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.

Graph Reasoning Transformer for Image Parsing

  • Dong Zhang
  • Jinhui Tang
  • Kwang-Ting Cheng

Capturing the long-range dependencies has empirically proven to be effective on a wide range of computer vision tasks. The progressive advances on this topic have been made through the employment of the transformer framework with the help of the multi-head attention mechanism. However, the attention-based image patch interaction potentially suffers from problems of redundant interactions of intra-class patches and unoriented interactions of inter-class patches. In this paper, we propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern. Specifically, the linearly embedded image patches are first projected into the graph space, where each node represents the implicit visual center for a cluster of image patches and each edge reflects the relation weight between two adjacent nodes. After that, global relation reasoning is performed on this graph accordingly. Finally, all nodes including the relation information are mapped back into the original space for subsequent processes. Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern. Experiments are carried out on the challenging Cityscapes and ADE20K datasets. Results show that GReaT achieves consistent performance gains with slight computational overheads on the state-of-the-art transformer baselines.
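The project-reason-reproject pattern behind GReaT can be sketched with a soft assignment of patch tokens to a few graph nodes, one step of message passing, and a mapping back to patches; the assignment, adjacency, and single linear reasoning step below are illustrative assumptions, not the paper's exact modules.

    # Sketch of projecting patch tokens to graph nodes, reasoning, and re-projecting; illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphReasoning(nn.Module):
        def __init__(self, dim=256, num_nodes=16):
            super().__init__()
            self.assign = nn.Linear(dim, num_nodes)   # soft assignment of patches to visual centers
            self.reason = nn.Linear(dim, dim)         # node-feature transform for message passing

        def forward(self, tokens):
            # tokens: (B, N, dim) patch embeddings
            a = F.softmax(self.assign(tokens), dim=-1)              # (B, N, K) assignment weights
            nodes = torch.einsum("bnk,bnd->bkd", a, tokens)          # project: cluster patches into nodes
            adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)   # relation weights between nodes
            nodes = F.relu(self.reason(adj @ nodes))                 # one step of relation reasoning
            return tokens + torch.einsum("bnk,bkd->bnd", a, nodes)   # re-project back to patches

    x = torch.randn(2, 196, 256)
    print(GraphReasoning()(x).shape)    # torch.Size([2, 196, 256])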

Opportunistic Backdoor Attacks: Exploring Human-imperceptible Vulnerabilities on Speech Recognition Systems

  • Qiang Liu
  • Tongqing Zhou
  • Zhiping Cai
  • Yonghao Tang

Speech recognition systems, trained and updated on large-scale audio data, are vulnerable to backdoor attacks that inject dedicated triggers during system training. The triggers used are generally human-inaudible audio, such as ultrasonic waves. However, we note that such a design is not feasible, as it can be easily filtered out via pre-processing. In this work, we propose the first audible backdoor attack paradigm for speech recognition, characterized by passive triggering and opportunistic invocation. Traditional device-synthesized triggers are replaced with ambient noise from daily scenarios. To adapt triggers to the application dynamics of speech interaction, we exploit the knowledge a trained model inherits from its context and accommodate the injection and poisoning with certainty-based trigger selection, performance-oblivious sample binding, and trigger late-augmentation. Experiments on two datasets under various environments evaluate the proposal's effectiveness in maintaining a high benign rate and achieving an outstanding attack success rate (99.27%, ~4% higher than BadNets), robustness (bounded infectious triggers), and feasibility in real-world scenarios. It requires less than 1% of the data to be poisoned and is demonstrated to resist typical speech enhancement techniques and general countermeasures (e.g., dedicated fine-tuning). The code and data will be made available at https://github.com/lqsunshine/DABA.

Certifying Better Robust Generalization for Unsupervised Domain Adaptation

  • Zhiqiang Gao
  • Shufei Zhang
  • Kaizhu Huang
  • Qiufeng Wang
  • Rui Zhang
  • Chaoliang Zhong

Recent studies explore how to obtain adversarial robustness for unsupervised domain adaptation (UDA). These efforts, however, are dedicated to achieving an optimal trade-off between accuracy and robustness on a given or seen target domain, and ignore the robust generalization issue on unseen adversarial data. Consequently, degraded performance is often observed when existing robust UDA methods are applied to future adversarial data. In this work, we make a first attempt to address the robust generalization issue of UDA. We conjecture that the poor robust generalization of present robust UDA methods may be caused by the large distribution gap among adversarial examples. We then provide an empirical and theoretical analysis showing that this large distribution gap is mainly owing to the discrepancy between feature-shift distributions. To reduce this discrepancy, a novel Anchored Feature-Shift Regularization (AFSR) method is designed with a certified robust generalization bound. We conduct a series of experiments on benchmark UDA datasets. Experimental results validate the effectiveness of our proposed AFSR over many existing robust UDA methods.

Multimodal In-bed Pose and Shape Estimation under the Blankets

  • Yu Yin
  • Joseph P. Robinson
  • Yun Fu

Advancing technology to monitor our bodies and behavior while sleeping and resting is essential for healthcare. However, keen challenges arise from our tendency to rest under blankets. We present a multimodal approach to uncover subjects and view bodies at rest without the blankets obscuring the view. For this, we introduce a channel-based fusion scheme to effectively fuse different modalities in a way that best leverages the knowledge captured by the multimodal sensors, both visual and non-visual. The channel-based fusion scheme enhances the model's input flexibility at inference: anywhere from one to all of the modalities may be supplied at test time. Nonetheless, multimodal data or not, detecting humans at rest in bed is still a challenge due to the extreme occlusion when covered by a blanket. To mitigate the negative effects of blanket occlusion, we use an attention-based reconstruction module to explicitly reduce the uncertainty of occluded parts by generating the uncovered modalities, which further update the current estimation in a cyclic fashion. Extensive experiments validate the proposed model's superiority over others.

Progressive Limb-Aware Virtual Try-On

  • Xiaoyu Han
  • Shengping Zhang
  • Qinglin Liu
  • Zonglin Li
  • Chenyang Wang

Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively.

Text Style Transfer based on Multi-factor Disentanglement and Mixture

  • Anna Zhu
  • Zhanhui Yin
  • Brian Kenji Iwana
  • Xinyu Zhou
  • Shengwu Xiong

Text style transfer aims to transfer the reference style of one text image to another text image. Previous works have only been able to transfer the style to a binary text image. In this paper, we propose a framework to disentangle the text images into three factors: text content, font, and style features, and then remix the factors of different images to transfer a new style. Both the reference and input text images have no style restrictions. Adversarial training through multi-factor cross recognition is adopted in the network for better feature disentanglement and representation. To decompose the input text images into a disentangled representation with swappable factors, the network is trained using similarity mining within pairs of exemplars. To train our model, we synthesized a new dataset with various text styles in both English and Chinese. Several ablation studies and extensive experiments on our designed and public datasets demonstrate the effectiveness of our approach for text style transfer.

Cloud2Sketch: Augmenting Clouds with Imaginary Sketches

  • Zhaoyi Wan
  • Dejia Xu
  • Zhangyang Wang
  • Jian Wang
  • Jiebo Luo

Have you ever looked up at the sky and imagined what the clouds look like? In this work, we present an interesting task that augments clouds in the sky with imagined sketches. Different from generic image-to-sketch translation tasks, unique challenges are introduced: real-world clouds have different levels of similarity to something; sketch generation without sketch retrieval could lead to something unrecognizable; a retrieved sketch from some dataset cannot be directly used because of the mismatch of the shape; an optimal sketch imagination is subjective. We propose Cloud2Sketch, a novel self-supervised pipeline to tackle the aforementioned challenges. First, we pre-process cloud images with a cloud detector and a thresholding algorithm to obtain cloud contours. Then, cloud contours are passed through a retrieval module to retrieve sketches with similar geometrical shapes. Finally, we adopt a novel sketch translation model with built-in free-form deformation for aligning the sketches to cloud contours. To facilitate training, an icon-based sketch collection named Sketchy Zoo is proposed. Extensive experiments validate the effectiveness of our method both qualitatively and quantitatively.
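The contour pre-processing can be sketched with plain OpenCV thresholding; the fixed brightness threshold and largest-contour heuristic below are assumptions, and the learned cloud detector from the pipeline is omitted.

    # Sketch of extracting a cloud contour by thresholding; illustrative only (detector omitted).
    import cv2
    import numpy as np

    def cloud_contour(image_bgr):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        # Bright regions in the sky crop are assumed to be cloud; 200 is an illustrative threshold.
        _, mask = cv2.threshold(blurred, 200, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        return max(contours, key=cv2.contourArea)     # keep the dominant cloud shape

    img = np.zeros((240, 320, 3), dtype=np.uint8)
    cv2.circle(img, (160, 120), 60, (255, 255, 255), -1)  # toy "cloud"
    contour = cloud_contour(img)
    print(None if contour is None else contour.shape)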

CycleHand: Increasing 3D Pose Estimation Ability on In-the-wild Monocular Image through Cyclic Flow

  • Daiheng Gao
  • Xindi Zhang
  • Xingyu Chen
  • Andong Tan
  • Bang Zhang
  • Pan Pan
  • Ping Tan

Current methods for 3D hand pose estimation fail to generalize well to new in-the-wild scenarios due to varying camera viewpoints, self-occlusions, and complex environments. To address this problem, we propose CycleHand to improve the generalization ability of the model in a self-supervised manner. Our motivation is based on an observation: if one globally rotates the whole hand and then reversely rotates it back, the estimated 3D poses of the fingers should remain consistent before and after the rotation, because the wrist-relative hand pose stays unchanged under global 3D rotation. Hence, we propose arbitrary-rotation self-supervised consistency learning to improve the model's robustness to varying viewpoints. Another innovation of CycleHand is a high-fidelity texture map used to render photorealistic rotated hands with different lighting conditions, backgrounds, and skin tones, further enhancing the effectiveness of our self-supervised task. To reduce the potential negative effects brought by the domain shift of synthetic images, we use the idea of contrastive learning to learn a synthetic-real consistent feature extractor that extracts domain-irrelevant hand representations. Experiments show that CycleHand largely improves hand pose estimation performance in both canonical datasets and real-world applications.
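The rotation-consistency signal can be written down directly: estimate the pose from the original and the globally rotated views, undo the known rotation, and penalize the disagreement. The sketch below uses a stand-in estimator and a rotation about the z-axis purely for illustration; it is not the CycleHand training code.

    # Sketch of a global-rotation consistency loss on predicted 3D hand joints; illustrative only.
    import math
    import torch

    def rotation_z(angle):
        c, s = math.cos(angle), math.sin(angle)
        return torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    def consistency_loss(estimator, image, rotated_image, angle):
        """estimator maps an image to (21, 3) wrist-relative joint coordinates (stand-in here)."""
        joints = estimator(image)                      # pose estimated from the original view
        joints_rot = estimator(rotated_image)          # pose estimated from the globally rotated view
        joints_back = joints_rot @ rotation_z(-angle).t()   # undo the known global rotation
        return torch.mean((joints - joints_back) ** 2)      # both estimates should now agree

    # Toy check with a synthetic estimator that is perfectly rotation-equivariant.
    gt = torch.randn(21, 3)
    angle = 0.7
    estimator = lambda img: gt if img == "original" else gt @ rotation_z(angle).t()
    print(consistency_loss(estimator, "original", "rotated", angle))   # ~0 for a consistent model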

Defeating DeepFakes via Adversarial Visual Reconstruction

  • Ziwen He
  • Wei Wang
  • Weinan Guan
  • Jing Dong
  • Tieniu Tan

Existing DeepFake detection methods focus on passive detection, i.e., they detect fake face images by exploiting the artifacts produced during DeepFake manipulation. These detection-based methods have the limitation that they only work for ex-post forensics and cannot erase the negative influence of DeepFakes. In this work, we propose a proactive framework for combating DeepFakes before the data manipulation takes place. The key idea is to find a well-defined substitute latent representation with which to reconstruct the target facial data, so that the reconstructed face disables DeepFake generation. To this end, we invert face images into latent codes with a well-trained auto-encoder, and search for adversarial face embeddings in their neighborhood with gradient descent. Extensive experiments on three typical DeepFake manipulation methods, facial attribute editing, facial expression manipulation, and face swapping, demonstrate the effectiveness of our method in different settings.
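
A rough sketch of the latent-neighborhood search with gradient descent, under assumed `encoder`/`decoder`/`deepfake_model` interfaces and a simple surrogate objective (maximize the change a surrogate DeepFake generator produces on the protected face); the paper's actual objective and constraints may differ.

```python
import torch

def search_adversarial_latent(encoder, decoder, deepfake_model, face,
                              steps=100, lr=0.01, eps=0.05):
    """Search a small latent perturbation that disrupts a surrogate DeepFake model."""
    with torch.no_grad():
        z0 = encoder(face)                              # inverted latent code
    delta = torch.zeros_like(z0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        recon = decoder(z0 + delta)
        # Surrogate objective: push the DeepFake output on the protected face
        # far from the output it would produce on the original face.
        loss = -torch.mean((deepfake_model(recon) - deepfake_model(face)) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                     # stay in a small neighborhood
    return decoder(z0 + delta).detach()
```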

Content based User Preference Modeling in Music Generation

  • Xichu Ma
  • Yuchen Wang
  • Ye Wang

Automatic music generation (AMG) has been an emerging research topic in AI in recent years. However, generating user-preferred music remains an unsolved problem. To address this challenge, we propose a hierarchical convolutional recurrent neural network with self-attention (CRNN-SA) to extract user music preference (UMP) and map it into an embedding space where the common UMPs are in the center and uncommon UMPs are scattered towards the edge. We then propose an explainable music distance measure as a bridge between the UMP and AMG; this measure computes the distance between a seed song and the user's UMP. That distance is then employed to adjust the AMG's parameters which control the music generation process in an iterative manner, so that the generated song will be closer to the user's UMP in every iteration. Experiments demonstrate that the proposed UMP embedding model successfully captures individual UMPs and that our proposed system is capable of generating user-preferred songs.

CrossHuman: Learning Cross-guidance from Multi-frame Images for Human Reconstruction

  • Liliang Chen
  • Jiaqi Li
  • Han Huang
  • Yandong Guo

We propose CrossHuman, a novel method that learns cross-guidance from a parametric human model and multi-frame RGB images to achieve high-quality 3D human reconstruction. To recover geometry details and texture even in invisible regions, we design a reconstruction pipeline that combines tracking-based and tracking-free methods. Given a monocular RGB sequence, we track the parametric human model over the whole sequence, and the points (voxels) corresponding to the target frame are warped to reference frames by the parametric body motion. Guided by the geometry priors of the parametric body and spatially aligned features from the RGB sequence, a robust implicit surface is fused. Moreover, a multi-frame transformer (MFT) and a self-supervised warp refinement module are integrated into the framework to relax the requirements on the parametric body and to help deal with very loose clothing. Compared with previous works, our CrossHuman enables high-fidelity geometry details and texture in both visible and invisible regions and improves the accuracy of human reconstruction even under inaccurately estimated parametric human models. The experiments demonstrate that our method achieves state-of-the-art (SOTA) performance.

High-Quality 3D Face Reconstruction with Affine Convolutional Networks

  • Zhiqian Lin
  • Jiangke Lin
  • Lincheng Li
  • Yi Yuan
  • Zhengxia Zou

Recent works based on convolutional encoder-decoder architectures and 3DMM parameterization have shown great potential for canonical view reconstruction from a single input image. Conventional CNN architectures benefit from exploiting the spatial correspondence between input and output pixels. However, in 3D face reconstruction, the spatial misalignment between the input image (e.g. a face) and the canonical/UV output makes the feature encoding-decoding process quite challenging. In this paper, to tackle this problem, we propose a new network architecture, namely the Affine Convolution Network, which enables CNN-based approaches to handle spatially non-corresponding input and output images while maintaining high-fidelity output quality. In our method, an affine transformation matrix is learned from the affine convolution layer for each spatial location of the feature maps. In addition, we represent 3D human heads in UV space with multiple components, including diffuse maps for texture representation, position maps for geometry representation, and light maps for recovering more complex lighting conditions in the real world. All the components can be trained without any manual annotations. Our method is parametric-free and can generate high-quality UV maps at a resolution of 512 x 512 pixels, whereas previous approaches normally generate 256 x 256 pixels or smaller. Our code will be released upon acceptance of the paper.

xCloth: Extracting Template-free Textured 3D Clothes from a Monocular Image

  • Astitva Srivastava
  • Chandradeep Pokhariya
  • Sai Sagar Jinka
  • Avinash Sharma

Existing approaches for 3D garment reconstruction either assume a predefined template for the garment geometry (restricting them to fixed clothing styles) or yield vertex-colored meshes (lacking high-frequency textural details). Our novel framework co-learns geometric and semantic information of garment surface from the input monocular image for template-free textured 3D garment digitization. More specifically, we propose to extend PeeledHuman representation to predict the pixel-aligned, layered depth and semantic maps to extract 3D garments. The layered representation is further exploited to UV parametrize the arbitrary surface of the extracted garment without any human intervention to form a UV atlas. The texture is then imparted on the UV atlas in a hybrid fashion by first projecting pixels from the input image to UV space for the visible region, followed by inpainting the occluded regions. Thus, we are able to digitize arbitrarily loose clothing styles while retaining high-frequency textural details from a monocular image. We achieve high-fidelity 3D garment reconstruction results on three publicly available datasets and generalization on internet images.

SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute

  • Kangneng Zhou
  • Xiaobin Zhu
  • Daiheng Gao
  • Kai Lee
  • Xinjie Li
  • Xu-cheng Yin

Manipulating latent codes in generative adversarial networks (GANs) for facial image synthesis mainly focuses on continuous attribute synthesis (e.g., age, pose and emotion), while discrete attribute synthesis (such as face masks and eyeglasses) receives less attention. Directly applying existing works to facial discrete attributes may cause inaccurate results. In this work, we propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposition, dubbed SD-GAN. To be concrete, we explicitly decompose the discrete attribute representation into two components, i.e. the semantic prior basis and the offset latent representation. The semantic prior basis provides an initializing direction for manipulating the face representation in the latent space. The offset latent representation, obtained by a 3D-aware semantic fusion network, adjusts the prior basis. In addition, the fusion network integrates 3D embeddings for better identity preservation and discrete attribute synthesis. The combination of the prior basis and the offset latent representation enables our method to synthesize photo-realistic face images with discrete attributes. Notably, we construct a large and valuable dataset, MEGN (Face Mask and Eyeglasses images crawled from Google and Naver), to address the lack of discrete attributes in existing datasets. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at an anonymous website: https://github.com/MontaEllis/SD-GAN.

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

  • Rongjie Huang
  • Chenye Cui
  • Feiyang Chen
  • Yi Ren
  • Jinglin Liu
  • Zhou Zhao
  • Baoxing Huai
  • Zhefeng Wang

Deep generative models have achieved significant progress in speech synthesis to date, while high-fidelity singing voice synthesis remains an open problem because of its long continuous pronunciation, rich high-frequency parts, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot be directly applied to singing voice synthesis because they result in glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and promote high-frequency reconstruction; and 3) to improve training efficiency, SingGAN includes auxiliary spectrogram losses and a sub-band feature matching penalty loss. To the best of our knowledge, SingGAN is the first work designed toward high-fidelity singing voice vocoding. Our evaluation of SingGAN demonstrates state-of-the-art results with higher-quality samples (MOS 4.05). Also, SingGAN enables a sampling speed 50x faster than real time on a single NVIDIA 2080Ti GPU. We further show that SingGAN generalizes well to the mel-spectrogram inversion of unseen singers, and that the end-to-end singing voice synthesis system SingGAN-SVS adopts a two-stage pipeline to transform music scores into expressive singing voices.

Design What You Desire: Icon Generation from Orthogonal Application and Theme Labels

  • Yinpeng Chen
  • Zhiyu Pan
  • Min Shi
  • Hao Lu
  • Zhiguo Cao
  • Weicai Zhong

Generative adversarial networks (GANs) have been trained to be professional artists capable of creating stunning artworks in tasks such as face generation and image style transfer. In this paper, we focus on a realistic business scenario: the automated generation of customizable icons given desired mobile applications and theme styles. We first introduce a theme-application icon dataset, termed AppIcon, where each icon has two orthogonal labels, theme and app. By investigating a strong StyleGAN2 baseline, we observe mode collapse caused by the entanglement of the orthogonal labels. To solve this challenge, we propose IconGAN, composed of a conditional generator and dual discriminators with orthogonal augmentations, and further design a contrastive feature disentanglement strategy to regularize the feature space of the two discriminators. Compared with other approaches, IconGAN shows a clear advantage on the AppIcon benchmark. Further analysis also justifies the effectiveness of disentangling app and theme representations. Our project will be released at: https://github.com/architect-road/IconGAN.

Semantically-Consistent Dynamic Blurry Image Generation for Image Deblurring

  • Zhaohui Jing
  • Youjian Zhang
  • Chaoyue Wang
  • Daqing Liu
  • Yong Xia

The training of deep learning-based image deblurring models heavily relies on paired sharp/blurry image datasets. Although many works have verified that synthesized blurry-sharp pairs help improve deblurring performance, how to synthesize realistic and diverse dynamic blurry images remains an open problem. Instead of directly synthesizing blurry images, in this paper, we propose a novel method to generate semantic-aware dense dynamic motion and employ the generated motion to synthesize blurry images. Specifically, for each sharp image, both the global motion (camera shake) and local motion (object movement) are considered, given the depth information as the condition. Then, a blur creation module takes the spatially-variant motion information and the sharp image as input to synthesize a motion-blurred image. A relativistic GAN loss is employed to ensure the synthesized blurry image is as realistic as possible. Experiments show that our method can generate diverse dynamic motion and visually realistic blurry images. Moreover, the generated image pairs further improve the quantitative performance and generalization ability of existing deblurring methods on several test sets.

RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization

  • Xintao Wang
  • Chao Dong
  • Ying Shan

This paper explores training efficient VGG-style super-resolution (SR) networks with the structural re-parameterization technique. The general re-parameterization pipeline is to first train networks with multi-branch topology and then merge them into standard 3x3 convolutions for efficient inference. In this work, we revisit those primary designs and investigate the essential components for re-parameterizing SR networks. First of all, we find that batch normalization (BN) is important for bringing training non-linearity and improving the final performance. However, BN is typically ignored in SR, as it usually degrades performance and introduces unpleasant artifacts. We carefully analyze the cause of the BN issue and then propose a straightforward yet effective solution. In particular, we first train SR networks with mini-batch statistics as usual, and then switch to using population statistics in the later training period. Having successfully re-introduced BN into SR, we further design a new re-parameterizable block tailored for SR, namely RepSR. It consists of a clean residual path and two expand-and-squeeze convolution paths with the modified BN. Extensive experiments demonstrate that our simple RepSR achieves superior performance to previous SR re-parameterization methods across different model sizes. In addition, RepSR achieves a better trade-off between performance and actual running time (throughput) than previous SR methods. Codes are available at https://github.com/TencentARC/RepSR.
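
One simple way to realize the late switch from mini-batch to population statistics, sketched below as an illustration of the idea rather than the authors' exact recipe, is to put BatchNorm layers into eval mode partway through training while leaving their affine parameters trainable.

```python
import torch.nn as nn

def switch_bn_to_population_stats(model: nn.Module) -> nn.Module:
    """Make BatchNorm layers normalize with running (population) statistics.

    In eval mode BatchNorm uses running_mean / running_var in the forward pass,
    while its affine weight and bias still receive gradients.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
    return model

# Usage inside a training loop (schematic):
# for epoch in range(num_epochs):
#     model.train()
#     if epoch >= switch_epoch:            # later training period
#         switch_bn_to_population_stats(model)
#     ...                                  # forward / backward / optimizer step
```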

Rotation Invariant Transformer for Recognizing Object in UAVs

  • Shuoyi Chen
  • Mang Ye
  • Bo Du

Recognizing a target of interest from UAVs is much more challenging than existing object re-identification tasks across multiple city cameras. Images taken by UAVs usually suffer from significant size differences when generating object bounding boxes and from uncertain rotation variations. Existing methods are usually designed for city cameras and are incapable of handling the rotation issue in UAV scenarios. A straightforward solution is to perform image-level rotation augmentation, but it causes a loss of useful information when the image is fed into the vision transformer as patches. This motivates us to simulate the rotation operation at the patch feature level, proposing a novel rotation invariant vision transformer (RotTrans). This strategy builds on high-level features with the help of the specificity of the vision transformer structure, which enhances robustness against large rotation differences. In addition, we design an invariance constraint to establish the relationship between the original features and the rotated features, achieving stronger rotation invariance. Our proposed transformer, tested on the latest UAV datasets, greatly outperforms the current state of the art, exceeding the previous best mAP and Rank-1 by 5.9% and 4.8%, respectively. Notably, our model also performs competitively on the person re-identification task with traditional city cameras. In particular, our solution won first place in the UAV-based person re-identification track of the Multi-Modal Video Reasoning and Analyzing Competition held at ICCV 2021. Code is available at https://github.com/whucsy/RotTrans.

Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning

  • Feifei Shao
  • Yawei Luo
  • Ping Liu
  • Jie Chen
  • Yi Yang
  • Yulei Lu
  • Jun Xiao

The expensive annotation cost is a notorious constraint on the development of point cloud semantic segmentation techniques. Active learning methods endeavor to reduce this cost by selecting and labeling only a subset of the point clouds, yet previous attempts ignore the spatial-structural diversity of the selected samples, inducing the model to select clustered candidates with similar shapes in a local area while missing other representative ones in the global environment. In this paper, we propose a new 3D region-based active learning method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and incrementally selects the most informative and representative ones for label acquisition. We achieve the selection mechanism via a graph reasoning network that considers both the spatial and structural diversities of superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise-aware iterative labeling strategy to confront the "noisy annotation" problem introduced by the previous "dominant labeling" strategy for superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. In particular, SSDR-AL significantly outperforms the baseline method, reducing the annotation cost by up to 63.0% and 24.0%, respectively, when achieving 90% of the performance of fully supervised learning. Code is available at https://github.com/shaofeifei11/SSDR-AL.

Free-Lunch for Cross-Domain Few-Shot Learning: Style-Aware Episodic Training with Robust Contrastive Learning

  • Ji Zhang
  • Jingkuan Song
  • Lianli Gao
  • Hengtao Shen

Cross-Domain Few-Shot Learning (CDFSL) aims to train an adaptable model that can learn out-of-domain classes from a handful of samples. Compared with the well-studied few-shot learning problem, the difficulty of CDFSL lies in the fact that the available training data from test tasks is not only extremely limited but also presents severe class differences from the training tasks. To tackle this challenge, we propose Style-aware Episodic Training with Robust Contrastive Learning (SET-RCL), which is motivated by the key observation that a remarkable style shift between tasks from source and target domains plays a negative role in cross-domain generalization. SET-RCL addresses the style shift from two perspectives: 1) simulating the style distributions of unknown target domains (data perspective); and 2) learning a style-invariant representation (model perspective). Specifically, Style-aware Episodic Training (SET) focuses on manipulating the style distribution of training tasks in the source domain, such that the learned model can achieve better adaptation to test tasks with domain-specific styles. To further improve cross-domain generalization under style shift, we develop Robust Contrastive Learning (RCL) to capture style-invariant and discriminative representations from the manipulated tasks. Notably, our SET-RCL is orthogonal to existing FSL approaches and can thus be adopted as a "free lunch" for boosting their CDFSL performance. Extensive experiments on nine benchmark datasets and six baseline methods demonstrate the effectiveness of our method.

ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech

  • Rongjie Huang
  • Zhou Zhao
  • Huadai Liu
  • Jinglin Liu
  • Chenye Cui
  • Yi Ren

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process hinders their deployment for text-to-speech. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid the distinct quality degradation that otherwise occurs when accelerating sampling. To tackle the model convergence challenge with fewer diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting.
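
A toy sketch of the clean-data (x0) prediction parameterization mentioned above, with an assumed `denoiser(x_t, t)` interface and a standard noise schedule; in ProDiff-style distillation the regression target would come from an N-step DDIM teacher rather than the ground-truth spectrogram.

```python
import torch

def x0_prediction_loss(denoiser, x0, t, alphas_cumprod):
    """Objective for a diffusion model that predicts clean data directly.

    `x0` is a batch of clean targets (e.g. mel-spectrograms), `t` a batch of
    timestep indices, and `alphas_cumprod` the cumulative noise schedule.
    """
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion
    x0_pred = denoiser(x_t, t)                               # predict x0, not noise
    return torch.mean((x0_pred - x0) ** 2)
```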

Joint Learning Content and Degradation Aware Feature for Blind Super-Resolution

  • Yifeng Zhou
  • Chuming Lin
  • Donghao Luo
  • Yong Liu
  • Ying Tai
  • Chengjie Wang
  • Mingang Chen

To achieve promising results on blind image super-resolution (SR), some attempts leverage the low resolution (LR) images to predict the kernel and improve SR performance. However, these Supervised Kernel Prediction (SKP) methods are impractical because real-world blur kernels are unavailable. Although some Unsupervised Degradation Prediction (UDP) methods have been proposed to bypass this problem, the inconsistency between the degradation embedding and the SR feature is still challenging. By exploring the correlations between the degradation embedding and the SR feature, we observe that jointly learning a content- and degradation-aware feature is optimal. Based on this observation, a Content and Degradation aware SR Network, dubbed CDSR, is proposed. Specifically, CDSR contains three newly-established modules: (1) a Lightweight Patch-based Encoder (LPE) is applied to jointly extract content and degradation features; (2) a Domain Query Attention based module (DQA) is employed to adaptively reduce the inconsistency; and (3) a Codebook-based Space Compress module (CSC) is used to suppress redundant information. Extensive experiments on several benchmarks demonstrate that the proposed CDSR outperforms existing UDP models and achieves competitive PSNR and SSIM performance even compared with state-of-the-art SKP methods.

Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation

  • Wenjing Wang
  • Zhengbo Xu
  • Haofeng Huang
  • Jiaying Liu

Low-light conditions not only degrade the human visual experience, but also reduce the performance of downstream machine analytics. Although many works have been designed for low-light enhancement or domain-adaptive machine analytics, the former pays little attention to high-level vision, while the latter neglects the potential of image-level signal adjustment. How to restore underexposed images/videos from the perspective of machine vision has long been overlooked. In this paper, we are the first to propose a learnable illumination enhancement model for high-level vision. Inspired by real camera response functions, we assume that the illumination enhancement function should be a concave curve, and propose to satisfy this concavity through a discrete integral. With the intention of adapting illumination from the perspective of machine vision without task-specific annotated data, we design an asymmetric cross-domain self-supervised training strategy. Our model architecture and training designs mutually benefit each other, forming a powerful unsupervised normal-to-low light adaptation framework. Comprehensive experiments demonstrate that our method surpasses existing low-light enhancement and adaptation methods and shows superior generalization on various low-light vision tasks, including classification, detection, action recognition, and optical flow estimation. All of our data, code, and results will be available online upon publication of the paper.
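
One way to guarantee a concave, monotone enhancement curve through a discrete integral, sketched under an assumed parameterization that may differ from the paper's exact construction: make the per-bin increments positive and non-increasing, then take their running sum.

```python
import torch
import torch.nn.functional as F

def concave_curve(theta: torch.Tensor) -> torch.Tensor:
    """Build a concave, non-decreasing curve on [0, 1] by discrete integration.

    `theta` (shape: [num_bins]) are free parameters. Softplus makes the raw
    increments positive; a reversed cumulative sum makes them non-increasing,
    so their running sum (the discrete integral) is concave and monotone.
    """
    raw = F.softplus(theta)                                           # positive increments
    slopes = torch.flip(torch.cumsum(torch.flip(raw, dims=[0]), dim=0), dims=[0])
    curve = torch.cumsum(slopes, dim=0)                               # discrete integral
    return curve / curve[-1]                                          # end the curve at 1

# Applying the curve to image intensities in [0, 1] (schematic):
# num_bins = theta.numel()
# idx = (intensity * (num_bins - 1)).long()
# enhanced = concave_curve(theta)[idx]
```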

Photorealistic Style Transfer via Adaptive Filtering and Channel Separation

  • Hong Ding
  • Fei Luo
  • Caoqing Jiang
  • Gang Fu
  • Zipei Chen
  • Shenghong Hu
  • Chunxia Xiao

The problem of color and texture distortion remains unsolved in the photorealistic style transfer task. It is mainly caused by interference between color and texture during transfer. To address this problem, we propose an end-to-end network based on adaptive filtering and channel separation. Given a pair of content and reference images, we first decompose them into two structure layers through an adaptive weighted least squares filter (AWLSF), which better perceives the color structure and illumination. Then, we carry out RGB transfer in a channel-separated way on the two generated structure layers. To deal with texture in a relatively independent manner, we use a module and a subtraction operation to obtain more complete and clear content features. Finally, we merge the color structure and texture detail into the final result. We conduct solid quantitative experiments on four metrics, NIQE, AG, SSIM, and PSNR, and conduct a user study. The experimental results demonstrate that our method produces better results than previous state-of-the-art methods, validating its effectiveness and superiority.

Recurrent Meta-Learning against Generalized Cold-start Problem in CTR Prediction

  • Junyu Chen
  • Qianqian Xu
  • Zhiyong Yang
  • Ke Ma
  • Xiaochun Cao
  • Qingming Huang

Over the last decade, great success has been achieved in accurate Click-Through-Rate (CTR) prediction models for online advertising. However, the cold-start problem, which refers to the issue that standard models can hardly draw accurate inferences for unseen users/ads, has yet to be fully addressed. Recently, some studies have been proposed to tackle this problem, but they consider only new users/ads. We argue that such new users/ads are not the only source of cold-start. From another perspective, since users may shift their interests over time, one's recent behaviors might vary greatly from records made long ago. In this sense, we believe that the cold-start problem also exists along the temporal dimension. Motivated by this, we provide a generalized definition of the cold-start problem in which both new users/ads and recent behavioral data from known users are considered. To address this problem, we propose a recurrent meta-learning model with the user's behavior sequence prediction as a separate training task. Specifically, a time-series CTR model with a MAML (Model-Agnostic Meta-Learning)-like meta-learning method is proposed to make our model adapt to new tasks rapidly. In addition, we propose a parallel structure for extracting feature interactions to efficiently fuse attention mechanisms and the RNN layer. Finally, experiments on three public datasets demonstrate the effectiveness of the proposed approaches.
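
A toy sketch of a MAML-like inner/outer update on a linear CTR scorer (hypothetical shapes and task interface, not the paper's full model): the inner step adapts to a task's support set, and the outer loss on the query set is back-propagated to the shared initialization.

```python
import torch

def maml_step(w, support, query, inner_lr=0.1):
    """One MAML-style adaptation step for a toy linear CTR scorer.

    `w` is a (d, 1) parameter tensor with requires_grad=True; `support` and
    `query` are (features, clicks) pairs for one task, with features of shape
    (n, d) and clicks of shape (n, 1) in {0, 1}.
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    xs, ys = support
    inner_loss = bce(xs @ w, ys)
    (grad,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_fast = w - inner_lr * grad              # task-adapted parameters
    xq, yq = query
    return bce(xq @ w_fast, yq)               # outer (meta) loss

# Meta-training (schematic):
# w = torch.zeros(d, 1, requires_grad=True)
# meta_opt = torch.optim.Adam([w], lr=1e-3)
# for task in tasks:
#     loss = maml_step(w, task.support, task.query)
#     meta_opt.zero_grad(); loss.backward(); meta_opt.step()
```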

Learning Projection Views for Sparse-View CT Reconstruction

  • Liutao Yang
  • Rongjun Ge
  • Shichang Feng
  • Daoqiang Zhang

Sparse-View CT (SVCT), which provides low-dose and high-speed CT imaging, plays an important role in the medical imaging area. As the number of projection views decreases, the reconstructed image suffers from severe artifacts. To this end, recent works utilize deep learning methods to improve the imaging quality of SVCT and achieve promising performance. However, these methods mainly focus on network design and modeling but overlook the importance of choosing projection views. To address this issue, this paper proposes a Projection-view LeArning Network (PLANet), which can estimate the importance of different view angles through reconstruction network training and select the projection views for high-quality image restoration. Specifically, we generate synthesized sparse-view sinograms by subsampling projections from full-view sinograms based on a learnable distribution, which can be learned through reconstruction network training. Thus, important projection views can be selected to acquire sparse-view projections on imaging equipment. Furthermore, effective data augmentation is provided by the online generation of sparse-view sinograms to improve the stability and performance of reconstruction networks. In short, our method can select the important projection views and learn high-performance reconstruction networks in one unified deep-learning framework. Comprehensive experiments show that the proposed method achieves promising results compared with state-of-the-art methods, and the ablation studies also show the superiority of our proposed PLANet in terms of effectiveness and robustness.

Unsupervised Textured Terrain Generation via Differentiable Rendering

  • Peichi Zhou
  • Dingbo Lu
  • Chen Li
  • Jian Zhang
  • Long Liu
  • Changbo Wang

Constructing large-scale realistic terrains using modern modeling tools is an extremely challenging task even for professional users, which limits applications such as video games and virtual reality. In this paper, we present a step towards unsupervised and realistic modeling of textured terrains from DEM and satellite imagery, built upon two-stage illumination and texture optimization via differentiable rendering. First, a differentiable renderer for satellite imagery is established based on the Lambertian diffuse model, allowing inverse optimization of material and lighting parameters towards a specific objective. Second, the original illumination direction of the satellite imagery is recovered by reducing the difference between the shadow distribution generated by the renderer and that of the satellite image in YCrCb colour space, leveraging the abundant geometric information of the DEM. Third, we propose to generate the original texture of the shadowed regions by introducing visual consistency and smoothness constraints via differentiable rendering, arriving at an end-to-end unsupervised architecture. Comprehensive experiments demonstrate the effectiveness and efficiency of our proposed method as a potential tool for virtual terrain modeling in widespread graphics applications.

MegaPortraits: One-shot Megapixel Neural Head Avatars

  • Nikita Drobyshev
  • Jenya Chelishev
  • Taras Khakhulin
  • Aleksei Ivakhnenko
  • Victor Lempitsky
  • Egor Zakharov

In this work, we advance neural head avatar technology to megapixel resolution while focusing on the particularly challenging task of cross-driving synthesis, i.e., when the appearance of the driving image is substantially different from the animated source image. We propose a set of new neural architectures and training methods that can leverage both medium-resolution video data and high-resolution image data to achieve the desired levels of rendered image quality and generalization to novel views and motion. We demonstrate that the suggested architectures and methods produce convincing high-resolution neural avatars, outperforming the competitors in the cross-driving scenario. Lastly, we show how a trained high-resolution neural avatar model can be distilled into a lightweight student model which runs in real time and locks the identities of neural avatars to several dozen pre-defined source images. Real-time operation and identity lock are essential for many practical applications of head avatar systems.

Event-guided Video Clip Generation from Blurry Images

  • Xin Ding
  • Tsuyoshi Takatani
  • Zhongyuan Wang
  • Ying Fu
  • Yinqiang Zheng

Dynamic and active pixel vision sensors (DAVIS) can simultaneously produce streams of asynchronous events captured by the dynamic vision sensor (DVS) and intensity frames from the active pixel sensor (APS). Event sequences show high temporal resolution and high dynamic range, while intensity images easily suffer from motion blur due to the low frame rate of APS. In this paper, we present an end-to-end convolutional neural network based method under the local and global constraints of events to restore clear, sharp intensity frames through collaborative learning from a blurry image and its associated event streams. Specifically, we first learn a function of the relationship between the sharp intensity frame and the corresponding blurry image with its event data. Then we propose a generation module to realize it with a supervision module to constrain the restoration in the motion process. We also capture the first realistic dataset with paired blurry frame/events and sharp frames by synchronizing a DAVIS camera and a high-speed camera. Experimental results show that our method can reconstruct high-quality sharp video clips, and outperform the state-of-the-art on both simulated and real-world data.

Consistency-Contrast Learning for Conceptual Coding

  • Jianhui Chang
  • Jian Zhang
  • Youmin Xu
  • Jiguo Li
  • Siwei Ma
  • Wen Gao

As an emerging compression scheme, conceptual coding usually encodes images into structural and textural representations and decodes them in a deep synthesis fashion. However, existing conceptual coding schemes ignore the structure of the deep texture representation space, making it challenging to establish efficient and faithful conceptual representations. In this paper, we first introduce contrastive learning into conceptual coding and propose Consistency-Contrast Learning (CCL), which optimizes the representation space with a consistency-contrast regularization. By modeling original images and their reconstructions as "positive" pairs and random images in a batch as "negative" samples, CCL aims to align the texture representation space with the source image space in a relative manner. Extensive experiments on diverse datasets demonstrate that: (1) the proposed CCL achieves the best compression performance on the conceptual coding task; (2) CCL is superior to other popular regularization methods at improving reconstruction quality; (3) CCL is general and can be applied to other tasks related to representation optimization and image reconstruction, such as GAN inversion.
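
An InfoNCE-style sketch of the consistency-contrast idea, treating each original/reconstruction feature pair as a positive and the rest of the batch as negatives; the exact regularizer used in CCL may differ.

```python
import torch
import torch.nn.functional as F

def consistency_contrast_loss(feat_orig, feat_recon, temperature=0.1):
    """Contrastive alignment of reconstructed-image features with original-image features.

    `feat_orig` and `feat_recon` are (B, D) texture features; row i of each
    matrix describes the same image, so diagonal pairs are positives and all
    other rows in the batch act as negatives.
    """
    z1 = F.normalize(feat_orig, dim=1)
    z2 = F.normalize(feat_recon, dim=1)
    logits = z2 @ z1.t() / temperature                 # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```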

Order-aware Human Interaction Manipulation

  • Mandi Luo
  • Jie Cao
  • Ran He

The majority of current techniques for pose transfer disregard the interactions between the transferred person and the surrounding instances, resulting in context inconsistency when applied to complicated situations. To tackle this issue, we propose InterOrderNet, a novel framework that performs order-aware interaction learning. The proposed InterOrderNet learns the relative order of instances along the z-axis to describe instance-level occlusions. Learning this order not only guarantees the context consistency of human pose transfer, but also enhances its generalization to natural scenes. Additionally, we present a novel unsupervised method, named Imitative Contrastive Learning, which sidesteps the requirement of order annotations. Existing pose transfer methods can easily be integrated into the proposed InterOrderNet. Extensive experiments demonstrate that InterOrderNet enables these methods to perform interaction manipulation.

Semi-supervised Video Shadow Detection via Image-assisted Pseudo-label Generation

  • Zipei Chen
  • Xiao Lu
  • Ling Zhang
  • Chunxia Xiao

Although learning-based methods have shown their potential for image shadow detection, video shadow detection remains a challenging problem, due to the absence of a large-scale, temporally consistent annotated video shadow detection dataset. To this end, we propose a semi-supervised video shadow detection method that seeks the assistance of existing labeled image datasets to generate pseudo-labels as additional supervision signals. Specifically, we first introduce a novel image-assisted video pseudo-label generator with a spatio-temporally aligned network (STANet), which generates high-quality and temporally consistent pseudo-labels. Then, with these pseudo-labels, we propose an uncertainty-guided semi-supervised learning strategy to reduce the impact of their noise. Moreover, we also design a memory-propagated long-term network (MPLNet), which produces video shadow detection results with long-term consistency in a lightweight way by using a memory mechanism. Extensive experiments on ViSha and our collected real-world video shadow detection dataset RVSD show that our approach not only achieves superior performance on the benchmark dataset but also generalizes well to more practical applications, which demonstrates the effectiveness of our method.

Towards Robust Video Object Segmentation with Adaptive Object Calibration

  • Xiaohao Xu
  • Jinglu Wang
  • Xiang Ming
  • Yan Lu

In the booming video era, video segmentation attracts increasing research attention in the multimedia community. Semi-supervised video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. Due to neglecting object-level cues, pixel-level approaches make the tracking vulnerable to perturbations, and even indiscriminate among similar objects. Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object to be expressive and discriminative. Accordingly, we propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness. First, we construct the object representations by applying an adaptive object proxy (AOP) aggregation method, where the proxies represent arbitrary-shaped segments at multi-levels for reference. Then, prototype masks are initially generated from the reference-target correlations based on AOP. Afterwards, such proto-masks are further calibrated through network modulation, conditioning on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve to be discriminative iteratively. Extensive experiments are conducted on the standard VOS benchmarks, YouTube-VOS-18/19 and DAVIS-17. Our model achieves the state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.

Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning

  • Chengming Xu
  • Chen Liu
  • Siqian Yang
  • Yabiao Wang
  • Shijie Zhang
  • Lijie Jia
  • Yanwei Fu

Positive-Unlabeled (PU) learning aims to learn a model from rare positive samples and abundant unlabeled samples. Compared with classical binary classification, PU learning is much more challenging due to the existence of many incompletely-annotated data instances. Since only part of the most confident positive samples are available and the evidence is not enough to categorize the remaining samples, many of these unlabeled data may also be positive samples. Research on this topic is particularly useful and essential for many real-world tasks that incur very expensive labeling costs. For example, recognition tasks in disease diagnosis, recommendation systems and satellite image recognition may only have a few positive samples that can be annotated by experts. While this problem is receiving increasing attention, most of the efforts have been dedicated to the design of trustworthy risk estimators such as uPU and nnPU and to direct knowledge distillation, e.g., Self-PU. These methods mostly omit the intrinsic hardness of some unlabeled data, which can result in sub-optimal performance as a consequence of fitting the easy noisy data and not sufficiently utilizing the hard data. In this paper, we focus on improving the commonly-used nnPU with a novel training pipeline. We highlight the intrinsic difference in hardness among samples in the dataset and the proper learning strategies for easy and hard data. Accordingly, we propose first splitting the unlabeled dataset with an early-stop strategy: samples that receive inconsistent predictions from the temporary and base models are considered hard samples. The model then utilizes a noise-tolerant Jensen-Shannon divergence loss for easy data, and a dual-source consistency regularization for hard data, which includes a cross-consistency between the student and base models for low-level features and a self-consistency for high-level features and predictions, respectively. Our method achieves much better results than existing methods on CIFAR10 and two medical datasets, liver cancer survival time prediction and low blood pressure diagnosis in pregnancy, respectively. The experimental results validate the efficacy of our proposed method.
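
A sketch of two ingredients described above, under assumed interfaces: splitting unlabeled samples by the agreement between an early-stopped temporary model and the base model, and a Jensen-Shannon divergence loss for the easy subset.

```python
import torch

def split_by_agreement(logits_temp, logits_base):
    """Mark samples whose temporary- and base-model predictions disagree as hard."""
    hard = logits_temp.argmax(dim=1) != logits_base.argmax(dim=1)
    return ~hard, hard                                  # boolean masks: easy, hard

def js_divergence_loss(logits, target_probs, eps=1e-8):
    """Noise-tolerant Jensen-Shannon divergence between prediction and (soft) label."""
    p = torch.softmax(logits, dim=1)
    q = target_probs
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(eps).log() - m.clamp_min(eps).log())).sum(dim=1)
    kl_qm = (q * (q.clamp_min(eps).log() - m.clamp_min(eps).log())).sum(dim=1)
    return (0.5 * kl_pm + 0.5 * kl_qm).mean()
```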

Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation

  • Jialei Xu
  • Xianming Liu
  • Yuanchao Bai
  • Junjun Jiang
  • Kaixuan Wang
  • Xiaozhi Chen
  • Xiangyang Ji

Depth map estimation from images is an important task in robotic systems. Existing methods can be categorized into two groups: multi-view stereo and monocular depth estimation. The former requires cameras to have large overlapping areas and a sufficient baseline between cameras, while the latter, which processes each image independently, can hardly guarantee structure consistency between cameras. In this paper, we propose a novel multi-camera collaborative depth prediction method that does not require large overlapping areas while maintaining structure consistency between cameras. Specifically, we formulate depth estimation as a weighted combination of depth basis, in which the weights are updated iteratively by a refinement network driven by the proposed consistency loss. During the iterative update, the results of depth estimation are compared across cameras and the information from overlapping areas is propagated to the whole depth maps with the help of the basis formulation. Experimental results on the DDAD and NuScenes datasets demonstrate the superior performance of our method.
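
A minimal sketch of the basis formulation, with hypothetical tensor shapes and a placeholder refinement network: the depth map is a softmax-weighted combination of basis maps whose weights are updated iteratively.

```python
import torch

def combine_depth_basis(basis, weights):
    """Depth as a weighted combination of per-pixel depth basis maps.

    `basis` has shape (B, K, H, W) and `weights` (B, K, H, W) or (B, K, 1, 1);
    a softmax over the K basis maps keeps the combination well behaved.
    """
    w = torch.softmax(weights, dim=1)
    return (w * basis).sum(dim=1, keepdim=True)         # (B, 1, H, W) depth map

# Iterative refinement (schematic; `refine_net` and the consistency-driven
# update are hypothetical placeholders):
# for _ in range(num_iters):
#     depth = combine_depth_basis(basis, weights)
#     weights = weights + refine_net(depth, images)
```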

Fast Hierarchical Deep Unfolding Network for Image Compressed Sensing

  • Wenxue Cui
  • Shaohui Liu
  • Debin Zhao

By integrating certain optimization solvers with deep neural networks, deep unfolding networks (DUNs) have attracted much attention in recent years for image compressed sensing (CS). However, several issues remain in existing DUNs: 1) for each iteration, a simple stacked convolutional network is usually adopted, which apparently limits the expressiveness of these models; 2) once training is completed, most hyperparameters of existing DUNs are fixed for any input content, which significantly weakens their adaptability. In this paper, by unfolding the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a novel fast hierarchical DUN, dubbed FHDUN, is proposed for image compressed sensing, in which a well-designed hierarchical unfolding architecture is developed to cooperatively explore richer contextual prior information in multi-scale spaces. To further enhance adaptability, a series of hyperparameter generation networks is developed in our framework to dynamically produce the corresponding optimal hyperparameters according to the input content. Furthermore, thanks to the accelerated policy in FISTA, the newly embedded acceleration module allows the proposed FHDUN to save more than 50% of the iterative loops compared with recent DUNs. Extensive CS experiments show that the proposed FHDUN outperforms existing state-of-the-art CS methods, while using fewer iterations.
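
For reference, the classical FISTA iteration that such networks unfold, sketched in NumPy with soft-thresholding standing in for the learned proximal mapping; in a DUN, this prox, the step sizes, and related hyperparameters become learnable, input-adaptive modules.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm (replaced by a learned network in DUNs)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def fista(A, y, lam=0.01, step=None, num_iters=100):
    """Classical FISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1.

    `A` is an (m, n) measurement matrix and `y` the compressed measurements.
    """
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant
    x_prev = np.zeros(A.shape[1])
    z = x_prev.copy()
    t_prev = 1.0
    for _ in range(num_iters):
        grad = A.T @ (A @ z - y)                        # gradient of the data term
        x = soft_threshold(z - step * grad, lam * step) # proximal gradient step
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        z = x + ((t_prev - 1.0) / t) * (x - x_prev)     # momentum / acceleration
        x_prev, t_prev = x, t
    return x_prev
```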

Restoration of User Videos Shared on Social Media

  • Hongming Luo
  • Fei Zhou
  • Kin-man Lam
  • Guoping Qiu

User videos shared on social media platforms usually suffer from degradations caused by unknown proprietary processing procedures, which means that their visual quality is poorer than that of the originals. This paper presents a new general video restoration framework for the restoration of user videos shared on social media platforms. In contrast to most deep learning-based video restoration methods that perform end-to-end mapping, where feature extraction is mostly treated as a black box, in the sense that what role a feature plays is often unknown, our new method, termed Video restOration through adapTive dEgradation Sensing (VOTES), introduces the concept of a degradation feature map (DFM) to explicitly guide the video restoration process. Specifically, for each video frame, we first adaptively estimate its DFM to extract features representing the difficulty of restoring its different regions. We then feed the DFM to a convolutional neural network (CNN) to compute hierarchical degradation features to modulate an end-to-end video restoration backbone network, such that more attention is paid explicitly to potentially more difficult to restore areas, which in turn leads to enhanced restoration performance. We will explain the design rationale of the VOTES framework and present extensive experimental results to show that the new VOTES method outperforms various state-of-the-art techniques both quantitatively and qualitatively. In addition, we contribute a large scale real-world database of user videos shared on different social media platforms. Codes and datasets are available at https://github.com/luohongming/VOTES.git

Real-time Streaming Video Denoising with Bidirectional Buffers

  • Chenyang Qi
  • Junming Chen
  • Xin Yang
  • Qifeng Chen

Video streams are delivered continuously to save the cost of storage and device memory. Real-time denoising algorithms are typically adopted on the user device to remove the noise introduced during the shooting and transmission of video streams. However, sliding-window-based methods feed multiple input frames for a single output and lack computational efficiency. Recent multi-output inference works propagate bidirectional temporal features with a parallel or recurrent framework, which either suffers from performance drops at the temporal edges of clips or cannot achieve online inference. In this paper, we propose a Bidirectional Streaming Video Denoising (BSVD) framework to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields. Bidirectional temporal fusion for online inference is considered inapplicable in MoViNet; however, we introduce a novel Bidirectional Buffer Block as the core module of our BSVD, which makes such fusion possible during our pipeline-style inference. In addition, our method is concise and flexible enough to be utilized in both non-blind and blind video denoising. We compare our model with various state-of-the-art video denoising models qualitatively and quantitatively on synthetic and real noise. Our method outperforms previous methods in terms of restoration fidelity and runtime.

Learning Hierarchical Dynamics with Spatial Adjacency for Image Enhancement

  • Yudong Liang
  • Bin Wang
  • Wenqi Ren
  • Jiaying Liu
  • Wenjian Wang
  • Wangmeng Zuo

In various real-world image enhancement applications, the degradations are always non-uniform or non-homogeneous and diverse, which challenges most deep networks with fixed parameters during the inference phase. Inspired by dynamic deep networks that adapt their model structures or parameters conditioned on the inputs, we propose a DCP-guided hierarchical dynamic mechanism for image enhancement to adapt the model parameters and features from local to global as well as to keep spatial adjacency within a region. Specifically, channel-spatial-level, structure-level, and region-level dynamic components are applied sequentially. Channel-spatial-level dynamics obtain channel- and spatial-wise representation variations, and structure-level dynamics enable modeling geometric transformations and augment sampling locations for the varying local features to better describe the structures. In addition, a novel region-level dynamic is proposed to generate spatially continuous masks for dynamic features, which capitalizes on the Dark Channel Prior (DCP). The proposed region-level dynamics benefit from exploiting the statistical differences between distorted and undistorted images. Moreover, the DCP-guided region generations are inherently spatially coherent, which facilitates capturing the local coherence of the images. The proposed method achieves state-of-the-art performance and generates visually pleasing images for multiple enhancement tasks, i.e., image dehazing, image deraining, and low-light image enhancement. The codes are available at https://github.com/DongLiangSXU/HDM.

Text's Armor: Optimized Local Adversarial Perturbation Against Scene Text Editing Attacks

  • Tao Xiang
  • Hangcheng Liu
  • Shangwei Guo
  • Hantao Liu
  • Tianwei Zhang

Deep neural networks (DNNs) have shown their powerful capability in scene text editing (STE). With carefully designed DNNs, one can alter texts in a source image with other ones while maintaining their realistic look. However, such editing tools provide great convenience for criminals to falsify documents or modify texts without authorization. In this paper, we propose to actively defeat text editing attacks by designing invisible "armors" for texts in the scene. We turn the adversarial vulnerability of DNN-based STE into a strength and design local perturbations (i.e., "armors") specifically for texts using an optimized normalization strategy. Such local perturbations can effectively mislead STE attacks without affecting the perceptibility of the scene background. To strengthen our defense capabilities, we systematically analyze and model STE attacks and provide a precise defense method to defeat attacks at different editing stages. We conduct both subjective and objective experiments to show the superiority of our optimized local adversarial perturbation against state-of-the-art STE attacks. We also evaluate the portrait and landscape transferability of our perturbations.

ChartStamp: Robust Chart Embedding for Real-World Applications

  • Jiayun Fu
  • Bin B. Zhu
  • Haidong Zhang
  • Yayi Zou
  • Song Ge
  • Weiwei Cui
  • Yun Wang
  • Dongmei Zhang
  • Xiaojing Ma
  • Hai Jin

Deep learning-based image embedding methods are typically designed for natural images and may not work for chart images due to their homogeneous regions, which lack variations to hide data both robustly and imperceptibly. In this paper, we propose ChartStamp, the first chart embedding method that is robust to real-world printing and displaying (printed on paper and displayed on screen, respectively, and then captured with a camera) while maintaining a good perceptual quality. ChartStamp hides 100, 1,000, or 10,000 raw bits into a chart image, depending on the designated robustness to printing, displaying, or JPEG. To ensure perceptual quality, it introduces a new perceptual model to guide embedding to insensitive regions of a chart image and a smoothness loss to ensure smoothness of the embedding residual in homogeneous regions. ChartStamp applies a distortion layer approximating designated real-world manipulations to train a model robust to these manipulations. Our experimental evaluation indicates that ChartStamp achieves the robustness and embedding capacity on chart images similar to their state-of-the-art counterparts on natural images. Our user studies indicate that ChartStamp achieves better perceptual quality than existing robust chart embedding methods and that our perceptual model outperforms the existing perceptual model.

Few-shot Image Generation Using Discrete Content Representation

  • Yan Hong
  • Li Niu
  • Jianfu Zhang
  • Liqing Zhang

Few-shot image generation and few-shot image translation are two related tasks, both of which aim to generate new images for an unseen category with only a few images. In this work, we make the first attempt to adapt a few-shot image translation method to the few-shot image generation task. Few-shot image translation disentangles an image into a style vector and a content map. An unseen style vector can be combined with different seen content maps to produce different images. However, this requires storing seen images to provide content maps, and the unseen style vector may be incompatible with seen content maps. To adapt it to the few-shot image generation task, we learn a compact dictionary of local content vectors by quantizing continuous content maps into discrete content maps instead of storing seen images. Furthermore, we model the autoregressive distribution of the discrete content map conditioned on the style vector, which can alleviate the incompatibility between the content map and the style vector. Qualitative and quantitative results on three real datasets demonstrate that our model can produce images of higher diversity and fidelity for unseen categories than previous methods.

Marior: Margin Removal and Iterative Content Rectification for Document Dewarping in the Wild

  • Jiaxin Zhang
  • Canjie Luo
  • Lianwen Jin
  • Fengjun Guo
  • Kai Ding

Camera-captured document images usually suffer from perspective and geometric deformations. Rectifying them is of great value when considering poor visual aesthetics and the deteriorated performance of OCR systems. Recent learning-based methods focus intensively on accurately cropped document images. However, this might not be sufficient for overcoming practical challenges, which include document images with large marginal regions as well as document images without margins. Owing to this impracticality, users struggle to crop documents precisely when they encounter large marginal regions. Meanwhile, dewarping images without margins is still an insurmountable problem. To the best of our knowledge, there is still no complete and effective pipeline for rectifying document images in the wild. To address this issue, we propose a novel approach called Marior (Margin Removal and Iterative Content Rectification). Marior follows a progressive strategy to iteratively improve the dewarping quality and readability in a coarse-to-fine manner. Specifically, we divide the pipeline into two modules: a margin removal module (MRM) and an iterative content rectification module (ICRM). First, we predict the segmentation mask of the input image to remove the margin, thereby obtaining a preliminary result. Then we refine the image further by producing dense displacement flows to achieve content-aware rectification. We determine the number of refinement iterations adaptively. Experiments demonstrate the state-of-the-art performance of our method on public benchmarks. The resources are available at https://github.com/ZZZHANG-jx/Marior for further comparison.

Image Inpainting Detection via Enriched Attentive Pattern with Near Original Image Augmentation

  • Wenhan Yang
  • Rizhao Cai
  • Alex Kot

As deep learning-based inpainting methods achieve increasingly better results, their malicious use, e.g. removing objects to report fake news or to provide fake evidence, is becoming a threat. Previous works have provided rich discussions of network architectures, e.g. even performing Neural Architecture Search to obtain the optimal model architecture. However, there is room for improvement in other aspects. In our work, we make comprehensive efforts from the data and feature aspects. From the data aspect, as harder samples in the training data usually lead to stronger detection models, we propose near-original image augmentation, which pushes the inpainted images closer to the original ones (without distortion and inpainting) as the input images and is proved to improve detection accuracy. From the feature aspect, we propose to extract the attentive pattern. With the designed attentive pattern, the knowledge of different inpainting methods can be better exploited during the training phase. Finally, extensive experiments are conducted. In our evaluation, we consider scenarios where the inpainting masks used to generate the testing set have a distribution gap from the masks used to produce the training set. Thus, the comparisons are conducted on a newly proposed dataset, where the testing masks are inconsistent with the training ones. The experimental results show the superiority of the proposed method and the effectiveness of each component. All our code and data will be made available online.

Searching Lightweight Neural Network for Image Signal Processing

  • Haojia Lin
  • Lijiang Li
  • Xiawu Zheng
  • Fei Chao
  • Rongrong Ji

Recently, it has been shown that the traditional Image Signal Processing (ISP) pipeline can be replaced by deep neural networks due to their superior performance. However, most of these networks impose a heavy computation burden and thus are far from ready to be deployed on resource-limited platforms, including but not limited to mobile devices and FPGAs. To tackle this challenge, we propose an automated search framework that derives ISP models with high image quality while satisfying the low-computation requirement. To reduce the search cost, we adopt a weight-sharing strategy by introducing a supernet and decouple the architecture search into two stages: supernet training and hardware-aware evolutionary search. With the proposed framework, we can train the ISP model once and quickly find high-performance but low-computation models for multiple devices. Experiments demonstrate that the searched ISP models achieve an excellent trade-off between image quality and model complexity, i.e., compelling reconstruction quality with more than a 90% reduction in FLOPs compared to state-of-the-art networks.
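
A minimal sketch of the second search stage might look as follows, assuming `evaluate` (validation quality of a sub-network) and `flops` (its computational cost) are provided; this is an illustration of a FLOPs-constrained evolutionary search, not the paper's implementation.

```python
# Hedged sketch: evolutionary search over sub-network configurations sampled
# from a trained supernet, subject to a FLOPs budget.
import random

def evolutionary_search(choices_per_layer, evaluate, flops, budget,
                        pop_size=20, generations=10, mutate_p=0.2):
    def sample():
        return [random.choice(c) for c in choices_per_layer]
    def mutate(cfg):
        return [random.choice(c) if random.random() < mutate_p else x
                for x, c in zip(cfg, choices_per_layer)]

    # initial population of configurations that satisfy the budget
    pop = [cfg for cfg in (sample() for _ in range(pop_size * 5))
           if flops(cfg) <= budget][:pop_size]
    for _ in range(generations):
        pop.sort(key=evaluate, reverse=True)          # keep the fittest sub-networks
        parents = pop[: pop_size // 2]
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        pop = parents + [c for c in children if flops(c) <= budget]
    return max(pop, key=evaluate)
```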

Image Generation Network for Covert Transmission in Online Social Network

  • Zhengxin You
  • Qichao Ying
  • Sheng Li
  • Zhenxing Qian
  • Xinpeng Zhang

Online social networks have stimulated communication over the Internet more than ever, making it possible to transmit secret messages over such noisy channels. In this paper, we propose a Coverless Image Steganography Network, called CIS-Net, that synthesizes a high-quality image directly conditioned on the secret message to be transferred. CIS-Net is composed of four modules, namely the Generation, Adversarial, Extraction, and Noise modules. The receiver can extract the hidden message without any loss even if the images have been distorted by JPEG compression attacks. To disguise the behaviour of steganography, we collected images in the context of profile photos and stickers and trained our network accordingly. As such, the generated images are more likely to escape malicious detection and attack. The distinctions from previous image steganography methods are mainly the robustness and losslessness against diverse attacks. Experiments over diverse public datasets have demonstrated its superior ability to resist steganalysis.

Augmented Dual-Contrastive Aggregation Learning for Unsupervised Visible-Infrared Person Re-Identification

  • Bin Yang
  • Mang Ye
  • Jun Chen
  • Zesen Wu

Visible-infrared person re-identification (VI-ReID) aims at retrieving the corresponding infrared (visible) images from a gallery set captured by cameras of the other spectrum. Recent works mainly focus on supervised VI-ReID methods that require plenty of cross-modality (visible-infrared) identity labels, which are more expensive than the annotations in single-modality person ReID. For unsupervised visible-infrared re-identification (USL-VI-ReID), the large cross-modality discrepancies lead to difficulties in generating reliable cross-modality labels and learning modality-invariant features without any annotations. To address this problem, we propose a novel Augmented Dual-Contrastive Aggregation (ADCA) learning framework. Specifically, a dual-path contrastive learning framework with two modality-specific memories is proposed to learn intra-modality person representations. To associate positive cross-modality identities, we design a cross-modality memory aggregation module with count priority that selects highly associated positive samples and aggregates their corresponding memory features at the cluster level, ensuring that the optimization is explicitly concentrated on the modality-irrelevant perspective. Extensive experiments demonstrate that our proposed ADCA significantly outperforms existing unsupervised methods under various settings, and even surpasses some supervised counterparts, facilitating the real-world deployment of VI-ReID. Code is available at https://github.com/yangbincv/ADCA.

DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games

  • Nikhil Bansal
  • Kartik Gupta
  • Kiruthika Kannan
  • Sivani Pentapati
  • Ravi Kiran Sarvadevabhatla

Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared-goal cooperative game play in restricted communication settings. However, players occasionally draw atypical sketch content. While such content is sometimes relevant in the game context, at other times it represents a rule violation and impairs the game experience. To address such situations in a timely and scalable manner, we introduce DrawMon, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to collect game session data and annotate atypical sketch content, resulting in AtyPict, the first ever atypical sketch content dataset. We use AtyPict to train CanvasNet, a deep neural atypical content detection network. We utilize CanvasNet as a core component of DrawMon. Our analysis of post-deployment game session data indicates DrawMon's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards. Code and datasets are available at https://drawm0n.github.io.

Approximate Shifted Laplacian Reconstruction for Multiple Kernel Clustering

  • Jiali You
  • Zhenwen Ren
  • Quansen Sun
  • Yuan Sun
  • Xingfeng Li

Multiple kernel clustering (MKC) has demonstrated promising performance for handling non-linear data clustering. On the positive side, it can integrate complementary information from multiple base kernels and avoid kernel function selection. On the negative side, the main challenge is that the n x n kernel matrix leads to O(n^2) memory complexity and O(n^3) computational complexity. To mitigate this challenge, taking the graph Laplacian as the breakthrough point, this paper proposes a novel and simple MKC method, dubbed approximate shifted Laplacian reconstruction (ASLR). For each base kernel, we propose an r-rank shifted Laplacian reconstruction scheme that simultaneously considers the energy loss of Laplacian reconstruction and the preservation of clustering information in the Laplacian decomposition. Then, by analyzing the eigenvectors of the reconstructed Laplacian, we impose constraints that confine its solution within a Fantope. Accordingly, the byproduct (i.e., the most informative eigenvectors) contains the main clustering information, such that the clustering assignments can be obtained with a simple k-means algorithm. Owing to the Laplacian reconstruction scheme, the memory and computational complexity are reduced to O(n) and O(n^2), respectively. Experiments on eight challenging MKC benchmark datasets verify the effectiveness and efficiency of ASLR.
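
For intuition, the sketch below clusters data from the leading eigenvectors of a shifted normalized Laplacian built from a single base kernel; it is a simplified single-kernel illustration of the overall idea, not the ASLR algorithm itself.

```python
# Simplified single-kernel illustration: shifted normalized Laplacian -> top-r
# eigenvectors -> k-means on the spectral embedding.
import numpy as np
from sklearn.cluster import KMeans

def shifted_laplacian_clustering(K, n_clusters, r):
    """K: (n, n) base kernel matrix with positive entries; r: rank of the embedding."""
    d = K.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A = D_inv_sqrt @ K @ D_inv_sqrt               # normalized affinity
    L_shift = np.eye(len(K)) + A                  # shifted Laplacian, i.e. 2I - L_sym
    _, vecs = np.linalg.eigh(L_shift)
    U = vecs[:, -r:]                              # r most informative eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```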

Towards Continual Adaptation in Industrial Anomaly Detection

  • Wujin Li
  • Jiawei Zhan
  • Jinbao Wang
  • Bizhong Xia
  • Bin-Bin Gao
  • Jun Liu
  • Chengjie Wang
  • Feng Zheng

Anomaly detection (AD) has gained widespread attention due to its ability to identify defects in industrial scenarios using only normal samples. Although traditional AD methods achieve acceptable performance, they focus solely on the current set of examples, leading to catastrophic forgetting of previously learned tasks when trained on a new one. Given this lack of flexibility and the requirements of realistic industrial scenarios, it is urgent to enhance the continual adaptation ability of AD models. Therefore, this paper proposes a unified framework that incorporates continual learning (CL) to achieve our newly designed task of continual anomaly detection (CAD). Notably, we observe that a data augmentation strategy can adapt AD methods well to supervised CL (SCL) by constructing anomaly samples. Based on this, we propose a novel method named Distribution of Normal Embeddings (DNE), which utilizes the feature distribution of normal training samples from past tasks. It not only effectively alleviates catastrophic forgetting in CAD but can also be integrated with SCL methods to further improve their performance. Extensive experiments and visualization results on the popular benchmark dataset MVTec AD demonstrate the advanced performance and excellent continual adaptation ability of our proposed method compared to other AD methods. To the best of our knowledge, we are the first to introduce and tackle the task of CAD. We believe that the proposed task and benchmark will be beneficial to the field of AD. Our code is available in the supplementary material.
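
A minimal sketch of the core idea of keeping a distribution of normal embeddings per task is shown below, here summarized by a Gaussian (mean and covariance) and scored with the Mahalanobis distance; the exact statistics DNE stores may differ.

```python
# Minimal sketch: per-task memory of normal-embedding statistics and a
# Mahalanobis-distance anomaly score (an illustration, not DNE itself).
import numpy as np

class NormalEmbeddingMemory:
    def __init__(self):
        self.tasks = {}                          # task_id -> (mean, inverse covariance)

    def fit_task(self, task_id, embeddings):
        """embeddings: (N, D) features of normal training samples for one task."""
        mu = embeddings.mean(axis=0)
        cov = np.cov(embeddings, rowvar=False) + 1e-4 * np.eye(embeddings.shape[1])
        self.tasks[task_id] = (mu, np.linalg.inv(cov))

    def anomaly_score(self, task_id, z):
        """z: (D,) embedding of a test sample; larger score means more anomalous."""
        mu, cov_inv = self.tasks[task_id]
        diff = z - mu
        return float(diff @ cov_inv @ diff)
```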

Neural Network Model Protection with Piracy Identification and Tampering Localization Capability

  • Cheng Xiong
  • Guorui Feng
  • Xinran Li
  • Xinpeng Zhang
  • Chuan Qin

With the rapid development of neural networks, a vast number of neural network models have been developed in recent years, representing a substantial investment of manpower and hardware resources. However, the original models are at risk of being pirated by adversaries for illegal profit. On the other hand, malicious tampering with models, such as implanting vulnerabilities and backdoors, may cause catastrophic consequences. We propose a model hash generation method to protect neural network models. Specifically, our model hash sequence is composed of two parts: one is the model piracy identification hash, which is based on dynamic convolution and a dual-branch network; the other is the model tampering localization hash, which helps the model owner accurately detect the tampered locations for further recovery. Experimental results demonstrate the effectiveness of the proposed method for neural network model protection.

SDRTV-to-HDRTV via Hierarchical Dynamic Context Feature Mapping

  • Gang He
  • Kepeng Xu
  • Li Xu
  • Chang Wu
  • Ming Sun
  • Xing Wen
  • Yu-Wing Tai

In this work, we address the task of converting SDR videos to HDR videos (SDRTV-to-HDRTV conversion). Previous approaches use global feature modulation for SDRTV-to-HDRTV conversion. Feature modulation scales and shifts features in the original feature space, which has limited mapping capability. In addition, global image mapping cannot restore detail in HDR frames because of the luminance differences across regions of SDR frames. To resolve these issues, we propose a two-stage solution. The first stage is a hierarchical Dynamic Context Feature Mapping (HDCFM) model. HDCFM learns the SDR-frame-to-HDR-frame mapping function via a hierarchical feature modulation module (HME and HM) and a dynamic context feature transformation (DYCT) module. The HME estimates the feature modulation vector; HM performs hierarchical feature modulation, consisting of global feature modulation in series with local feature modulation, and adaptively maps local image features. The DYCT module constructs a feature transformation module in conjunction with the context, adaptively generating a feature transformation matrix for feature mapping. Compared with simple feature scaling and shifting, the DYCT module can map features into a new feature space and thus has greater feature mapping capability. In the second stage, we introduce a patch discriminator-based context generation model, PDCG, to obtain subjective quality enhancement of over-exposed regions. The proposed method achieves state-of-the-art objective and subjective quality results. Specifically, HDCFM achieves a PSNR gain of 0.81 dB with about 100K parameters, 1/14th the parameter count of the previous state-of-the-art methods. The test code will be released at https://github.com/cooperlike/HDCFM.

Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach

  • Chen Tang
  • Haoyu Zhai
  • Kai Ouyang
  • Zhi Wang
  • Yifei Zhu
  • Wenwu Zhu

Conventional model quantization methods apply a fixed quantization scheme to different data samples, ignoring the inherent "recognition difficulty" differences between samples. We propose to feed different data samples through varying quantization schemes to achieve data-dependent dynamic inference at a fine-grained layer level. However, enabling this adaptive inference with changeable layer-wise quantization schemes is challenging because the number of combinations of bit-widths and layers grows exponentially, making it extremely difficult to train a single model over such a vast search space and use it in practice. To solve this problem, we present the Arbitrary Bit-width Network (ABN), where the bit-widths of a single deep network can change at runtime for different data samples, with layer-wise granularity. Specifically, we first build a weight-shared, layer-wise quantizable "super-network" in which each layer can be allocated multiple bit-widths and thus quantized differently on demand. The super-network provides a considerably large number of combinations of bit-widths and layers, each of which can be used during inference without retraining or storing myriad models. Second, based on the well-trained super-network, each layer's runtime bit-width selection is modeled as a Markov Decision Process (MDP) and solved by an adaptive inference strategy. Experiments show that the super-network can be built without accuracy degradation, and the bit-width allocation of each layer can be adjusted to deal with various inputs on the fly. On ImageNet classification, we achieve a 1.1% top-1 accuracy improvement while saving 36.2% of BitOps.
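
The following toy sketch illustrates the sequential, per-layer nature of the runtime bit-width decision; the random policy is a stand-in for ABN's learned MDP policy, and the state fields are assumptions for illustration.

```python
# Toy sketch of per-layer, per-sample bit-width selection at inference time.
import random

BIT_CHOICES = [2, 4, 8]

def select_bitwidths(num_layers, policy):
    """policy(state) -> bit-width for the current layer; the state carries the
    layer index and the bits spent so far, mimicking a sequential (MDP-style) decision."""
    bits, spent = [], 0
    for layer in range(num_layers):
        b = policy({"layer": layer, "spent": spent})
        bits.append(b)
        spent += b
    return bits

# toy usage: a random policy instead of the learned one
print(select_bitwidths(8, lambda state: random.choice(BIT_CHOICES)))
```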

Privacy-preserving Reflection Rendering for Augmented Reality

  • Yiqin Zhao
  • Sheng Wei
  • Tian Guo

When virtual objects are made of reflective materials, the lighting information required to render them can include privacy-sensitive content from outside the current camera view. In this paper, we show, for the first time, that accuracy-driven multi-view environment lighting can reveal out-of-camera scene information and compromise privacy. We present a simple yet effective privacy attack that extracts sensitive scene information such as human faces and text from rendered objects under several application scenarios.

To defend against such attacks, we develop a novel IPC2S defense and a conditional R2 defense. Our IPC2S defense, combined with a generic lighting reconstruction method, preserves the scene geometry while obfuscating the privacy-sensitive information. As a proof-of-concept, we leverage existing OCR and face detection models to identify text and human faces from past camera observations and blur the color pixels associated with detected regions. We evaluate the visual quality impact of our defense by comparing rendered virtual objects to ones rendered with a generic multi-lighting reconstruction technique, ARKit, and R2 defense. Our visual and quantitative results demonstrate that our defense leads to structurally similar reflections with up to 0.98 SSIM score across various rendering scenarios while preserving sensitive information by reducing the automatic extraction success rate to at most 8.8%.
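
As a sketch of the proof-of-concept obfuscation step, the code below blurs the color pixels inside detected regions; the OCR/face-detector calls that produce the boxes are assumed and omitted.

```python
# Sketch of the defense's obfuscation step: mean-blur pixels inside detected
# face/text regions of a past camera observation (detectors not shown).
import numpy as np
from scipy.ndimage import uniform_filter

def obfuscate_regions(frame, boxes, k=15):
    """frame: (H, W, 3) image array; boxes: iterable of (x1, y1, x2, y2) detections."""
    out = frame.astype(float).copy()
    for x1, y1, x2, y2 in boxes:
        # simple mean blur as a stand-in for any stronger obfuscation
        out[y1:y2, x1:x2] = uniform_filter(out[y1:y2, x1:x2], size=(k, k, 1))
    return out
```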

SESSION: Oral Session VII: Multimedia Systems - Systems and Middleware

Confederated Learning: Going Beyond Centralization

  • Zitai Wang
  • Qianqian Xu
  • Ke Ma
  • Xiaochun Cao
  • Qingming Huang

Traditional machine learning implicitly assumes that a single entity (e.g., a person or an organization) can complete all the jobs of the whole learning process: data collection, algorithm design, parameter selection, and model evaluation. However, many practical scenarios require cooperation among entities, and existing paradigms fail to meet requirements such as cost, privacy, or security. In this paper, we consider a generalized paradigm in which different roles are granted multiple permissions to complete their corresponding jobs, which we call Confederated Learning. Systematic analysis shows that confederated learning generalizes traditional machine learning and existing distributed paradigms such as federated learning. We then study an application scenario of confederated learning that could inspire future research on cooperation between different entities. Three methods are proposed as a first attempt at confederated learning under restricted conditions. Empirical results on three datasets validate the effectiveness of the proposed methods.

R-FEC: RL-based FEC Adjustment for Better QoE in WebRTC

  • Insoo Lee
  • Seyeon Kim
  • Sandesh Sathyanarayana
  • Kyungmin Bin
  • Song Chong
  • Kyunghan Lee
  • Dirk Grunwald
  • Sangtae Ha

The demand for video conferencing applications has seen explosive growth, yet users still often face unsatisfactory quality of experience (QoE). Video conferencing applications adopt Forward Error Correction (FEC) as a recovery mechanism to meet tight latency requirements and overcome the packet losses prevalent in the network. However, many studies focus mainly on video rate control and neglect the complex interactions between this recovery mechanism and rate control, as well as its impact on user QoE. Deciding the right amount of FEC for the current video rate under a dynamically changing network environment is not straightforward. For instance, more FEC may improve tolerance to packet losses, but it may also increase latency due to FEC processing overhead and hurt video quality because of the additional bandwidth used for FEC. To address this issue, we propose R-FEC, a reinforcement learning (RL)-based framework for video and FEC bitrate decisions in video conferencing. R-FEC aims to improve overall QoE by automatically learning from the results of past decisions and adjusting video and FEC bitrates to maximize user QoE while minimizing congestion in the network. Our experiments show that R-FEC outperforms the state-of-the-art solutions in video conferencing, with up to a 27% improvement in video rate and a 6 dB PSNR improvement in video quality over the default WebRTC.
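
For illustration, a reward of the following shape could be optimized by an RL agent that jointly picks video and FEC bitrates; the terms and coefficients here are assumptions for exposition, not R-FEC's actual reward.

```python
# Hedged sketch of a QoE-style reward trading off video rate, FEC overhead,
# unrecovered losses, and delay (coefficients are illustrative only).
def qoe_reward(video_kbps, fec_kbps, loss_rate, delay_ms,
               w_rate=1.0, w_fec=0.3, w_loss=50.0, w_delay=0.05):
    # losses not covered by the redundancy budget still hurt quality
    effective_loss = max(0.0, loss_rate - fec_kbps / max(video_kbps, 1.0))
    return (w_rate * video_kbps / 1000.0      # reward higher video quality
            - w_fec * fec_kbps / 1000.0       # penalize redundancy overhead
            - w_loss * effective_loss         # penalize unrecovered losses
            - w_delay * delay_ms / 100.0)     # penalize added latency

print(qoe_reward(video_kbps=2000, fec_kbps=200, loss_rate=0.02, delay_ms=80))
```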

SESSION: Poster Session VII: Multimedia Systems -- Systems and Middleware

Physical Backdoor Attacks to Lane Detection Systems in Autonomous Driving

  • Xingshuo Han
  • Guowen Xu
  • Yuan Zhou
  • Xuehuan Yang
  • Jiwei Li
  • Tianwei Zhang

Modern autonomous vehicles adopt state-of-the-art DNN models to interpret sensor data and perceive the environment. However, DNN models are vulnerable to different types of adversarial attacks, which pose significant risks to the security and safety of the vehicles and their passengers. One prominent threat is the backdoor attack, where the adversary can compromise the DNN model by poisoning the training samples. Although much effort has been devoted to investigating backdoor attacks on conventional computer vision tasks, their practicality and applicability to the autonomous driving scenario are rarely explored, especially in the physical world.

In this paper, we target the lane detection system, which is an indispensable module for many autonomous driving tasks, e.g., navigation and lane switching. We design and realize the first physical backdoor attacks on such systems. Our attacks are comprehensively effective against different types of lane detection algorithms. Specifically, we introduce two attack methodologies (poison-annotation and clean-annotation) to generate poisoned samples. With those samples, the trained lane detection model is infected with the backdoor, which can be activated by common objects (e.g., traffic cones) to make wrong detections, leading the vehicle to drive off the road or onto the opposite lane. Extensive evaluations on public datasets and physical autonomous vehicles demonstrate that our backdoor attacks are effective, stealthy, and robust against various defense solutions. Our code and experimental videos can be found at https://sites.google.com/view/lane-detection-attack/lda.

Dynamic Transformer for Few-shot Instance Segmentation

  • Haochen Wang
  • Jie Liu
  • Yongtuo Liu
  • Subhransu Maji
  • Jan-Jakob Sonke
  • Efstratios Gavves

Few-shot instance segmentation aims to train an instance segmentation model that can quickly adapt to novel classes with only a few reference images. Existing methods are usually derived from standard detection models and tackle few-shot instance segmentation indirectly by conducting classification, box regression, and mask prediction on a large set of redundant proposals followed by indispensable post-processing, e.g., Non-Maximum Suppression. Such complicated hand-crafted procedures and hyperparameters lead to degraded optimization and insufficient generalization ability. In this work, we propose an end-to-end Dynamic Transformer Network, DTN for short, to directly segment all target object instances from arbitrary categories given by reference images, relieving the requirements of dense proposal generation and post-processing. Specifically, a small set of Dynamic Queries, conditioned on reference images, is exclusively assigned to target object instances and generates all the instance segmentation masks of reference categories simultaneously. Moreover, a Semantic-induced Transformer Decoder is introduced to constrain the cross-attention between dynamic queries and target images within the pixels of the reference category, which suppresses noisy interaction with the background and irrelevant categories. Extensive experiments are conducted on the COCO-20 dataset. The experimental results demonstrate that our proposed Dynamic Transformer Network significantly outperforms the state-of-the-art methods.

OISSR: Optical Image Stabilization Based Super Resolution on Smartphone Cameras

  • Hao Pan
  • Feitong Tan
  • Wenhao Li
  • Yi-Chao Chen
  • Guangtao Xue

Multi-frame super-resolution methods can generate high-resolution images by combining multiple captures of the same scene; however, the quality of the merged results is susceptible to degradation due to a lack of precision in image registration. In this study, we develop a robust multi-frame super-resolution method (called OISSR) for smartphone cameras with an optical image stabilizer (OIS). Acoustic injection is used to alter the readings from the built-in MEMS gyroscope to control the lens motion in the OIS module (note that the image sensor is fixed). We employ a priori knowledge of the induced lens motion to facilitate optical flow estimation with sub-pixel accuracy, and the output high-precision pixel alignment vectors are utilized to merge the multiple frames and reconstruct the final super-resolution image. Extensive experiments on an OISSR prototype implemented on a Xiaomi 10 Ultra demonstrate the high performance and effectiveness of the proposed system in obtaining 4x resolution-enhanced images.

Improving Scalability, Sustainability and Availability via Workload Distribution in Edge-Cloud Gaming

  • Iryanto Jaya
  • Yusen Li
  • Wentong Cai

Recent use of heterogeneous mobile and lightweight devices encourages computation to be abstracted away remotely as black-box systems. The same concept applies to cloud gaming, in which computer games are located and run inside remote rendering servers (RSes). While cloud gaming enables lightweight devices with sufficient input capabilities and network connectivity to play desktop games, latency and cost issues have become significant hindrances in recent applications. In this paper, we propose an edge-cloud gaming architecture that reduces the overall workload on RSes while increasing playerbase coverage by using edge RSes. Furthermore, we propose an allocation algorithm to assign incoming players to RSes. Our experiments show that the proposed architecture achieves higher playerbase coverage, while our allocation algorithm significantly reduces cost under both single and batch player arrival patterns.

Display of 3D Illuminations using Flying Light Specks

  • Shahram Ghandeharizadeh

This paper presents techniques to display 3D illuminations using Flying Light Specks, FLSs. Each FLS is a miniature (hundreds of micrometers) sized drone with one or more light sources to generate different colors and textures with adjustable brightness. It is network enabled with a processor and local storage. Synchronized swarms of cooperating FLSs render illumination of virtual objects in a pre-specified 3D volume, an FLS display. We present techniques to display both static and motion illuminations. Our display techniques consider the limited flight time of an FLS on a fully charged battery and the duration of time to charge the FLS battery. Moreover, our techniques assume failure of FLSs is the norm rather than an exception. We present a hardware and a software architecture for an FLS-display along with a family of techniques to compute flight paths of FLSs for illuminations. With motion illuminations, one technique (ICF) minimizes the overall distance traveled by the FLSs significantly when compared with the other techniques.

SESSION: Oral Session VIII: Multimedia Systems -- Transport and Delivery

Improving Generalization for Neural Adaptive Video Streaming via Meta Reinforcement Learning

  • Nuowen Kan
  • Yuankun Jiang
  • Chenglin Li
  • Wenrui Dai
  • Junni Zou
  • Hongkai Xiong

In this paper, we present a meta reinforcement learning (Meta-RL)-based neural adaptive bitrate streaming (ABR) algorithm that is able to rapidly adapt its control policy to changing network throughput dynamics. Specifically, to allow rapid adaptation, we discuss the necessity of detaching the inference of throughput dynamics from the universal control mechanism that is, in essence, shared by all potential throughput dynamics in neural ABR algorithms. To meta-learn the ABR policy, we then build a model-free system framework composed of a probabilistic latent encoder that infers the underlying dynamics from the recent throughput context, and a policy network that is conditioned on the latent variable and learns to quickly adapt to new environments. Additionally, to address the difficulties caused by training the policy on mixed dynamics, on-policy RL (or imitation learning) algorithms are suggested for policy training, with a mutual information-based regularization that makes the latent variable more informative about the policy. Finally, we implement our algorithm's meta-training and meta-adaptation procedures under a variety of throughput dynamics. Empirical evaluations on different QoE metrics and multiple datasets containing real-world network traces demonstrate that our algorithm outperforms state-of-the-art ABR algorithms in terms of average chunk QoE, consistency, and fast adaptation across a wide range of throughput patterns.
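
A structural sketch of the two components is given below with assumed dimensions: a context encoder producing a latent from recent throughput (deterministic here for brevity, whereas the paper's encoder is probabilistic) and a policy conditioned on that latent; it is not the paper's architecture.

```python
# Structural sketch of a latent-conditioned ABR policy (assumed shapes).
import torch, torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, ctx_len=8, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_len, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    def forward(self, throughput_ctx):        # (B, ctx_len) recent throughput samples
        return self.net(throughput_ctx)

class ABRPolicy(nn.Module):
    def __init__(self, state_dim=6, latent_dim=16, n_bitrates=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_bitrates))
    def forward(self, state, latent):
        return self.net(torch.cat([state, latent], dim=-1)).softmax(-1)

enc, pi = LatentEncoder(), ABRPolicy()
probs = pi(torch.randn(1, 6), enc(torch.randn(1, 8)))  # bitrate distribution for the next chunk
```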

DAO: Dynamic Adaptive Offloading for Video Analytics

  • Taslim Murad
  • Anh Nguyen
  • Zhisheng Yan

Offloading videos from end devices to edge or cloud servers is key to enabling computation-intensive video analytics. To ensure analytics accuracy at the server, the video quality for offloading must be configured based on the specific content and the available network bandwidth. While adaptive video streaming for user viewing has been widely studied, none of the existing works can guarantee the analytics accuracy at the server in a bandwidth- and content-adaptive way. To fill this gap, this paper presents DAO, a dynamic adaptive offloading framework for video analytics that jointly considers the dynamics of network bandwidth and video content. DAO is able to maximize analytics accuracy at the server by adapting the video bitrate and resolution dynamically. In essence, we shift the context of adaptive video transport from traditional DASH systems to a new dynamic adaptive offloading framework tailored for video analytics. DAO is empowered by new findings about the inherent relationship between analytics accuracy, video content, bitrate, and resolution, as well as by an optimization formulation to adapt the bitrate and resolution dynamically. Results from a real-world implementation of object detection tasks show that DAO's performance is close to the theoretical bound, achieving 20% bandwidth savings and 59% category-wise mAP improvement compared to conventional DASH schemes.
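
The selection step can be pictured as choosing, among feasible (bitrate, resolution) configurations under the current bandwidth, the one with the highest predicted accuracy; the accuracy predictor below is a toy stand-in for DAO's learned accuracy-bitrate-resolution relationship.

```python
# Illustrative selection of a (bitrate, resolution) configuration under a bandwidth budget.
def choose_config(configs, predict_accuracy, bandwidth_kbps):
    """configs: iterable of (bitrate_kbps, resolution); pick the feasible configuration
    with the highest predicted analytics accuracy, falling back to the cheapest one."""
    feasible = [c for c in configs if c[0] <= bandwidth_kbps]
    return max(feasible, key=lambda c: predict_accuracy(*c)) if feasible else min(configs)

# toy usage with a hypothetical accuracy predictor
configs = [(800, "360p"), (1500, "540p"), (3000, "720p"), (6000, "1080p")]
print(choose_config(configs, lambda bitrate, res: bitrate * 0.0001 + len(res), 2500))
```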

AggCast: Practical Cost-effective Scheduling for Large-scale Cloud-edge Crowdsourced Live Streaming

  • Rui-Xiao Zhang
  • Changpeng Yang
  • Xiaochan Wang
  • Tianchi Huang
  • Chenglei Wu
  • Jiangchuan Liu
  • Lifeng Sun

Conventional wisdom holds that, to improve viewer engagement, cloud-edge providers should serve viewers from the nearest edge nodes; however, we show that doing this for crowdsourced live streaming (CLS) services can introduce significant cost inefficiency. We observe that the massive number of channels greatly burdens the operating expenditure of cloud-edge providers and, most importantly, that the unbalanced viewer distribution makes the edge nodes suffer from significant cost inefficiency. To tackle these concerns, we propose AggCast, a novel CLS scheduling framework that optimizes edge node utilization for the cloud-edge provider. The core idea of AggCast is to aggregate viewers who are initially scattered across different regions and assign them to fewer pre-selected nodes, thereby reducing bandwidth costs. In particular, by leveraging the insights obtained from our large-scale measurement, AggCast can not only ensure quality of service (QoS) but also satisfy the systematic requirements of CLS services. AggCast has been A/B tested and fully deployed at a top cloud-edge provider in China for over eight months. The online and trace-driven experiments show that, compared to the common practice, AggCast saves over 15% of back-to-source (BTS) bandwidth costs while having no negative impact on QoS.

AdaMask: Enabling Machine-Centric Video Streaming with Adaptive Frame Masking for DNN Inference Offloading

  • Shengzhong Liu
  • Tianshi Wang
  • Jinyang Li
  • Dachun Sun
  • Mani Srivastava
  • Tarek Abdelzaher

This paper presents AdaMask, a machine-centric video streaming framework for remote deep neural network (DNN) inference. The objective is to optimize the accuracy of downstream DNNs, offloaded to a remote machine, by adaptively changing video compression control knobs at runtime. Our main contributions are twofold. First, we propose frame masking as an effective mechanism to reduce the bandwidth consumption of the video stream, preserving only regions that potentially contain objects of interest. Second, we design a new adaptation algorithm that achieves a Pareto-optimal tradeoff between accuracy and bandwidth by controlling the masked portions of frames together with conventional H.264 control knobs (e.g., resolution). Through extensive evaluations on three sensing scenarios (dash camera, traffic surveillance, and drone), frame masking reduces bandwidth consumption by up to 65% with < 1% accuracy degradation, and AdaMask improves accuracy by up to 14% over the baselines under network dynamics.
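
A minimal sketch of the frame-masking mechanism is given below: regions likely to contain objects of interest are preserved and everything else is zeroed out so that the encoder spends almost no bits on it. The boxes would come from a detector or tracker, which is omitted here.

```python
# Minimal sketch of frame masking before encoding (region proposals assumed given).
import numpy as np

def mask_frame(frame, boxes):
    """frame: (H, W, 3) uint8 image; boxes: list of (x1, y1, x2, y2) regions to preserve."""
    masked = np.zeros_like(frame)
    for x1, y1, x2, y2 in boxes:
        masked[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return masked  # the flat background compresses to near-zero bits under H.264
```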

SESSION: Poster Session VIII: Multimedia Systems -- Transport and Delivery

Learning-Based Video Coding with Joint Deep Compression and Enhancement

  • Tiesong Zhao
  • Weize Feng
  • HongJi Zeng
  • Yiwen Xu
  • Yuzhen Niu
  • Jiaying Liu

End-to-end learning-based video coding has attracted substantial attention by compressing video signals as stacked visual features. This paper proposes an end-to-end deep video codec with jointly optimized compression and enhancement modules (JCEVC). First, we propose a dual-path generative adversarial network (DPEG) to reconstruct video details after compression. An α-path and a β-path concurrently reconstruct the structure information and the local textures. Second, we reuse the DPEG network in both the motion compensation and quality enhancement modules, which are further combined with other necessary modules to form our JCEVC framework. Third, we employ joint training of deep video compression and enhancement, which further improves the rate-distortion (RD) performance of compression. Compared with the x265 LDP very fast mode, our JCEVC reduces the average bit-per-pixel (bpp) by 39.39%/54.92% at the same PSNR/MS-SSIM, outperforming state-of-the-art deep video codecs by a considerable margin. Source code is available at: https://github.com/fwz1021/JCEVC.

Structure-Preserving Motion Estimation for Learned Video Compression

  • Han Gao
  • Jinzhong Cui
  • Mao Ye
  • Shuai Li
  • Yu Zhao
  • Xiatian Zhu

Following the conventional hybrid video coding framework, existing learned video compression methods rely on the decoded previous frame as the reference for motion estimation, since it is available to the decoder. However, despite the strong representation capability of CNNs, we find this strategy is suboptimal for two reasons: (1) Motion estimation based on the decoded (often distorted) frame damages both the spatial structure of the inferred motion information and the corresponding residual for each frame, making them difficult to encode spatially on a whole-image basis using CNNs; (2) It breaks the consistent nature across frames, since the estimated motion information is no longer consistent with the movement in the original video due to the distortion in the decoded video, lowering the overall temporal coding efficiency. To overcome these problems, a novel asymmetric Structure-Preserving Motion Estimation (SPME) method is proposed, with the aim of fully exploring the ignored original previous frame at the encoder side while complying with the decoded previous frame at the decoder side. Concretely, SPME estimates a spatially structure-preserving and temporally consistent motion field by aggregating the motion predictions of both the original and the decoded reference frames w.r.t. the current frame. Critically, our method can be universally applied to existing feature-prediction-based video compression methods. Extensive experiments on several standard test datasets show that SPME can significantly enhance the state-of-the-art methods.

Learned Internet Congestion Control for Short Video Uploading

  • Tianchi Huang
  • Chao Zhou
  • Lianchen Jia
  • Rui-Xiao Zhang
  • Lifeng Sun

Short video uploading services have become increasingly important, as at least 30 million videos are uploaded per day. However, we find that existing congestion control (CC) algorithms, whether heuristic or learning-based, are not well suited to video uploading -- i.e., they lack a purpose-built fundamental mechanism and fall short of leveraging network modeling. We present DuGu, a novel learning-based CC algorithm designed around the unique properties of video uploading, via a probing phase, and of Internet networking, via a control phase. During the probing phase, DuGu leverages the transmission gaps between uploads of short videos to actively probe network metrics and better understand network dynamics. During the control phase, DuGu uses a neural network (NN) to avoid congestion. Here, instead of using handcrafted reward functions, the NN is learned by imitating the expert policy given by an optimal solver, improving both performance and learning efficiency. To build this system, we construct an omniscient-like network emulator, implement an optimal solver, and collect a large corpus of real-world network traces to learn expert strategies. Trace-driven and real-world A/B tests reveal that DuGu supports multiple objectives and rivals or outperforms existing CC algorithms across all considered scenarios.

PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification

  • Wenhao Tang
  • Sheng Huang
  • Xiaoxian Zhang
  • Luwen Huangfu

Automatic pavement distress classification facilitates improving the efficiency of pavement maintenance and reducing the cost of labor and resources. A recently influential branch of this task divides the pavement image into patches and infers the patch labels, addressing these issues from the perspective of multi-instance learning. However, these methods neglect the correlation between patches and suffer from low efficiency in model optimization and inference. As a representative vision Transformer, Swin Transformer is able to address both of these issues: it provides a succinct and efficient framework for encoding the divided patches as visual tokens, and then employs self-attention to model their relations. Built upon Swin Transformer, we present a novel vision Transformer named Pavement Image Classification Transformer (PicT) for pavement distress classification. To better exploit the discriminative information of pavement images at the patch level, we propose the Patch Labeling Teacher, which leverages a teacher model to dynamically generate pseudo labels of patches from image labels during each iteration and guides the model to learn the discriminative features of patches via patch label inference in a weakly supervised manner. The broad classification head of Swin Transformer may dilute the discriminative features of distressed patches in the feature aggregation step due to the small distressed-area ratio of pavement images. To overcome this drawback, we present a Patch Refiner that clusters patches into different groups and only selects the highest distress-risk group to yield a slim head for the final image classification. We evaluate our method on a large-scale bituminous pavement distress dataset named CQU-BPDD. Extensive results demonstrate the superiority of our method over baselines and show that PicT outperforms the second-best model by a large margin of +2.4% in P@R on the detection task and +3.9% in F1 on the recognition task, with 1.8x higher throughput, while enjoying 7x faster training speed using the same computing resources. Our code and models have been released at https://github.com/DearCaat/PicT.

Rate-Distortion-Guided Learning Approach with Cross-Projection Information for V-PCC Fast CU Decision

  • Hang Yuan
  • Wei Gao
  • Ge Li
  • Zhu Li

In video-based point cloud compression (V-PCC), the 3D dynamic point cloud sequence is projected into 2D sequences, which are compressed with a mature 2D video encoder. Notably, the encoding of the attribute sequence is extremely time-consuming, and applicable fast algorithms are still lacking because of the uniqueness of the video content and coding structure in V-PCC. This paper proposes a novel rate-distortion-guided fast attribute coding unit (CU) partitioning approach with cross-projection information for V-PCC all-intra (AI) coding. By analyzing the effectiveness of cross-projection information in guiding attribute CU partitioning, we first propose to combine the occupancy, geometry, and attribute features for the CU division decision. Afterward, considering that different CUs have different rate-distortion costs and that inaccurate predictions for different CUs affect coding performance differently, we devise a rate-distortion-guided learning approach to reduce the coding loss caused by mispredictions of the CU partition. Moreover, we carefully design an overall decision framework for CU partitioning in the V-PCC AI coding structure. Experimental results demonstrate the advantages of our approach: coding time is reduced by 62.41%, while the End-to-End BD-TotalRate loss is only 0.27%. To the best of our knowledge, the proposed fast attribute CU decision approach achieves state-of-the-art performance in V-PCC AI coding.

Evaluating the Impact of Tiled User-Adaptive Real-Time Point Cloud Streaming on VR Remote Communication

  • Shishir Subramanyam
  • Irene Viola
  • Jack Jansen
  • Evangelos Alexiou
  • Alan Hanjalic
  • Pablo Cesar

Remote communication has rapidly become a part of everyday life in both professional and personal contexts. However, popular video conferencing applications present limitations in terms of quality of communication, immersion and social meaning. VR remote communication applications offer a greater sense of co-presence and mutual sensing of emotions between remote users. Previous research on these applications has shown that realistic point cloud user reconstructions offer better immersion and communication as compared to synthetic user avatars. However, photorealistic point clouds require a large volume of data per frame and are challenging to transmit over bandwidth-limited networks. Recent research has demonstrated significant improvements to perceived quality by optimizing the usage of bandwidth based on the position and orientation of the user's viewport with user-adaptive streaming. In this work, we developed a real-time VR communication application with an adaptation engine that features tiled user-adaptive streaming based on user behaviour. The application also supports traditional network adaptive streaming. The contribution of this work is to evaluate the impact of tiled user-adaptive streaming on quality of communication, visual quality, system performance and task completion in a functional live VR remote communication system. We performed a subjective evaluation with 33 users to compare the different streaming conditions with a neck exercise training task. As a baseline, we use uncompressed streaming requiring approximately 300 megabits per second and our solution achieves similar visual quality with tiled adaptive streaming at 14 megabits per second. We also demonstrate statistically significant gains in the quality of interaction and improvements to system performance and CPU consumption with tiled adaptive streaming as compared to the more traditional network adaptive streaming.

Prism: Handling Packet Loss for Ultra-low Latency Video

  • Devdeep Ray
  • Vicente Bobadilla Riquelme
  • Srinivasan Seshan

Real-time interactive video streaming applications like cloud-based video games, AR, and VR require high quality video streams and extremely low end-to-end interaction delays. These requirements cause the QoE to be extremely sensitive to packet losses. Due to the inter-dependency between compressed frames, packet losses stall the video decode pipeline until the lost packets are retransmitted (resulting in stutters and higher delays), or the decoder state is reset using IDR-frames (lower video quality for given bandwidth). Prism is a hybrid predictive-reactive packet loss recovery scheme that uses a split-stream video coding technique to meet the needs of ultra-low latency video streaming applications. Prism's approach enables aggressive loss prediction, rapid loss recovery, and high video quality post-recovery, with zero overhead during normal operation - avoiding the pitfalls of existing approaches. Our evaluation on real video game footage shows that Prism reduces the penalty of using I-frames for recovery by 81%, while achieving 30% lower delay than pure retransmission-based recovery.

Exploring Spherical Autoencoder for Spherical Video Content Processing

  • Jin Zhou
  • Na Li
  • Yao Liu
  • Shuochao Yao
  • Songqing Chen

3D spherical content is increasingly presented in various applications (e.g., AR/MR/VR) to provide users with a more immersive experience, yet processing such spherical 3D content today still mainly relies on traditional 2D approaches after projection, leading to the distortion and/or loss of critical information. This study sets out to explore methods to process spherical 3D content directly and more effectively. Using 360-degree videos as an example, we propose a novel approach called Spherical Autoencoder (SAE) for spherical video processing. Instead of projecting to a 2D space, SAE represents the 360-degree video content as a spherical object and performs encoding and decoding on the 360-degree video directly. Furthermore, to support the adoption of SAE on pervasive mobile devices that often have resource constraints, we propose two optimizations on top of SAE. First, since FoV (Field of View) prediction is widely studied and leveraged to transport only a portion of the content to the mobile device to save bandwidth and battery consumption, we design p-SAE, an SAE scheme with partial view support that can utilize such FoV prediction. Second, since machine learning models are often compressed when running on mobile devices to reduce the processing load, which usually degrades the output (e.g., video quality in SAE), we propose c-SAE, which applies compressive sensing theory to SAE to maintain video quality when the model is compressed. Our extensive experiments show that directly incorporating and processing spherical signals is promising and outperforms the traditional approaches by a large margin. Both p-SAE and c-SAE show their effectiveness in delivering high-quality videos (e.g., in PSNR results) when used alone or combined with model compression.

Sophon: Super-Resolution Enhanced 360° Video Streaming with Visual Saliency-aware Prefetch

  • Jianxin Shi
  • Lingjun Pu
  • Xinjing Yuan
  • Qianyun Gong
  • Jingdong Xu

360° video streaming requires ultra-high bandwidth to provide an excellent immersive experience. Traditional viewport-aware streaming methods are theoretically effective but unreliable in practice due to the adverse effects of time-varying available bandwidth on the small playback buffer. To this end, we exploit the complementarity between the large buffer-based approach and the viewport-aware strategy for 360° video streaming. In this work, we present Sophon, a buffer-based and neural-enhanced streaming framework, which combines a double-buffer design, super-resolution techniques, and a viewport-aware strategy to improve user experience. Furthermore, we propose two well-suited ideas: a visual saliency-aware prefetch scheme and a super-resolution model selection scheme to address the challenges of insufficient computing resources and dynamic user preferences. Correspondingly, we introduce the prefetch and model selection metrics, and develop a lightweight buffer occupancy-based prefetch algorithm and a deep reinforcement learning method to trade off bandwidth consumption, computing resource utilization, and content quality enhancement. We implement a prototype of Sophon, and extensive evaluations corroborate its superior performance over state-of-the-art works.

Error Concealment of Dynamic 3D Point Cloud Streaming

  • Tzu-Kuan Hung
  • I-Chun Huang
  • Samuel Rhys Cox
  • Wei Tsang Ooi
  • Cheng-Hsin Hsu

The recently standardized MPEG Video-based Point Cloud Compression (V-PCC) codec has shown promise in achieving a good rate-distortion ratio for dynamic 3D point cloud compression. Current error concealment methods for V-PCC, however, lead to significantly distorted 3D point cloud frames under imperfect network conditions. To address this problem, we propose a general framework for concealing distorted and lost 3D point cloud frames due to packet loss. We also design, implement, and evaluate a suite of tools for each stage of our framework, which can be combined into multiple variants of error concealment algorithms. We conduct extensive experiments using seven dynamic 3D point cloud sequences with diverse characteristics to understand the strengths and limitations of our proposed error concealment algorithms. Our experimental results show that our algorithms outperform: (i) the method employed by V-PCC by at least 3.58 dB in Geometry Peak Signal-to-Noise Ratio (GPSNR) and 10.68 in Video Multi-Method Assessment Fusion (VMAF), and (ii) the point cloud frame copy method by up to 5.8 dB in (3D) GPSNR and 12.0 in (2D) VMAF. Further, the proposed error concealment framework and algorithms work in the 3D domain, and thus are agnostic to the codecs and applicable to future point cloud compression standards.

Personalized 360-Degree Video Streaming: A Meta-Learning Approach

  • Yiyun Lu
  • Yifei Zhu
  • Zhi Wang

Over the past decades, 360-degree videos have attracted wide interest for the immersive experience they bring to viewers. The rise of high-resolution 360-degree videos greatly challenges traditional video streaming systems in limited network environments. Given the limited bandwidth, tile-based video streaming with adaptive bitrate selection has been widely studied to improve viewers' Quality of Experience (QoE) by tiling the video frames and allocating different bitrates to tiles inside and outside viewers' viewports. Existing solutions for viewport prediction and bitrate selection train general models without catering to the intrinsic need for personalization. In this paper, we present the first meta-learning-based personalized 360-degree video streaming framework. The commonality among viewers with different viewing patterns and QoE preferences is captured by efficient meta-network designs. Specifically, we design a meta-based long short-term memory model for viewport prediction and a meta-based reinforcement learning model for bitrate selection. Extensive experiments on real-world datasets demonstrate that our framework not only outperforms the state-of-the-art data-driven approaches in prediction accuracy by 11% on average and improves QoE by 27% on average, but also quickly adapts to users with new preferences with 67%-88% fewer training epochs on average.

SESSION: Oral Session IX: Multimedia Systems -- Data Systems Management and Indexing

InDiD: Instant Disorder Detection via a Principled Neural Network

  • Evgenia Romanenkova
  • Alexander Stepikin
  • Matvey Morozov
  • Alexey Zaytsev

For sequential data, a change point is a moment of abrupt regime switch in a data stream. Such changes appear in different scenarios, ranging from simpler sensor data to more challenging video surveillance data, and we need to detect disorders as fast as possible. Classic approaches to change point detection (CPD) may underperform on semi-structured sequential data because they cannot process its structure without a proper representation. We propose a principled loss function that balances change detection delay and time to a false alarm. It approximates classic rigorous solutions but is differentiable and allows representation learning for deep models. We consider synthetic sequences, real-world sensor data, and videos with change points. We carefully labelled the available video data with change point moments and release it for the first time. Experiments suggest that complex data require meaningful representations tailored to the specifics of the CPD task, and our approach provides them, outperforming the considered baselines. For example, for explosion detection in video, the F1 score of our method is 0.53, compared to baseline scores of 0.31 and 0.35.
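
One way to picture a differentiable objective that balances detection delay against false alarms is sketched below for a sequence of per-step change probabilities; this is an illustrative surrogate, not the exact loss proposed in the paper.

```python
# Illustrative surrogate loss trading off expected detection delay and false alarms.
import torch

def cpd_loss(p, change_at, alpha=1.0, beta=1.0):
    """p: (T,) predicted change probabilities; change_at: index of the true change,
    or None for a sequence with no change."""
    if change_at is None:
        return beta * p.sum()                     # any firing is a false alarm
    delay = (1 - p[change_at:]).cumprod(0).sum()  # expected steps until detection after the change
    false_alarm = p[:change_at].sum()             # probability mass fired before the change
    return alpha * delay + beta * false_alarm

print(cpd_loss(torch.sigmoid(torch.randn(20)), change_at=12))
```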

Maze: A Cost-Efficient Video Deduplication System at Web-scale

  • An Qin
  • Mengbai Xiao
  • Ben Huang
  • Xiaodong Zhang

With the advancement and dominance of Internet video services, content-based video deduplication has become an essential and dependable piece of infrastructure for Internet video service. However, the explosively growing volume of video data on the Internet challenges the system design and implementation in terms of scalability in several ways. (1) Although quantization-based indexing techniques are effective for searching visual features at a large scale, costly re-training over the complete dataset must be done periodically. (2) The high-dimensional vectors for visual features demand increasingly large SSD space, degrading I/O performance. (3) Videos crawled from the Internet are diverse, and visually similar videos are not necessarily duplicates, increasing deduplication complexity. (4) Most videos are edited ones, and duplicate content is more likely to be discovered as clips inside the videos, demanding processing techniques with close attention to detail.

To address the above-mentioned issues, we propose Maze, a full-fledged video deduplication system. Maze has an ANNS layer that indexes and searches the high-dimensional feature vectors. The architecture of the ANNS layer supports efficient reads and writes and eliminates the data migration caused by re-training. Maze adopts CNN-based features and ORB features as the visual features, which are optimized for the specific video deduplication task. The features are compact and fully reside in memory. Acoustic features are also incorporated in Maze so that visually similar videos with different audio tracks are recognizable. A clip-based matching algorithm is developed to discover duplicate content at a fine granularity. Maze has been deployed as a production system for two years. It has indexed 1.3 billion videos and is indexing ~800 thousand videos per day. For the ANNS layer, the average read latency is 4 seconds and the average write latency is at most 4.84 seconds. Re-training over the complete dataset is no longer required no matter how many new datasets are added, eliminating the costly data migration between nodes. Maze recognizes duplicate live streaming videos with both similar appearance and similar audio at a recall of 98%. Most importantly, Maze is also cost-effective. For example, the compact feature design saves 5800 SSDs, and the computation resources devoted to running the whole system decrease to 250K standard cores per billion videos.

SESSION: Poster Session IX: Multimedia Systems -- Data Systems Management and Indexing

HyP2 Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval

  • Chengyin Xu
  • Zenghao Chai
  • Zhengzhuo Xu
  • Chun Yuan
  • Yanbo Fan
  • Jue Wang

Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carry out in-depth investigations of metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers from high computational overhead and convergence difficulty, while the proxy loss is theoretically incapable of expressing profound label dependencies and exhibits conflicts in the constructed hypersphere space. To address these problems, we propose a novel metric learning framework with a Hybrid Proxy-Pair Loss (HyP2 Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP2 Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs, integrating the sufficient data correspondence of pair-based methods with the high efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks show that the proposed method outperforms the state-of-the-art, is robust across different hash bit lengths, and achieves significant performance gains with a faster, more stable convergence speed. Our code is available at https://github.com/JerryXu0129/HyP2-Loss.
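
A simplified sketch of mixing a proxy-based term with a pair-based term computed only on irrelevant pairs is given below; the weighting and exact terms are assumptions for illustration rather than the published HyP2 formulation.

```python
# Simplified hybrid proxy-pair objective for multi-label embeddings (illustrative only).
import torch
import torch.nn.functional as F

def hybrid_proxy_pair_loss(emb, labels, proxies, lam=0.5, margin=0.5):
    """emb: (B, D) L2-normalized embeddings; labels: (B, C) multi-hot; proxies: (C, D)."""
    sim_proxy = emb @ F.normalize(proxies, dim=-1).t()            # (B, C) embedding-proxy similarity
    proxy_term = F.binary_cross_entropy_with_logits(sim_proxy, labels.float())
    sim_pair = emb @ emb.t()                                      # (B, B) pairwise similarity
    share = (labels.float() @ labels.float().t()) > 0             # pairs sharing at least one label
    irrelevant = (~share).float()
    # push irrelevant pairs apart beyond the margin
    pair_term = (irrelevant * F.relu(sim_pair - margin)).sum() / irrelevant.sum().clamp(min=1)
    return proxy_term + lam * pair_term
```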

Online Deep Learning from Doubly-Streaming Data

  • Heng Lian
  • John Scovi Atwood
  • Bo-Jian Hou
  • Jian Wu
  • Yi He

This paper investigates a new online learning problem with doubly-streaming data, where the data streams are described by feature spaces that constantly evolve, with new features emerging and old features fading away. A plausible idea for dealing with such data streams is to establish a relationship between the old and new feature spaces, so that an online learner can leverage the knowledge learned from the old features to improve learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional multimedia data with complex feature interplay, which suffers from a tradeoff between onlineness, which favors shallow learners, and expressiveness, which requires deep models. Motivated by this, we propose a novel OLD3S paradigm, where a shared latent subspace is discovered to summarize information from the old and new feature spaces, building an intermediate feature mapping relationship. A key trait of OLD3S is to treat the model capacity as a learnable semantics, aiming to jointly yield the optimal model depth and parameters in accordance with the complexity and non-linearity of the input data streams in an online fashion. Both theoretical analysis and empirical studies substantiate the viability and effectiveness of our proposed approach. The code is available online at https://github.com/X1aoLian/OLD3S.

Re-ordered Micro Image based High Efficient Residual Coding in Light Field Compression

  • Hyunmin Jung
  • Hyuk-Jae Lee
  • Chae Eun Rhee

Light field (LF), a new approach in three-dimensional image processing, has been actively used in various applications in recent years. LF involves a large amount of data, which inevitably raises LF compression (LFC) issues. Pseudo-sequence (PS)-based LFC converts a LF into a video sequence and compresses it with a video codec, whereas synthesis-based LFC (SYN-LFC) synthesizes the rest of the LF from a subset of it to reduce the number of bits. SYN-LFC is superior to PS-based LFC at low bitrates. However, its competitiveness decreases at high bitrates due to the inefficient compression of residuals. This paper maximizes the advantages of SYN-LFC by increasing the compression efficiency of residuals. To exploit the characteristics of the residual in favor of compression, this paper compresses the residual in the form of a micro image (MI). The conversion of residuals to MI has the effect of gathering similar residuals of each viewpoint, which increases the spatial coherence. However, the conventional MI conversion does not reflect the geometric characteristics of LF at all. To tackle this problem, this paper proposes the re-ordered micro image (RoMI), a novel MI conversion that takes advantage of the geometric characteristics of LF, thereby maximizing spatial coherence and compression efficiency. To compress MI-type residuals, JPEG2000, an image-level codec, is used; it is highly suitable for RoMI with spatial coherence beyond the block level. In the experimental results, the proposed RoMI shows average improvements of 30.29% and 14.05% in compression efficiency over the existing PS-based LFC and SYN-LFC methods, respectively.

Accelerating General-purpose Lossless Compression via Simple and Scalable Parameterization

  • Yu Mao
  • Yufei Cui
  • Tei-Wei Kuo
  • Chun Jason Xue

The storage of multi-media data can benefit from advancements in general-purpose lossless compression. The explosive growth of multi-media data volume in data centers demands a higher compression ratio and faster compressor run-time. However, recent deep-learning-based compressors with a high compression ratio usually build complicated dependencies on history symbols, leading to long compression times. This paper investigates the behavior of historical symbols and finds an approximate order of importance: recent symbols have a substantially larger influence on the probability estimation of the next unknown symbol. This observation guides the design of an interpretable structure for data compression, rather than learning dependencies implicitly from data as Recurrent Neural Networks (RNNs) and attention do. Based on this observation, we disentangle the compression model into order learning and feature learning, which were fused into one large module in previous works. A parameterized ordered mask unit is established to learn the ordered importance of history symbols. A fast Multi-Layer Perceptron (MLP) network is designed for efficient feature learning. The proposed compressor improves both compression performance and computational efficiency compared with transformer-based or RNN-based compressors. To further enhance computational efficiency, we propose a branch-MLP block to replace the original MLP layer. This block reduces the parameters and FLOPs of the original MLP by half without sacrificing compression performance. Experiments on multi-media data demonstrate that our model improves the compression ratio by 10% on average across data domains while accelerating compression speed by 100% compared with the state-of-the-art. The source code and appendix are released at https://github.com/mynotwo/compressor_via_simple_and_scalable_parameteri....
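
A toy sketch of the ordered-mask idea, assuming a hypothetical byte vocabulary and context length: a learnable, softmax-normalized importance vector over history positions weights the symbol embeddings before a small MLP predicts the next-symbol distribution. This only illustrates the disentangled order/feature design, not the paper's branch-MLP architecture.

```python
import torch
import torch.nn as nn

class OrderedMaskPredictor(nn.Module):
    """Toy next-symbol model: embed the last `context` symbols, weight them with a
    learnable (softmax-normalized) importance vector over positions, and feed the
    weighted sum to an MLP. A recency bias can be learned through the mask."""
    def __init__(self, vocab=256, context=16, dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos_logits = nn.Parameter(torch.zeros(context))  # learnable ordered mask
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, vocab))

    def forward(self, history):          # history: (B, context) int64 symbols
        h = self.embed(history)          # (B, context, dim)
        w = torch.softmax(self.pos_logits, dim=0)             # (context,)
        ctx = (h * w.view(1, -1, 1)).sum(dim=1)               # (B, dim)
        return self.mlp(ctx)             # (B, vocab) logits for the next symbol

model = OrderedMaskPredictor()
logits = model(torch.randint(0, 256, (4, 16)))
print(logits.shape)  # torch.Size([4, 256])
```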

SESSION: Oral Session X: Understanding Multimedia Content -- Multimodal Fusion and Embeddings

Semantic Data Augmentation based Distance Metric Learning for Domain Generalization

  • Mengzhu Wang
  • Jianlong Yuan
  • Qi Qian
  • Zhibin Wang
  • Hao Li

Domain generalization (DG) aims to learn a model on one or more different but related source domains that can generalize to an unseen target domain. Existing DG methods try to promote the diversity of source domains to improve the model's generalization ability, but they may have to introduce auxiliary networks or incur substantial computational costs. On the contrary, this work applies implicit semantic augmentation in feature space to capture the diversity of source domains. Concretely, an additional distance metric learning (DML) loss is included to optimize the local geometry of the data distribution. Besides, the logits from the cross-entropy loss with infinite augmentations are adopted as input features for the DML loss in lieu of the deep features. We also provide a theoretical analysis showing that the logits can approximate the distances defined on the original features well. Further, we provide an in-depth analysis of the mechanism and rationale behind our approach, which gives us a better understanding of why leveraging logits in lieu of features can help domain generalization. The proposed DML loss with implicit augmentation is incorporated into a recent DG method, the Fourier Augmented Co-Teacher framework (FACT). Meanwhile, our method can also be easily plugged into various DG methods. Extensive experiments on three benchmarks (Digits-DG, PACS and Office-Home) demonstrate that the proposed method achieves state-of-the-art performance.

Mix-DANN and Dynamic-Modal-Distillation for Video Domain Adaptation

  • Yuehao Yin
  • Bin Zhu
  • Jingjing Chen
  • Lechao Cheng
  • Yu-Gang Jiang

Video domain adaptation is non-trivial because video inherently involves multi-dimensional and multi-modal information. Existing works mainly adopt adversarial learning and self-supervised tasks to align features. Nevertheless, the explicit interaction between source and target in the temporal dimension, as well as the adaptation between modalities, remain unexploited. In this paper, we propose Mix-Domain-Adversarial Neural Network and Dynamic-Modal-Distillation (MD-DMD), a novel multi-modal adversarial learning framework for unsupervised video domain adaptation. Our approach incorporates the temporal information between source and target domains, as well as the diversity of adaptability between modalities. On the one hand, for every single modality, we mix frames from the source and target domains to form mix-samples, then let the adversarial discriminator predict the mix ratio of a mix-sample to further enhance the model's ability to capture domain-invariant feature representations. On the other hand, we dynamically estimate the adaptability of different modalities during training, then pick the most adaptable modality as a teacher to guide the other modalities via knowledge distillation. As a result, modalities are capable of learning transferable knowledge from each other, which leads to more effective adaptation. Experiments on two video domain adaptation benchmarks demonstrate the superiority of our proposed MD-DMD over state-of-the-art methods.
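
A minimal sketch of the mix-ratio prediction idea, under the simplifying assumption that mixing happens at the clip-feature level with a tiny regression head; gradient reversal and the dynamic-modal-distillation part are omitted.

```python
import torch
import torch.nn as nn

# Mix source and target clip features and ask a small head to regress the mix
# ratio (illustrative only; the real MD-DMD mixes frames and also uses
# adversarial training and dynamic modal distillation).
ratio_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

def mix_ratio_loss(src_feat, tgt_feat):
    """src_feat, tgt_feat: (B, 512) clip-level features from the two domains."""
    lam = torch.rand(src_feat.size(0), 1)            # one mix ratio per sample
    mixed = lam * src_feat + (1.0 - lam) * tgt_feat  # mix-sample in feature space
    pred = ratio_head(mixed)                         # predicted ratio in [0, 1]
    return nn.functional.mse_loss(pred, lam)

loss = mix_ratio_loss(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
```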

Search-oriented Micro-video Captioning

  • Liqiang Nie
  • Leigang Qu
  • Dai Meng
  • Min Zhang
  • Qi Tian
  • Alberto Del Bimbo

Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences to describe the visual content of a given video from the producer's perspective. By contrast, this work targets the search-oriented counterpart, which summarizes the given video by generating query-like sentences from the consumer's angle. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, which is trained over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model to generate different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.

Dual Part Discovery Network for Zero-Shot Learning

  • Jiannan Ge
  • Hongtao Xie
  • Shaobo Min
  • Pandeng Li
  • Yongdong Zhang

Zero-Shot Learning (ZSL) aims to recognize unseen classes by transferring knowledge from seen classes. Recent methods focus on learning a common semantic space to align visual and attribute information. However, they often over-rely on the provided attributes and ignore the category-discriminative information that contributes to accurate unseen-class recognition, resulting in weak transferability. To this end, we propose a novel Dual Part Discovery Network (DPDN) that considers both attribute and category-discriminative information by discovering attribute-guided parts and category-guided parts simultaneously to improve knowledge transfer. Specifically, for attribute-guided part discovery, DPDN localizes the regions with specific attribute information and significantly bridges the gap between visual and semantic information under the guidance of the given attributes. For category-guided part discovery, local parts are explored to discover other important regions that carry latent crucial details ignored by the attributes, with the guidance of adaptive category prototypes. To better mine transferable knowledge, we impose class-correlation constraints to regularize the category prototypes. Finally, attribute- and category-guided parts complement each other and provide adequate discriminative subtle information for more accurate unseen-class recognition. Extensive experimental results demonstrate that DPDN discovers discriminative parts and outperforms state-of-the-art methods on three standard benchmarks.

Non-Autoregressive Cross-Modal Coherence Modelling

  • Yi Bin
  • Wenhao Shi
  • Jipeng Zhang
  • Yujuan Ding
  • Yang Yang
  • Heng Tao Shen

Modelling the coherence of information is important for humans to perceive and comprehend the physical world. Existing works on coherence modelling mainly focus on a single modality, overlooking the effect of information integration and semantic consistency across modalities. To fill this research gap, this paper targets cross-modal coherence modelling, specifically the cross-modal ordering task. The task requires not only exploring the coherence information within a single modality, but also leveraging cross-modal information to model the semantic consistency between modalities. To this end, we propose a Non-Autoregressive Cross-modal Ordering Net (NACON) that adopts a basic encoder-decoder architecture. Specifically, NACON is equipped with an order-invariant context encoder to model the unordered input set and a non-autoregressive decoder to generate ordered sequences in parallel. We devise a cross-modal positional attention module in NACON to take advantage of the cross-modal order guidance. To alleviate the repetition problem of non-autoregressive models, we introduce an elegant exclusive loss to constrain the ordering exclusiveness between positions and elements. We conduct extensive experiments on two assembled datasets, SIND and TACoS-Ordering, to support our task. Experimental results show that the proposed NACON can effectively leverage cross-modal guidance and recover the correct order of the elements. The code is available at https://github.com/YiBin-CHN/CMCM.
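
One plausible way to write an exclusiveness penalty for a non-autoregressive decoder is sketched below: given each output position's softmax distribution over input elements, any element receiving more than one unit of total assignment mass is penalized. This is an illustration of the repetition-suppression idea, not NACON's exact exclusive loss.

```python
import torch

def exclusive_penalty(assign_probs):
    """assign_probs: (P, E) where each row is a position's softmax distribution
    over the input elements. Penalize elements whose total assigned mass exceeds
    1, discouraging the repetition typical of non-autoregressive decoders."""
    col_mass = assign_probs.sum(dim=0)                    # (E,) mass per element
    return torch.clamp(col_mass - 1.0, min=0.0).pow(2).mean()

probs = torch.softmax(torch.randn(5, 5), dim=1)
print(exclusive_penalty(probs))
```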

CoHOZ: Contrastive Multimodal Prompt Tuning for Hierarchical Open-set Zero-shot Recognition

  • Ning Liao
  • Yifeng Liu
  • Li Xiaobo
  • Chenyi Lei
  • Guoxin Wang
  • Xian-Sheng Hua
  • Junchi Yan

Practical image recognition often encounters samples whose labels are either totally unknown or belong to new classes outside the training set. The first problem refers to open-set recognition (OSR), in which unknown classes are recognized as a single class with no further semantic information, while the latter is called zero-shot learning (ZSL), in which new classes are usually predefined. The existing literature mostly addresses these two problems separately. In this paper, we aim to solve their combination: semantically recognizing, via zero-shot prediction, the unknown classes detected in OSR. We propose Contrastive multimodal prompt tuning for Hierarchical Open-set Zero-shot recognition (CoHOZ). Specifically, we first build a global and compatible hierarchical label tree with all downstream datasets aligned, which lays the foundation for the other modules. To detect unknown classes, we propose contrastive continuous prompt tuning, which introduces additional negative classes from the fine level of the built hierarchy for prompt learning. To generate candidate classes for zero-shot prediction on the unknown data using prompts, we use the built hierarchy to collect candidate classes from coarse to fine. In our experiments, when following the standard OSR protocol that regards all unknown classes as a single class, CoHOZ achieves new state-of-the-art performance in both unknown detection and open-set recognition. Few-shot tuning with CoHOZ also shows competitive performance on these tasks. In addition, the detailed semantic information of unknown classes is well explored, which is also verified in experiments.

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

  • Zhi-Qi Cheng
  • Qi Dai
  • Siyao Li
  • Teruko Mitamura
  • Alexander Hauptmann

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like'' event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. Obviously, this illogical framework constitutes a huge obstacle to semantic understanding. First, pre-detecting verbs solely without semantic roles inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations among the verb and roles. To this end, in this paper we propose a novel two-stage framework that focuses on utilizing such bidirectional relations within verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, where an intermediate representation for each corresponding semantic role is learned from images. In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: update the verb using nouns corresponding to the image, and update nouns using verbs from support images. Extensive experimental results on challenging SWiG benchmarks show that our renovated framework outperforms other state-of-the-art methods under various metrics.

CALM: Common-Sense Knowledge Augmentation for Document Image Understanding

  • Qinyi Du
  • Qingqing Wang
  • Keqian Li
  • Jidong Tian
  • Liqiang Xiao
  • Yaohui Jin

Performance of document image understanding has been significantly fueled by encoding multi-modal information in recent years. However, existing works heavily rely on the superficial appearance of the observed data, resulting in counter-intuitive model behavior in many critical cases. To overcome this issue, this paper proposes a common-sense knowledge augmented model, CALM, for document image understanding tasks. It first produces purified representations of document contents to extract key information and learn common-sense-augmented representations of the inputs. Then, relevant common-sense knowledge is extracted from the external ConceptNet knowledge base, and a derived knowledge graph is built to jointly enhance the common-sense reasoning capability of CALM. To further highlight the importance of common-sense knowledge in document image understanding, we propose the first question-answering dataset, CS-DVQA, focused on common-sense reasoning for document images, in which questions are answered by taking both document contents and common-sense knowledge into consideration. Through extensive evaluation, the proposed CALM approach outperforms state-of-the-art models in three document image understanding tasks: key information extraction (from 85.37 to 86.52), document image classification (from 96.08 to 96.17), and document visual question answering (from 86.72 to 88.03).

Cross-Modal Retrieval with Heterogeneous Graph Embedding

  • Dapeng Chen
  • Min Wang
  • Haobin Chen
  • Lin Wu
  • Jing Qin
  • Wei Peng

Conventional methods address the cross-modal retrieval problem by projecting the multi-modal data into a shared representation space. Such a strategy inevitably loses modality-specific information, leading to decreased retrieval accuracy. In this paper, we propose heterogeneous graph embeddings to preserve more abundant cross-modal information. The embedding from one modality is compensated with the aggregated embeddings from the other modality. In particular, a self-denoising tree search is designed to reduce the "label noise" problem, making the heterogeneous neighborhood more semantically relevant. The dual-path aggregation tackles the "modality imbalance" problem, giving each sample comprehensive dual-modality information. The final heterogeneous graph embedding is obtained by feeding the aggregated dual-modality features to the cross-modal self-attention module. Experiments conducted on cross-modality person re-identification and image-text retrieval tasks validate the superiority and generality of the proposed method.

Simple Self-supervised Multiplex Graph Representation Learning

  • Yujie Mo
  • Yuhuan Chen
  • Liang Peng
  • Xiaoshuang Shi
  • Xiaofeng Zhu

Self-supervised multiplex graph representation learning (SMGRL) aims to capture the information in a multiplex graph and generate discriminative embeddings without labels. However, previous SMGRL methods still suffer from issues of efficiency and effectiveness due to processes such as data augmentation, negative sample encoding, and complex pretext tasks. In this paper, we propose a simple method to achieve efficient and effective SMGRL. Specifically, the proposed method removes the data augmentation and negative sample encoding processes and designs a simple pretext task, to achieve efficiency. Moreover, it designs an intra-graph decorrelation loss and an inter-graph decorrelation loss to capture the common information within individual graphs and the common information across graphs, respectively, to achieve effectiveness. Extensive experimental results on the node classification task verify the efficiency and effectiveness of our method compared to 11 comparison methods on 4 public benchmark datasets.

Ordered Attention for Coherent Visual Storytelling

  • Tom Braude
  • Idan Schwartz
  • Alex Schwing
  • Ariel Shamir

We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each story sentence should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. Current approaches encode images independently, disregarding relations between images. Our approach learns to encode images with different interactions based on the story position (i.e., past image or future image). To this end, we develop a novel message-passing-like algorithm for ordered image attention (OIA) that collects interactions across all the images in the sequence. Finally, to generate the story's sentences, a second attention mechanism picks the important image attention vectors with an Image-Sentence Attention (ISA). The obtained results improve the METEOR score on the VIST dataset by 1%. Furthermore, a thorough human study confirms improvements and demonstrates that order-based interactions significantly improve coherency (64.20% vs. 28.70%). Source code is available at https://github.com/tomateb/OIAVist.git

LVI-ExC: A Target-free LiDAR-Visual-Inertial Extrinsic Calibration Framework

  • Zhong Wang
  • Lin Zhang
  • Ying Shen
  • Yicong Zhou

Recently, the multi-modal fusion of 3D LiDAR, camera, and IMU has shown great potential in automation-related applications. Yet a prerequisite for a successful fusion is that the geometric relationships among the sensors are accurately determined, which is called the extrinsic calibration problem. To date, existing target-based approaches to this problem rely on sophisticated calibration objects (sites) and well-trained operators, which is time-consuming and inflexible in practical applications. In contrast, a few target-free methods can overcome these shortcomings, but they only focus on the calibration between two types of sensors. Although it is possible to obtain LiDAR-visual-inertial extrinsics by chaining calibrations, problems such as cumbersome operation, large cumulative errors, and weak geometric consistency still exist. To this end, we propose LVI-ExC, an integrated LiDAR-Visual-Inertial Extrinsic Calibration framework, which takes natural multi-modal data as input and yields sensor-to-sensor extrinsics end-to-end without any auxiliary object (site) or manual assistance. To fuse multi-modal data, we formulate LiDAR-visual-inertial extrinsic calibration as a continuous-time simultaneous localization and mapping problem, in which the extrinsics, trajectories, time differences, and map points are jointly estimated by establishing sensor-to-sensor and sensor-to-trajectory constraints. Extensive experiments show that LVI-ExC produces precise results. With LVI-ExC's outputs, the LiDAR-visual reprojection results and the reconstructed environment map are highly consistent with the actual natural scenes, demonstrating LVI-ExC's outstanding performance. To ensure that our results are fully reproducible, all the relevant data and code have been released publicly at https://cslinzhang.github.io/LVI-ExC/.

MM-ALT: A Multimodal Automatic Lyric Transcription System

  • Xiangming Gu
  • Longshen Ou
  • Danielle Ong
  • Ye Wang

Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task due to instrumental accompaniment and musical constraints resulting in degradation of both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data of an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness. Project page is at https://n20em.github.io.

Self-supervised Exclusive Learning for 3D Segmentation with Cross-Modal Unsupervised Domain Adaptation

  • Yachao Zhang
  • Miaoyu Li
  • Yuan Xie
  • Cuihua Li
  • Cong Wang
  • Zhizhong Zhang
  • Yanyun Qu

2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing on the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristics for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids prohibitive annotation. Specifically, two self-supervised tasks are designed, named "plane-to-spatial" and "discrete-to-textured". The former helps the 2D network branch improve its perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned, and the complementarity of the modalities is strengthened, resulting in a network robust to different domains. With the supervision of these self-supervised tasks, we introduce a mixed domain that enhances the perception of the target domain by mixing patches of source- and target-domain samples. Besides, we propose domain-category adversarial learning with category-wise discriminators, constructing category prototypes to learn domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.

Cross-Compatible Embedding and Semantic Consistent Feature Construction for Sketch Re-identification

  • Yafei Zhang
  • Yongzeng Wang
  • Huafeng Li
  • Shuang Li

Sketch re-identification (Re-ID) refers to using sketches of pedestrians to retrieve their corresponding photos from surveillance videos. It can track pedestrians using sketches drawn from eyewitness accounts, without requiring pedestrian photos as queries. Although the Sketch Re-ID concept has been proposed, the gap between sketches and photos still greatly hinders pedestrian identity matching. Based on the idea of transplantation without rejection, we propose a Cross-Compatible Embedding (CCE) approach to narrow this gap. A Semantic Consistent Feature Construction (SCFC) scheme is simultaneously presented to enhance feature discrimination. Under the guidance of identity consistency, the CCE performs cross-modal interchange at the local token level in the Transformer framework, enabling the model to extract modality-compatible features. The SCFC improves the representational ability of features by handling the inconsistency of information at the same location of the sketch and the corresponding pedestrian photo. The SCFC scheme divides the local tokens of pedestrian images with different modalities into different groups and assigns specific semantic information to each group to construct a semantically consistent global feature representation. Experiments on the public Sketch Re-ID dataset confirm the effectiveness of the proposed method and its superiority over existing methods. Experiments on the sketch-based image retrieval datasets QMUL-Shoe-v2 and QMUL-Chair-v2 are conducted to assess the method's generalization. The results show that the proposed method outperforms the compared state-of-the-art works. The source code of our method is available at: https://github.com/lhf12278/CCSC.

Difference Residual Graph Neural Networks

  • Liang Yang
  • Weihang Peng
  • Wenmiao Zhou
  • Bingxin Niu
  • Junhua Gu
  • Chuan Wang
  • Yuanfang Guo
  • Dongxiao He
  • Xiaochun Cao

Graph Neural Networks (GNNs) have been widely employed for multimodal fusion and embedding. To overcome the over-smoothing issue, residual connections, which were designed to alleviate the vanishing gradient problem in NNs, are adopted in GNNs to incorporate local node information. However, these simple residual connections are ineffective on networks with heterophily, since the roles of both convolutional operations and residual connections in GNNs differ significantly from those in classic NNs. Considering the specific smoothing characteristic of the graph convolutional operation, deep layers in GNNs are expected to focus on the data that cannot be properly handled in shallow layers. To this end, we propose a novel and universal Difference Residual Connection (DRC), which feeds the difference between the output and input of the previous layer as the input of the next layer. Essentially, DRC is equivalent to inserting layers with the opposite effect (e.g., sharpening) into the network to prevent the excessive effect (e.g., over-smoothing) induced by too many layers with a similar role (e.g., smoothing) in GNNs. From the perspective of optimization, DRC is the gradient descent method that minimizes an objective function with both smoothing and sharpening terms. The analytic solution to this objective function is determined by both graph topology and node attributes, which theoretically proves that DRC can prevent the over-smoothing issue. Extensive experiments demonstrate the superiority of DRC on real networks with both homophily and heterophily, and show that DRC can automatically determine the model depth and adapt to both shallow and deep models with two complementary components.
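
The difference residual connection itself is simple to express; the sketch below re-implements it around a toy graph convolution layer, with the final sum readout and the constant feature width as assumptions.

```python
import torch
import torch.nn as nn

class ToyGCNLayer(nn.Module):
    """A minimal graph convolution: propagate features over a normalized
    adjacency, then apply a linear map and ReLU (a smoothing step)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, adj_norm, x):
        return torch.relu(self.lin(adj_norm @ x))

class DifferenceResidualGNN(nn.Module):
    """Each layer receives (output - input) of the previous layer, i.e. the
    difference residual connection described above (illustrative re-implementation)."""
    def __init__(self, dim, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([ToyGCNLayer(dim) for _ in range(num_layers)])

    def forward(self, adj_norm, x):
        inp = x
        outputs = []
        for layer in self.layers:
            out = layer(adj_norm, inp)
            outputs.append(out)
            inp = out - inp        # difference of previous layer's output and input
        return sum(outputs)        # readout choice is an assumption

n, d = 6, 16
adj = torch.eye(n)                 # toy normalized adjacency
feat = torch.randn(n, d)
print(DifferenceResidualGNN(d)(adj, feat).shape)
```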

SESSION: Poster Session X: Understanding Multimedia Content -- Multimodal Fusion and Embeddings

Normalization-based Feature Selection and Restitution for Pan-sharpening

  • Man Zhou
  • Jie Huang
  • Keyu Yan
  • Gang Yang
  • Aiping Liu
  • Chongyi Li
  • Feng Zhao

Pan-sharpening is essentially a panchromatic (PAN) image-guided low-spatial-resolution multi-spectral (MS) image super-resolution problem. A common challenge in pan-sharpening is how to correctly select and propagate the consistent features, and properly handle the inconsistent ones, between the PAN and MS modalities. To solve this issue, we propose a Normalization-based Feature Selection and Restitution mechanism, which is capable of filtering out the inconsistent features and promoting the learning of consistent ones. Specifically, we first modulate the PAN feature to the MS style in feature space via the AdaIN operation. However, such an operation inevitably removes some favorable features. We thus propose to distill the effective information from the removed part and restitute it back to the modulated part. For better distillation, we enforce a contrastive learning constraint that pulls the restituted feature close to the ground truth and pushes the removed part away from the ground truth. In this way, the consistent features of PAN images are correctly selected and the inconsistent ones are filtered out, relieving the over-transfer artifacts in the process of PAN-guided MS super-resolution. Extensive experiments validate the effectiveness of the proposed network and demonstrate its favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.
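
For reference, the AdaIN operation used in the style-modulation step can be written in a few lines of PyTorch; this is the standard AdaIN formulation, with PAN features treated as content and MS features as style, following the description above (the selection-and-restitution branches are not shown).

```python
import torch

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: re-style `content` (e.g., PAN features)
    with the channel-wise mean/std of `style` (e.g., MS features).
    Both tensors are (B, C, H, W)."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

pan = torch.randn(2, 32, 64, 64)
ms = torch.randn(2, 32, 64, 64)
print(adain(pan, ms).shape)
```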

Adaptively Learning Low-high Frequency Information Integration for Pan-sharpening

  • Man Zhou
  • Jie Huang
  • Chongyi Li
  • Hu Yu
  • Keyu Yan
  • Naishan Zheng
  • Feng Zhao

Pan-sharpening aims to generate a high-spatial-resolution multi-spectral (MS) image by fusing a high-spatial-resolution panchromatic (PAN) image with its corresponding low-spatial-resolution MS image. Despite remarkable progress, most existing pan-sharpening methods only work in the spatial domain and rarely explore potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework that adaptively learns low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: a mask prediction sub-network, a low-frequency learning sub-network and a high-frequency learning sub-network. Specifically, the first is responsible for measuring the modality-aware frequency information difference of the PAN and MS images and predicting the low-high frequency boundary in the form of a two-dimensional mask. Given this mask, the second adaptively picks out the corresponding low-frequency components of the different modalities and restores the expected low-frequency component via spatial and frequency dual-domain information integration, while the third combines the refined low-frequency component and the original high-frequency component for the latent high-frequency reconstruction. In this way, the low-high frequency information is adaptively learned, leading to pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate its favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.
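
A small sketch of the frequency-domain split that such a mask enables: here a hand-crafted circular low-pass mask stands in for the predicted two-dimensional mask, purely for illustration.

```python
import torch

def split_low_high(x, mask):
    """x: (B, C, H, W) feature map; mask: (H, W) with values in [0, 1] marking
    the low-frequency region of the centered spectrum. Returns (low, high)."""
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    high = x - low
    return low, high

h = w = 32
yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
radius = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
mask = (radius <= 8).float()            # hand-crafted circular low-pass mask
low, high = split_low_high(torch.randn(1, 3, h, w), mask)
print(low.shape, high.shape)
```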

Complementary Graph Representation Learning for Functional Neuroimaging Identification

  • Rongyao Hu
  • Liang Peng
  • Jiangzhang Gan
  • Xiaoshuang Shi
  • Xiaofeng Zhu

Functional connectomics studies on resting-state functional magnetic resonance imaging (rs-fMRI) data have become a popular approach to early disease diagnosis. However, previous methods did not jointly consider the global patterns, the local patterns, and the temporal information of the blood-oxygen-level-dependent (BOLD) signals, restricting model effectiveness for early disease diagnosis. In this paper, we propose a new graph convolutional network (GCN) method that captures local and global patterns for dynamic functional connectivity analysis. Specifically, we first employ the sliding-window method to partition the original BOLD signals into multiple segments, aiming to achieve dynamic functional connectivity analysis, and then design a multi-view node classification and a temporal graph classification to output two kinds of representations, which capture the temporally global patterns and the temporally local patterns, respectively. We further fuse these two kinds of representations by a weighted concatenation method, whose effectiveness is also experimentally validated. Experimental results on real datasets demonstrate the effectiveness of our method compared to comparison methods on different classification tasks.
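
The sliding-window segmentation with per-window functional connectivity can be sketched as follows; the window length, stride, and use of Pearson correlation are placeholder choices, not the paper's exact settings.

```python
import numpy as np

def dynamic_fc(bold, window=30, stride=5):
    """bold: (T, R) BOLD time series for R regions of interest.
    Returns a list of (R, R) Pearson correlation matrices, one per window."""
    mats = []
    for start in range(0, bold.shape[0] - window + 1, stride):
        segment = bold[start:start + window]        # (window, R)
        mats.append(np.corrcoef(segment, rowvar=False))
    return mats

bold = np.random.randn(150, 90)                     # 150 time points, 90 ROIs
fc = dynamic_fc(bold)
print(len(fc), fc[0].shape)                         # number of windows, (90, 90)
```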

Dynamically Adjust Word Representations Using Unaligned Multimodal Information

  • Jiwei Guo
  • Jiajia Tang
  • Weichen Dai
  • Yu Ding
  • Wanzeng Kong

Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges that exist in this area are a) multimodal data is unaligned in nature due to the different sampling rates of each modality, and b) long-range dependencies between elements across modalities. These challenges increase the difficulty of conducting efficient multimodal fusion. In this work, we propose a novel end-to-end network named Cross Hyper-modality Fusion Network (CHFN). The CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It is concerned with the influence of non-verbal behavioral information at the scale of the entire utterances and then integrates this influence into verbal expression. We conducted experiments on both publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experiment results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between language modality and non-verbal behavior information and explore the underlying dynamics of multimodal language data.

Bipartite Graph-based Discriminative Feature Learning for Multi-View Clustering

  • Weiqing Yan
  • Jindong Xu
  • Jinglei Liu
  • Guanghui Yue
  • Chang Tang

Multi-view clustering is an important technique in machine learning research. Although existing methods have improved clustering performance, most of them learn the graph structure from all samples, which incurs high complexity. Bipartite graph-based multi-view clustering obtains the clustering result by establishing relationships between the sample points and a small set of anchor points, which improves the efficiency of clustering. However, most bipartite graph-based clustering methods only focus on learning the topological graph structure from sample nodes and ignore the influence of node features. In this paper, we propose bipartite graph-based discriminative feature learning for multi-view clustering, which combines bipartite graph learning and discriminative feature learning in a unified framework. Specifically, bipartite graph learning is performed via multi-view subspace representation with manifold regularization terms. Meanwhile, our feature learning utilizes data pseudo-labels obtained from the fused bipartite graph to seek a projection direction that pulls samples with the same label closer and pushes data points with different labels farther apart. Finally, the proposed manifold regularization terms establish the relationship between the constructed bipartite graph and the new data representation. By leveraging the interactions between structure learning and discriminative feature learning, we are able to select more informative features and capture a more accurate data structure for clustering. Extensive experimental results on datasets of different scales demonstrate that our method achieves better or comparable clustering performance compared with state-of-the-art methods.
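
As an illustration of the anchor-based construction that makes bipartite methods efficient, the sketch below builds a sample-to-anchor affinity matrix with k-nearest-anchor Gaussian weights; random anchor selection and the kernel bandwidth are assumptions (anchors are often chosen by k-means in practice).

```python
import numpy as np

def bipartite_graph(X, num_anchors=50, k=5, sigma=1.0, seed=0):
    """Build an (n, m) sample-to-anchor affinity matrix: keep the k nearest
    anchors per sample with Gaussian weights, then row-normalize."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), num_anchors, replace=False)]   # (m, d)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)     # (n, m) squared dists
    Z = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, :k]                            # k nearest anchors
    rows = np.repeat(np.arange(len(X)), k)
    Z[rows, nn.ravel()] = np.exp(-d2[rows, nn.ravel()] / (2 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)

Z = bipartite_graph(np.random.randn(200, 10))
print(Z.shape)   # (200, 50)
```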

Dynamic Incomplete Multi-view Imputing and Clustering

  • Xingfeng Li
  • Quansen Sun
  • Zhenwen Ren
  • Yinghui Sun

Incomplete multi-view clustering (IMVC) is deemed a significant research topic in multimedia for handling data-loss situations. Current late-fusion incomplete multi-view clustering methods have attracted intensive attention owing to their superiority in using a consensus partition for effective and efficient imputation and clustering. However, 1) their imputation quality and clustering performance depend heavily on a static prior partition, such as predefined zero filling, which destroys the diversity of different views; and 2) the size of the base partitions is too small, which loses advantageous details of the base kernels and decreases clustering performance. To address these issues, we propose a novel IMVC method named Dynamic Incomplete Multi-view Imputing and Clustering (DIMIC). Concretely, the observed views dynamically generate a consensus proxy under the guidance of a shared cluster matrix for more effective imputation and clustering, rather than relying on a fixed predefined partition matrix. Furthermore, a proper size of base partitions is employed to preserve sufficient kernel details and further enhance the quality of the consensus proxy. We design a solver with linear computational and memory complexity, and extensive experiments on multiple public datasets validate the effectiveness, superiority, and efficiency of our method compared with recent advances.

Learning Smooth Representation for Multi-view Subspace Clustering

  • Shudong Huang
  • Yixi Liu
  • Yazhou Ren
  • Ivor W. Tsang
  • Zenglin Xu
  • Jiancheng Lv

Multi-view subspace clustering aims to exploit the consensus of data correlations among multiple views, and can essentially be treated as a graph-based approach. However, existing methods usually suffer from suboptimal solutions as the raw data might not be separable into subspaces. In this paper, we propose to achieve a smooth representation for each view and thus facilitate the downstream clustering task. It is based on the assumption that a graph signal is smooth if nearby nodes on the graph have similar feature representations. Specifically, our model is able to retain the graph geometric features by applying a low-pass filter to extract the smooth representations of multiple views. Besides, our method performs smooth representation learning and multi-view clustering interactively in a unified framework, hence it is an end-to-end single-stage learning problem. Substantial experiments on benchmark multi-view datasets validate the effectiveness of the proposed method compared to the state-of-the-arts in terms of clustering performance.
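
A minimal sketch of a low-pass graph filter applied to node features, assuming the common choice (I - 0.5 L_sym)^k; the specific filter order and coefficient are illustrative, not necessarily the paper's.

```python
import numpy as np

def smooth_features(adj, X, k=2):
    """Apply the low-pass filter (I - 0.5 * L_sym)^k to feature matrix X, where
    L_sym = I - D^{-1/2} A D^{-1/2} is the symmetric normalized Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    a_norm = d_inv_sqrt @ adj @ d_inv_sqrt
    filt = np.eye(len(adj)) - 0.5 * (np.eye(len(adj)) - a_norm)   # I - 0.5 * L_sym
    out = X.copy()
    for _ in range(k):
        out = filt @ out
    return out

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 8)
print(smooth_features(A, X).shape)
```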

LFBCNet: Light Field Boundary-aware and Cascaded Interaction Network for Salient Object Detection

  • Mianzhao Wang
  • Fan Shi
  • Xu Cheng
  • Meng Zhao
  • Yao Zhang
  • Chen Jia
  • Weiwei Tian
  • Shengyong Chen

In light field imaging techniques, the abundance of stereo spatial information aids in improving the performance of salient object detection. In some complex scenes, however, applying the 4D light field boundary structure to discriminate salient objects from background regions is still under-explored. In this paper, we propose a light field boundary-aware and cascaded interaction network based on light field macro-EPI, named LFBCNet. Firstly, we propose a well-designed light field multi-epipolar-aware learning (LFML) module to learn rich salient boundary cues by perceiving the continuous angle changes from light field macro-EPI. Secondly, to fully excavate the correlation between salient objects and boundaries at different scales, we design multiple light field boundary interactive (LFBI) modules and cascade them to form a light field multi-scale cascade interaction decoder network. Each LFBI is assigned to predict exquisite salient objects and boundaries by interactively transmitting the salient object and boundary features. Meanwhile, the salient boundary features are forced to gradually refine the salient object features during the multi-scale cascade encoding. Furthermore, a light field multi-scale-fusion prediction (LFMP) module is developed to automatically select and integrate multi-scale salient object features for final saliency prediction. The proposed LFBCNet can accurately distinguish tiny differences between salient objects and background regions. Comprehensive experiments on large benchmark datasets prove that the proposed method achieves competitive performance over 2-D, 3-D, and 4-D salient object detection methods.

Multiple Kernel Clustering with Dual Noise Minimization

  • Junpu Zhang
  • Liang Li
  • Siwei Wang
  • Jiyuan Liu
  • Yue Liu
  • Xinwang Liu
  • En Zhu

Clustering is a representative unsupervised method widely applied in multi-modal and multi-view scenarios. Multiple kernel clustering (MKC) aims to group data by integrating complementary information from base kernels. As a representative approach, late-fusion MKC first decomposes the kernels into orthogonal partition matrices and then learns a consensus one from them, achieving promising performance recently. However, these methods fail to consider the noise inside the partition matrices, preventing further improvement of clustering performance. We discover that this noise can be disassembled into two separable parts, i.e., N-noise and C-noise (null-space noise and column-space noise). In this paper, we rigorously define dual noise and propose a novel parameter-free MKC algorithm that minimizes it. To solve the resulting optimization problem, we design an efficient two-step iterative strategy. To the best of our knowledge, this is the first work to investigate dual noise within the partition in the kernel space. We observe that dual noise pollutes the block-diagonal structures and causes the degeneration of clustering performance, and that C-noise is more destructive than N-noise. Owing to our efficient mechanism for minimizing dual noise, the proposed algorithm surpasses recent methods by large margins.

Webly Supervised Image Hashing with Lightweight Semantic Transfer Network

  • Hui Cui
  • Lei Zhu
  • Jingjing Li
  • Zheng Zhang
  • Weili Guan

Recent studies have verified the success of deep hashing for efficient image retrieval. However, most existing methods require abundant human-labeled data to optimize the large number of involved network parameters, which consequently restricts the scalability of deep image hashing. Alternatively, learning from freely available web images that inherently include rich semantics is a promising strategy. Nevertheless, the domain distribution gap prevents transferring the semantics of the source web images to the target images. Besides, most existing deep image hashing methods suffer from excessive training time to achieve satisfactory performance without explicit supervision. How to efficiently train the deep image hashing network is therefore another important problem that needs to be seriously considered. In this paper, we propose Webly Supervised Image Hashing (WSIH) with a well-designed lightweight network. Our model enhances the semantics of unsupervised image hashing with weak supervision from freely available web images, and simultaneously avoids involving over-abundant parameters in the deep network architecture. Particularly, we train a concept prototype learning network on the web images, learning well-trained network parameters and prototype codes that hold the discriminative semantics of the potential visual concepts in the target images. Further, we meticulously design a lightweight siamese network architecture and a dual-level transfer mechanism to efficiently translate the semantics learned from the source web images to the target images. Experiments on two widely-tested image datasets show the superiority of the proposed method in both retrieval accuracy and training efficiency compared to state-of-the-art image hashing methods. The source code of our method is available at: https://github.com/christinecui/WSIH.

Rethinking Super-Resolution as Text-Guided Details Generation

  • Chenxi Ma
  • Bo Yan
  • Qing Lin
  • Weimin Tan
  • Siming Chen

Deep neural networks have greatly promoted the performance of single image super-resolution (SISR). Conventional methods still resort to restoring a single high-resolution (HR) solution based only on the input image modality. However, image-level information is insufficient to predict adequate details and photo-realistic visual quality under large upscaling factors (×8, ×16). In this paper, we propose a new perspective that regards SISR as a semantic image detail enhancement problem, aiming to generate semantically reasonable HR images that are faithful to the ground truth. To enhance the semantic accuracy and the visual quality of the reconstructed image, we explore multi-modal fusion learning in SISR by proposing a Text-Guided Super-Resolution (TGSR) framework, which can effectively utilize information from the text and image modalities. Different from existing methods, the proposed TGSR generates HR image details that match the text descriptions through a coarse-to-fine process. Extensive experiments and ablation studies demonstrate the effectiveness of TGSR, which exploits the text reference to recover realistic images.

DEAL: An Unsupervised Domain Adaptive Framework for Graph-level Classification

  • Nan Yin
  • Li Shen
  • Baopu Li
  • Mengzhu Wang
  • Xiao Luo
  • Chong Chen
  • Zhigang Luo
  • Xian-Sheng Hua

Graph neural networks (GNNs) have achieved state-of-the-art results on graph classification tasks. They have been primarily studied in cases of supervised end-to-end training, which requires abundant task-specific labels. Unfortunately, annotating labels of graph data could be prohibitively expensive or even impossible in many applications. An effective solution is to incorporate labeled graphs from a different, but related source domain, to develop a graph classification model for the target domain. However, the problem of unsupervised domain adaptation for graph classification is challenging due to potential domain discrepancy in graph space as well as the label scarcity in the target domain. In this paper, we present a novel GNN framework named DEAL by incorporating both source graphs and target graphs, which is featured by two modules, i.e., adversarial perturbation and pseudo-label distilling. Specifically, to overcome domain discrepancy, we equip source graphs with target semantics by applying to them adaptive perturbations which are adversarially trained against a domain discriminator. Additionally, DEAL explores distinct feature spaces at different layers of the GNN encoder, which emphasize global and local semantics respectively. Then, we distill the consistent predictions from two spaces to generate reliable pseudo-labels for sufficiently utilizing unlabeled data, which further improves the performance of graph classification. Extensive experiments on a wide range of graph classification datasets reveal the effectiveness of our proposed DEAL.

AVQA: A Dataset for Audio-Visual Question Answering on Videos

  • Pinci Yang
  • Xin Wang
  • Xuguang Duan
  • Hong Chen
  • Runze Hou
  • Cong Jin
  • Wenwu Zhu

Audio-visual question answering aims to answer questions regarding both audio and visual modalities in a given video, and has drawn increasing research interest in recent years. However, there have been no appropriate datasets for this challenging task on videos in real-life scenarios so far. They are either designed with questions containing only visual clues without taking any audio information into account, or considering audio with restrictions to specific scenarios, such as panoramic videos and videos about music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos in real-life scenarios. We collect 57,015 videos from daily audio-visual activities and 57,335 specially-designed question-answer pairs relying on clues from both modalities, where information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among audio, visual, and text modalities and conduct ablation studies to analyze the role of different modalities on our datasets. Experimental results show that our proposed method significantly improves the audio-visual question answering performance over various question types. Therefore, AVQA can provide an adequate testbed for the generation of models with a deeper understanding of multimodal information on audio-visual question answering in real-life scenarios. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)

Prompting for Multi-Modal Tracking

  • Jinyu Yang
  • Zhe Li
  • Feng Zheng
  • Ales Leonardis
  • Jingkuan Song

Multi-modal tracking has gained attention due to its ability to be more accurate and robust in complex scenarios than traditional RGB-based tracking. Its key lies in how to fuse multi-modal data and reduce the gap between modalities. However, multi-modal tracking still severely suffers from data deficiency, which results in insufficient learning of fusion modules. Instead of building such a fusion module, in this paper we provide a new perspective on multi-modal tracking by attaching importance to multi-modal visual prompts. We design a novel multi-modal prompt tracker (ProTrack), which can transfer multi-modal inputs to a single modality via the prompt paradigm. By fully exploiting the tracking ability of pre-trained RGB trackers learned at scale, our ProTrack achieves high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data. Extensive experiments on 5 benchmark datasets demonstrate the effectiveness of the proposed ProTrack.

mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar

  • Anjun Chen
  • Xiangyu Wang
  • Shaohao Zhu
  • Yanxu Li
  • Jiming Chen
  • Qi Ye

Millimeter Wave (mmWave) radar is gaining popularity as it can work in adverse environments such as smoke, rain, snow, and poor lighting. Prior work has explored the possibility of reconstructing 3D skeletons or meshes from the noisy and sparse mmWave radar signals. However, it is unclear how accurately we can reconstruct the 3D body from mmWave signals across scenes and how it performs compared with cameras, which are important aspects to consider when either using mmWave radars alone or combining them with cameras. To answer these questions, an automatic 3D body annotation system is first designed and built with multiple sensors to collect a large-scale dataset. The dataset consists of synchronized and calibrated mmWave radar point clouds and RGB(D) images in different scenes, together with skeleton/mesh annotations for the humans in the scenes. With this dataset, we train state-of-the-art methods with inputs from different sensors and test them in various scenarios. The results demonstrate that 1) despite the noise and sparsity of the generated point clouds, the mmWave radar achieves better reconstruction accuracy than the RGB camera but worse than the depth camera; and 2) reconstruction from the mmWave radar is moderately affected by adverse weather conditions, while the RGB(D) camera is severely affected. Further, analysis of the dataset and the results sheds insights on improving reconstruction from the mmWave radar and on combining signals from different sensors.

Eliminating Spatial Ambiguity for Weakly Supervised 3D Object Detection without Spatial Labels

  • Haizhuang Liu
  • Huimin Ma
  • Yilin Wang
  • Bochao Zou
  • Tianyu Hu
  • Rongquan Wang
  • Jiansheng Chen

Previous weakly-supervised methods of 3D object detection in driving scenes mainly rely on spatial labels, which provide the location, dimension, or orientation information. The annotation of 3D spatial labels is time-consuming. There also exist methods that do not require spatial labels, but their detections may fall on object parts rather than entire objects or backgrounds. In this paper, a novel cross-modal weakly-supervised 3D progressive refinement framework (WS3DPR) for 3D object detection that only needs image-level class annotations is introduced. The proposed framework consists of two stages: 1) classification refinement for potential objects localization and 2) regression refinement for spatial pseudo labels reasoning. In the first stage, a region proposal network is trained by cross-modal class knowledge transferred from 2D image to 3D point cloud and class information propagation. In the second stage, the locations, dimensions, and orientations of 3D bounding boxes are further refined with geometric reasoning based on 2D frustum and 3D region. When only image-level class labels are available, proposals with different 3D locations become overlapped in 2D, leading to the misclassification of foreground objects. Therefore, a 2D-3D semantic consistency block is proposed to disentangle different 3D proposals after projection. The overall framework progressively learns features in a coarse to fine manner. Comprehensive experiments on the KITTI3D dataset demonstrate that our method achieves competitive performance compared with previous methods with a lightweight labeling process.

Dynamic Graph Reasoning for Multi-person 3D Pose Estimation

  • Zhongwei Qiu
  • Qiansheng Yang
  • Jian Wang
  • Dongmei Fu

Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowded scenes. To solve these problems, most existing methods explore modeling body context cues by enhancing feature representations with graph neural networks or by adding structural constraints. However, these methods are not robust due to their single-root formulation, which decodes 3D poses from a root node with a pre-defined graph. In this paper, we propose GR-M3D, which models multi-person 3D pose estimation with dynamic Graph Reasoning. The decoding graph in GR-M3D is predicted instead of pre-defined. In particular, it first generates several data maps and enhances them with a scale- and depth-aware refinement module (SDAR). Then multiple root keypoints and dense decoding paths for each person are estimated from these data maps. Based on them, dynamic decoding graphs are built by assigning path weights to the decoding paths, while the path weights are inferred from the enhanced data maps. This process is named dynamic graph reasoning (DGR). Finally, the 3D poses are decoded according to the dynamic decoding graphs for each detected person. GR-M3D can implicitly adjust the structure of the decoding graph by adopting soft path weights according to the input data, which makes the decoding graphs adaptive to different input persons and more capable of handling occlusion and depth ambiguity than previous methods. We empirically show that the proposed bottom-up approach even outperforms top-down methods and achieves state-of-the-art results on three 3D pose datasets.

DiT: Self-supervised Pre-training for Document Image Transformer

  • Junlong Li
  • Yiheng Xu
  • Tengchao Lv
  • Lei Cui
  • Cha Zhang
  • Furu Wei

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 - 92.69), document layout analysis (91.0 - 94.9), table detection (94.23 - 96.55) and text detection for OCR (93.07 - 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.

Learning to Estimate External Forces of Human Motion in Video

  • Nathan Louis
  • Jason J. Corso
  • Tylan N. Templin
  • Travis D. Eliason
  • Daniel P. Nicolella

Analyzing sports performance or preventing injuries requires capturing the ground reaction forces (GRFs) exerted by the human body during certain movements. Standard practice uses physical markers paired with force plates in a controlled environment, but this is marred by high costs, lengthy implementation time, and variance across repeat experiments; hence, we propose GRF inference from video. While recent work has used LSTMs to estimate GRFs from 2D viewpoints, these can be limited in their modeling and representation capacity. We first propose using a transformer architecture for the GRF-from-video task, being the first to do so. We then introduce a new loss that targets high-impact peaks in the regressed curves. We also show that pre-training and multi-task learning on 2D-to-3D human pose estimation improves generalization to unseen motions, and that pre-training on this different task provides good initial weights when fine-tuning on smaller (rarer) GRF datasets. We evaluate on LAAS Parkour and a newly collected ForcePose dataset and show up to a 19% decrease in error compared to prior approaches.
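
The abstract mentions a new loss aimed at high-impact peaks in the regressed GRF curves but does not spell out its form. As a purely hypothetical illustration, the sketch below up-weights the regression error at frames where the ground-truth force is near its peak; the weighting scheme, function names, and the hyperparameter alpha are assumptions, not the authors' formulation.

```python
import torch

def peak_weighted_mse(pred, target, alpha=4.0):
    """Hypothetical peak-aware regression loss.

    pred, target: (batch, time) ground reaction force curves.
    Frames near the peak of the ground-truth curve get up-weighted,
    so errors at high-impact moments are penalized more strongly.
    """
    # Normalize each ground-truth curve so the weighting is invariant
    # to the absolute force scale of a given motion.
    peak = target.abs().amax(dim=1, keepdim=True).clamp_min(1e-6)
    weight = 1.0 + alpha * (target.abs() / peak)   # in [1, 1 + alpha]
    return (weight * (pred - target) ** 2).mean()

if __name__ == "__main__":
    pred = torch.randn(8, 100)
    target = torch.randn(8, 100)
    print(peak_weighted_mse(pred, target).item())
```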

Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

  • Meihuizi Jia
  • Xin Shen
  • Lei Shen
  • Jinhui Pang
  • Lejian Liao
  • Yang Song
  • Meng Chen
  • Xiaodong He

Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017, and yields competitive results on Twitter2015.

Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks

  • Md Fahim Faysal Khan
  • Anusha Devulapally
  • Siddharth Advani
  • Vijaykrishnan Narayanan

Accurately measuring the absolute depth of every pixel captured by an imaging sensor is of critical importance in real-time applications such as autonomous navigation, augmented reality and robotics. In order to predict dense depth, a general approach is to fuse sensor inputs from different modalities such as LiDAR, camera and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth data but are quite sparse, both spatially and temporally. To augment missing depth information, generally RGB guidance is leveraged due to its high resolution information. Due to the reliance on multiple sensor modalities, design for robustness and adaptation is essential. In this work, we propose a transformer-like self-attention based generative adversarial network to estimate dense depth using RGB and sparse depth data. We introduce a novel training recipe for making the model robust so that it works even when one of the input modalities is not available. The multi-head self-attention mechanism can dynamically attend to most salient parts of the RGB image or corresponding sparse depth data producing the most competitive results. Our proposed network also requires less memory for training and inference compared to other existing heavily residual connection based convolutional neural networks, making it more suitable for resource-constrained edge applications. The source code is available at: https://github.com/kocchop/robust-multimodal-fusion-gan

Caption-Aware Medical VQA via Semantic Focusing and Progressive Cross-Modality Comprehension

  • Fuze Cong
  • Shibiao Xu
  • Li Guo
  • Yinbing Tian

Medical Visual Question Answering, as a domain-specific task, requires substantial prior knowledge of medicine. However, deep learning techniques encounter severe problems of limited supervision due to the scarcity of well-annotated large-scale medical VQA datasets. To mitigate this data limitation, image captioning can be introduced to learn summary information about the picture, which is beneficial to question answering. To this end, we propose a caption-aware VQA method that can read the summary information of image content and clinical diagnoses from plenty of medical images and answer medical questions with richer multimodal features. The proposed method consists of two novel components emphasizing semantic locations and semantic content, respectively. First, to extract and leverage the semantic locations implied in image captioning, a similarity analysis is designed to summarize the attention maps generated from image captioning by their relevance and to guide the visual model to focus on semantic-rich regions. Besides, to incorporate the semantic content in the generated captions, we propose a Progressive Compact Bilinear Interactions structure to achieve cross-modality comprehension over the image, question and caption features by performing bilinear attention in a gradual manner. Qualitative and quantitative experiments on various medical datasets exhibit the superiority of the proposed approach compared to state-of-the-art methods.

Complementarity-Enhanced and Redundancy-Minimized Collaboration Network for Multi-agent Perception

  • Guiyang Luo
  • Hui Zhang
  • Quan Yuan
  • Jinglin Li

Multi-agent collaborative perception depends on sharing sensory information to improve perception accuracy and robustness, as well as to extend coverage. The cooperative shared information between agents should achieve an equilibrium between redundancy and complementarity, thus creating a concise and composite representation. To this end, this paper presents a complementarity-enhanced and redundancy-minimized collaboration network (CRCNet), for efficiently guiding and supervising the fusion among shared features. Our key novelties lie in two aspects. First, each fused feature is forced to bring about a marginal gain by exploiting a contrastive loss, which can supervise our model to select complementary features. Second, mutual information is applied to measure the dependence between fused feature pairs and the upper bound of mutual information is minimized to encourage independence, thus guiding our model to select irredundant features. Furthermore, the above modules are incorporated into a feature fusion network CRCNet. Our quantitative and qualitative experiments in collaborative object detection show that CRCNet performs better than the state-of-the-art methods.

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

  • Qian Yang
  • Yunxin Li
  • Baotian Hu
  • Lin Ma
  • Yuxin Ding
  • Min Zhang

Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence that explains the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image and ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to visual-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG utilizes lexical constraints to expressly incorporate the words or chunks focused on by the relation inferrer into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and experimental results indicate that CALeC significantly outperforms other competitor models in inference accuracy and quality of generated explanations.

Two-Stream Transformer for Multi-Label Image Classification

  • Xuelin Zhu
  • Jiuxin Cao
  • Jiawei Ge
  • Weijia Liu
  • Bo Liu

Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects from a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with a global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from spatial stream into semantic stream and update label semantics via a residual connection. In this way, the semantic gap between two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, including Pascal VOC 2007, Microsoft COCO and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.
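
To make the described layer-wise cross-modal interaction concrete, here is a minimal PyTorch sketch of a cross-modal attention block in which label-semantic queries attend to spatial patch features and are updated through a residual connection. The module structure, names, and dimensions are illustrative assumptions rather than the TSFormer implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Label embeddings (queries) attend to patch features (keys/values)
    and are refined via a residual connection, layer by layer."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, label_emb, patch_feat):
        # label_emb:  (B, num_labels, dim)  semantic stream
        # patch_feat: (B, num_patches, dim) spatial stream
        attended, _ = self.attn(query=label_emb, key=patch_feat, value=patch_feat)
        return self.norm(label_emb + attended)   # residual update of label semantics

if __name__ == "__main__":
    block = CrossModalAttention(dim=256)
    labels = torch.randn(2, 20, 256)    # 20 candidate labels
    patches = torch.randn(2, 196, 256)  # 14x14 patch tokens
    print(block(labels, patches).shape)  # torch.Size([2, 20, 256])
```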

SoftSkip: Empowering Multi-Modal Dynamic Pruning for Single-Stage Referring Comprehension

  • Dulanga Weerakoon
  • Vigneshwaran Subbaraju
  • Tuan Tran
  • Archan Misra

Supporting real-time referring expression comprehension (REC) on pervasive devices is an important capability for human-AI collaborative tasks. Model pruning techniques, applied to DNN models, can enable real-time execution even on resource-constrained devices. However, existing pruning strategies are designed principally for uni-modal applications, and suffer a significant loss of accuracy when applied to REC tasks that require fusion of textual and visual inputs. We thus present a multi-modal pruning model, LGMDP, which uses language as a pivot to dynamically and judiciously select the relevant computational blocks that need to be executed. LGMDP also introduces a new SoftSkip mechanism, whereby 'skipped' visual scales are not completely eliminated but approximated with minimal additional computation. Experimental evaluation, using 3 benchmark REC datasets and an embedded device implementation, shows that LGMDP can achieve 33% latency savings with an accuracy loss of only 0.5% - 2%.
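
As a rough illustration of the 'soft skip' idea — a skipped visual scale is approximated cheaply rather than dropped outright — the sketch below blends an expensive visual branch with a lightweight approximation using a language-conditioned gate. This is an assumed reconstruction for illustration, not the LGMDP implementation.

```python
import torch
import torch.nn as nn

class SoftSkipBlock(nn.Module):
    """Illustrative soft-skip: a language-conditioned gate decides how much of
    the expensive visual branch to use versus a cheap approximation."""
    def __init__(self, dim, lang_dim):
        super().__init__()
        self.full = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(dim, dim, 3, padding=1))
        self.cheap = nn.Conv2d(dim, dim, 1)          # minimal extra computation
        self.gate = nn.Linear(lang_dim, 1)

    def forward(self, feat, lang):
        g = torch.sigmoid(self.gate(lang)).view(-1, 1, 1, 1)  # soft skip weight
        # When g is near 0 the expensive branch contributes little; at inference
        # one could threshold g and skip it entirely, keeping only the cheap path.
        return g * self.full(feat) + (1.0 - g) * self.cheap(feat)

if __name__ == "__main__":
    block = SoftSkipBlock(dim=64, lang_dim=300)
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 300))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```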

Unbiased Directed Object Attention Graph for Object Navigation

  • Ronghao Dang
  • Zhuofan Shi
  • Liuyi Wang
  • Zongtao He
  • Chengju Liu
  • Qijun Chen

Object navigation tasks require agents to locate specific objects in unknown environments based on visual information. Previously, graph convolutions were used to implicitly explore the relationships between objects. However, due to differences in visibility among objects, it is easy to generate biases in object attention. Thus, in this paper, we propose a directed object attention (DOA) graph to guide the agent in explicitly learning the attention relationships between objects, thereby reducing the object attention bias. In particular, we use the DOA graph to perform unbiased adaptive object attention (UAOA) on the object features and unbiased adaptive image attention (UAIA) on the raw images, respectively. To distinguish features in different branches, a concise adaptive branch energy distribution (ABED) method is proposed. We assess our methods on the AI2-Thor dataset. Compared with the state-of-the-art (SOTA) method, our method achieves increases of 7.4%, 8.1% and 17.6% in success rate (SR), success weighted by path length (SPL) and success weighted by action efficiency (SAE), respectively.

FastPR: One-stage Semantic Person Retrieval via Self-supervised Learning

  • Meng Sun
  • Ju Ren
  • Xin Wang
  • Wenwu Zhu
  • Yaoxue Zhang

Semantic person retrieval aims to locate a specific person in an image with a query of semantic descriptions, which has shown great significance in surveillance and security applications. Prior works commonly adopt a two-stage method that first extracts the persons with a pretrained detector and then finds the target that optimally matches the descriptions. However, existing works suffer from high computational complexity and a low recall rate caused by error accumulation in the two-stage inference. To solve these problems, we propose FastPR, a one-stage semantic person retrieval method via self-supervised learning, to optimize person localization and semantic retrieval simultaneously. Specifically, we propose a dynamic visual-semantic alignment mechanism which utilizes grid-based attention to fuse the cross-modal features and employs a label prediction proxy task to constrain the attention process. To tackle the challenges that real-world surveillance images may suffer from low resolution and occlusion, and that the target persons may be within a crowd, we further propose a dual-granularity person localization module: an upsampling reconstruction proxy task enhances the local features of the target person in the fused features, followed by a tailored offset prediction proxy task that makes the localization network capable of accurately identifying and distinguishing the target person in a crowd. Experimental results demonstrate that FastPR achieves the best retrieval accuracy compared to state-of-the-art baseline methods, with over 15 times reduction in inference time.

Towards Counterfactual Image Manipulation via CLIP

  • Yingchen Yu
  • Fangneng Zhan
  • Rongliang Wu
  • Jiahui Zhang
  • Shijian Lu
  • Miaomiao Cui
  • Xuansong Xie
  • Xian-Sheng Hua
  • Chunyan Miao

Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes, such as the age and gender of facial images. An intriguing yet challenging problem arises: can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts. Different from in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of the semantic knowledge encapsulated in CLIP as well as more delicate handling of editing directions to avoid getting stuck in local minima or undesired editing. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps CLIP embeddings (of target text) to the latent space and fuses them with latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing driven by target texts with various counterfactual concepts.
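
A common building block for this kind of text-driven editing is a direction-alignment objective in CLIP space, where the shift between the original and edited image embeddings is encouraged to follow a text-derived direction. The sketch below operates on precomputed CLIP embeddings and is only a generic illustration of such a directional objective, not the specific contrastive loss proposed in the paper.

```python
import torch
import torch.nn.functional as F

def directional_loss(img_src, img_edit, txt_src, txt_tgt):
    """Encourage the edit direction in CLIP image space to align with the
    direction from the source text embedding to the target text embedding.

    All inputs are precomputed CLIP embeddings of shape (B, D)."""
    d_img = F.normalize(img_edit - img_src, dim=-1)
    d_txt = F.normalize(txt_tgt - txt_src, dim=-1)
    return (1.0 - F.cosine_similarity(d_img, d_txt, dim=-1)).mean()

if __name__ == "__main__":
    B, D = 4, 512
    loss = directional_loss(torch.randn(B, D), torch.randn(B, D),
                            torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```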

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

  • Jiaqing Fan
  • Tiankang Su
  • Kaihua Zhang
  • Qingshan Liu

Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which needs an effective feature propagation paradigm for both appearance and motion features that can fully interchange information across frames. However, existing solutions mainly focus on forward feature propagation from the preceding frame to the current one, either using the former segmentation mask or propagating motion in a frame-by-frame manner. This ignores the bi-directional temporal feature interactions (including backward propagation from future frames to the current frame) across all frames that can help to enhance the spatio-temporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) to fully integrate the forward and backward propagations across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate the features across all frames in a forward and backward manner. This fully captures the multi-level spatio-temporal contextual information across all frames, producing an effective feature representation with a strong discriminative capability to distinguish objects from noisy backgrounds. Following it, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features, which can effectively capture the spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearance-motion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axes, making the segmentation model selectively focus on the most relevant neighbors. Extensive evaluations on four mainstream challenging benchmarks, including DAVIS16, FBMS, DAVSOD, and MCL, demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.

Weakly Supervised Video Salient Object Detection via Point Supervision

  • Shuyong Gao
  • Haozhe Xing
  • Wei Zhang
  • Yan Wang
  • Qianyu Guo
  • Wenqiang Zhang

Fully supervised video salient object detection models have achieved excellent performance, yet obtaining pixel-by-pixel annotated datasets is laborious. Several works attempt to use scribble annotations to mitigate this problem, but point supervision, an even more labor-saving annotation method (arguably the most labor-saving among manual annotation methods for dense prediction), has not been explored. In this paper, we propose a strong baseline model based on point supervision. To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives, respectively. Specifically, we propose a hybrid token attention module, which mixes optical flow and image information from orthogonal directions, adaptively highlighting critical optical flow information (channel dimension) and critical token information (spatial dimension). To exploit long-term cues, we develop the Long-term Cross-Frame Attention module (LCFA), which assists the current frame in inferring salient objects based on multi-frame tokens. Furthermore, we label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets. Experiments on six benchmark datasets illustrate that our method outperforms previous state-of-the-art weakly supervised methods and is even comparable with some fully supervised approaches. Our source code and datasets are available at: https://github.com/shuyonggao/PVSOD.

Look Less Think More: Rethinking Compositional Action Recognition

  • Rui Yan
  • Peng Huang
  • Xiangbo Shu
  • Junhao Zhang
  • Yonghua Pan
  • Jinhui Tang

Compositional action recognition, which aims to identify unseen combinations of actions and objects, has recently attracted wide attention. Conventional methods bring in additional cues (e.g., dynamic motions of objects) to alleviate the inductive bias between the visual appearance of objects and the human action-level labels. Besides, compared with non-compositional settings, previous methods only pursue higher performance in compositional settings, which cannot demonstrate their generalization ability. To this end, we first rethink the problem and design a more generalized metric (namely, the Drop Ratio) and a more practical setting to evaluate the compositional generalization of existing action recognition algorithms. Beyond that, we propose a simple yet effective framework, Look Less Think More (LLTM), to reduce the strong association between visual objects and action-level labels (Look Less) and then discover the commonsense relationships between object categories and human actions (Think More). We test the rationality of the proposed Drop Ratio and Practical setting by comparing several popular action recognition methods on SSV2. Besides, the proposed LLTM achieves state-of-the-art performance on SSV2 under different settings.

Continual Multi-view Clustering

  • Xinhang Wan
  • Jiyuan Liu
  • Weixuan Liang
  • Xinwang Liu
  • Yi Wen
  • En Zhu

With the increase of multimedia applications, data are often collected from multiple sensors or modalities, encouraging the rapid development of multi-view (also called multi-modal) clustering techniques. As a representative, late fusion multi-view clustering has attracted extensive attention due to its low computational complexity yet promising performance. However, most existing methods deal with the clustering problem in which all data views are available in advance, and overlook scenarios where data observations of new views are accumulated over time. To solve this issue, we propose a continual approach on the basis of the late fusion multi-view clustering framework. Specifically, it only needs to maintain a consensus partition matrix and update knowledge with the incoming partition of a new data view, rather than keep all of them. This prevents previously learned knowledge from being recomputed over and over again, saving a large amount of computational resources/time and labor. Furthermore, we design an alternate and convergent strategy to solve the resultant optimization problem. The proposed algorithm also shows excellent clustering performance and time/space efficiency in the experiments.

Efficient Anchor Learning-based Multi-view Clustering -- A Late Fusion Method

  • Tiejian Zhang
  • Xinwang Liu
  • En Zhu
  • Sihang Zhou
  • Zhibin Dong

Anchor-enhanced multi-view late fusion clustering has attracted numerous researchers' attention for its high clustering accuracy and promising efficiency. However, in existing methods, the anchor points are usually generated by sampling or linearly combining the samples within the datasets, which can result in enormous time consumption and limited representation capability. To solve this problem, in our method we learn the view-specific anchor points directly. Specifically, we first reconstruct the partition matrix of each view by multiplying a view-specific anchor matrix by a consensus reconstruction matrix. Then, by maximizing the weighted alignment between the base partition matrix and its estimated version in each view, we learn the optimal anchor points for each view. In particular, unlike previous late fusion algorithms, which define anchor points as linear combinations of existing samples, we define anchor points as a series of orthogonal vectors that are directly learned through optimization, which expands the learning space of the anchor points. Moreover, based on the above design, the resultant algorithm has only linear complexity and no hyper-parameters. Experiments on 12 benchmark kernel datasets and 5 large-scale datasets illustrate that the proposed Efficient Anchor Learning-based Multi-view Clustering (AL-MVC) algorithm achieves state-of-the-art performance in both clustering quality and efficiency.

Cross-modal Knowledge Graph Contrastive Learning for Machine Learning Method Recommendation

  • Xianshuai Cao
  • Yuliang Shi
  • Jihu Wang
  • Han Yu
  • Xinjun Wang
  • Zhongmin Yan

The explosive growth of machine learning (ML) methods is overloading users with choices for learning tasks. Method recommendation aims to alleviate this problem by selecting the most appropriate ML methods for given learning tasks. Recent research shows that the descriptive and structural information of knowledge graphs (KGs) can significantly enhance the performance of ML method recommendation. However, existing studies have not fully explored the descriptive information in KGs, nor have they effectively exploited the descriptive and structural information to provide the necessary supervision. To address these limitations, we distinguish descriptive attributes from the traditional relationships in KGs, with the rest serving as structural connections, to expand the scope of KG descriptive information. Based on this insight, we propose the Cross-modal Knowledge Graph Contrastive learning (CKGC) approach, which regards information from descriptive attributes and structural connections as two modalities, learning informative node representations by maximizing the agreement between the descriptive view and the structural view. Through extensive experiments, we demonstrate that CKGC significantly outperforms state-of-the-art baselines, achieving around 2% more accurate click-through-rate (CTR) prediction, over 30% more accurate top-10 recommendation, and over 50% more accurate top-20 recommendation compared to the best-performing existing approach.
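
The agreement maximization between the descriptive and structural views can be illustrated with a standard InfoNCE-style contrastive objective over node embeddings from the two views, as in the following sketch. The shapes, temperature, and symmetrized form are assumptions for illustration rather than the exact CKGC objective.

```python
import torch
import torch.nn.functional as F

def cross_view_info_nce(desc_emb, struct_emb, temperature=0.2):
    """desc_emb, struct_emb: (N, D) embeddings of the same N nodes from the
    descriptive-attribute view and the structural-connection view.
    The matching node in the other view is the positive; all others are negatives."""
    z1 = F.normalize(desc_emb, dim=-1)
    z2 = F.normalize(struct_emb, dim=-1)
    logits = z1 @ z2.t() / temperature          # (N, N) cross-view similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over the two directions (descriptive->structural and back).
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    print(cross_view_info_nce(torch.randn(16, 128), torch.randn(16, 128)).item())
```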

Multigranular Visual-Semantic Embedding for Cloth-Changing Person Re-identification

  • Zan Gao
  • Hongwei Wei
  • Weili Guan
  • Weizhi Nie
  • Meng Liu
  • Meng Wang

To date, only a few works have focused on the cloth-changing person re-identification (ReID) task, and since it is very difficult to extract generalized and robust features for representing people with different clothes, their performance needs to be improved. Moreover, visual-semantic information is also often ignored. To solve these issues, in this work a novel multigranular visual-semantic embedding algorithm (MVSE) is proposed for cloth-changing person ReID, where visual semantic information and human attributes are embedded into the network, and the generalized features of human appearance can be well learned to effectively solve the problem of clothing changes. Specifically, to fully represent a person with clothing changes, a multigranular feature representation scheme (MGR) is employed to adaptively extract multilevel and multigranular feature information, and a cloth desensitization network (CDN) is then designed to improve feature robustness for persons with different clothes, where different high-level human attributes are fully utilized. Moreover, to further address pose changes and occlusion under different camera perspectives, a partially semantically aligned network (PSA) is proposed to obtain the visual-semantic information used to align the human attributes. Most importantly, these three modules are jointly explored in a unified framework. Extensive experimental results on four cloth-changing person ReID datasets demonstrate that the MVSE algorithm can extract highly robust feature representations of cloth-changing persons and outperforms state-of-the-art cloth-changing person ReID approaches.

Adaptive Structural Similarity Preserving for Unsupervised Cross Modal Hashing

  • Liang Li
  • Baihua Zheng
  • Weiwei Sun

Cross-modal hashing is an important approach for multimodal data management and application. Existing unsupervised cross-modal hashing algorithms mainly rely on data features in pre-trained models to mine their similarity relationships. However, their optimization objectives are based on the static metric between the original uni-modal features, without further exploring data correlations during the training. In addition, most of them mainly focus on association mining and alignment among pairwise instances in continuous space but ignore the latent structural correlations contained in the semantic hashing space. In this paper, we propose an unsupervised hash learning framework ASSPH to solve the above problems. Firstly, we propose an adaptive learning scheme, with limited data and training batches, to enrich semantic correlations of unlabeled instances during the training process and meanwhile to ensure a smooth convergence of the training process. Secondly, we present an asymmetric structural semantic representation learning scheme. We introduce structural semantic metrics based on graph adjacency relations and meanwhile align the inter- and intra-modal semantics in the hash space with an asymmetric binary optimization process. Finally, we conduct extensive experiments to validate the enhancements of our work in comparison with existing works.

CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation

  • Hao Sun
  • Hongyi Wang
  • Jiaqing Liu
  • Yen-Wei Chen
  • Lanfen Lin

Multimodal sentiment analysis and depression estimation are two important research topics that aim to predict human mental states using multimodal data. Previous research has focused on developing effective fusion strategies for exchanging and integrating mind-related information from different modalities. Some MLP-based techniques have recently achieved considerable success in a variety of computer vision tasks. Inspired by this, we explore multimodal approaches with a feature-mixing perspective in this study. To this end, we introduce CubeMLP, a multimodal feature processing framework based entirely on MLP. CubeMLP consists of three independent MLP units, each of which has two affine transformations. CubeMLP accepts all relevant modality features as input and mixes them across three axes. After extracting the characteristics using CubeMLP, the mixed multimodal features are flattened for task predictions. Our experiments are conducted on sentiment analysis datasets: CMU-MOSI and CMU-MOSEI, and depression estimation dataset: AVEC2019. The results show that CubeMLP can achieve state-of-the-art performance with a much lower computing cost.
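
As a rough illustration of mixing multimodal features across three axes with MLPs, the sketch below applies a unit of two affine transformations along the sequence, modality, and channel axes of a (batch, time, modality, channel) tensor. The dimensions, residual connections, and module names are assumptions for illustration, not the released CubeMLP code.

```python
import torch
import torch.nn as nn

class AxisMLP(nn.Module):
    """Two affine transformations applied along one axis of the input."""
    def __init__(self, size, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(size, hidden), nn.GELU(),
                                 nn.Linear(hidden, size))

    def forward(self, x, axis):
        x = x.transpose(axis, -1)      # move the target axis to the last position
        x = self.net(x)
        return x.transpose(axis, -1)

class CubeMixer(nn.Module):
    """Mix a (B, T, M, C) multimodal tensor along time, modality and channel."""
    def __init__(self, seq_len, n_modal, channels, hidden=64):
        super().__init__()
        self.time_mlp = AxisMLP(seq_len, hidden)
        self.modal_mlp = AxisMLP(n_modal, hidden)
        self.chan_mlp = AxisMLP(channels, hidden)

    def forward(self, x):
        x = x + self.time_mlp(x, axis=1)    # mix across time steps
        x = x + self.modal_mlp(x, axis=2)   # mix across modalities
        x = x + self.chan_mlp(x, axis=3)    # mix across channels
        return x

if __name__ == "__main__":
    mixer = CubeMixer(seq_len=50, n_modal=3, channels=32)
    out = mixer(torch.randn(4, 50, 3, 32))
    print(out.shape)  # torch.Size([4, 50, 3, 32])
```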

Generalized Global Ranking-Aware Neural Architecture Ranker for Efficient Image Classifier Search

  • Bicheng Guo
  • Tao Chen
  • Shibo He
  • Haoyu Liu
  • Lilin Xu
  • Peng Ye
  • Jiming Chen

Neural Architecture Search (NAS) is a powerful tool for automating the design of effective image processing DNNs. Ranking has been advocated as the basis for designing efficient performance predictors for NAS. Previous contrastive methods solve the ranking problem by comparing pairs of architectures and predicting their relative performance. However, they only focus on the ranking between the two involved architectures and neglect the overall quality distribution of the search space, which may lead to generalization issues. A predictor, namely the Neural Architecture Ranker (NAR), which concentrates on the global quality tier of a specific architecture, is proposed to tackle such problems caused by the local perspective. The NAR explores the quality tiers of the search space globally and classifies each architecture into the tier it belongs to according to its global ranking. Thus, the predictor gains knowledge of the performance distribution of the search space, which helps to generalize its ranking ability to the datasets more easily. Meanwhile, the global quality distribution facilitates the search phase by directly sampling candidates according to the statistics of quality tiers, which is free of training a search algorithm, e.g., Reinforcement Learning (RL) or an Evolutionary Algorithm (EA), thus simplifying the NAS pipeline and saving computational overhead. The proposed NAR achieves better performance than state-of-the-art methods on two widely used datasets for NAS research. On the vast search space of NAS-Bench-101, the NAR easily finds the architecture with top 0.01 performance only by sampling. It also generalizes well to different image datasets of NAS-Bench-201, i.e., CIFAR-10, CIFAR-100, and ImageNet-16-120, by identifying the optimal architectures for each of them.
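
The tier-based sampling idea — drawing candidate architectures according to the statistics of predicted quality tiers instead of training a search algorithm — can be sketched as follows in NumPy. The exponential tier weighting and the architecture encodings are placeholders, not the statistics used by NAR.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidates(architectures, tier_of, n_samples=10):
    """Sample candidate architectures, biased toward the top quality tiers.

    architectures: list of architecture encodings (placeholders here).
    tier_of: array of predicted tier indices, 0 = best tier."""
    tiers = np.asarray(tier_of)
    # Better predicted tiers get exponentially larger weight; this is an
    # assumed weighting, not the statistics used in the paper.
    weights = np.exp(-tiers.astype(float))
    probs = weights / weights.sum()
    idx = rng.choice(len(architectures), size=n_samples, replace=False, p=probs)
    return [architectures[i] for i in idx]

if __name__ == "__main__":
    archs = [f"arch_{i}" for i in range(100)]
    tiers = rng.integers(0, 5, size=100)
    print(sample_candidates(archs, tiers, n_samples=5))
```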

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

  • Jinxiang Liu
  • Chen Ju
  • Weidi Xie
  • Ya Zhang

We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables learning useful representations, we systematically investigate the effects of data augmentations and reveal that (1) the composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performance, even competitive with the supervised approach in audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalize to further applications. The project page is https://jinxiang-liu.github.io/SSL-TIE.
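
The transformation-equivariance constraint — the predicted localization map should undergo the same geometric transformation as the input frames — can be illustrated with a simple consistency loss using horizontal flipping, as below. The localizer signature and the choice of transformation are assumptions; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(localizer, frames, audio):
    """localizer(frames, audio) -> (B, 1, H, W) sound-source heatmap.
    Horizontal flip is used as the geometric transformation T: the heatmap of
    the flipped frame should equal the flipped heatmap of the original frame."""
    heat = localizer(frames, audio)
    heat_from_flipped = localizer(torch.flip(frames, dims=[-1]), audio)
    return F.mse_loss(heat_from_flipped, torch.flip(heat, dims=[-1]))

if __name__ == "__main__":
    # Placeholder localizer: any callable with the assumed signature works.
    toy = lambda v, a: torch.sigmoid(v.mean(dim=1, keepdim=True))
    frames = torch.randn(2, 3, 224, 224)
    audio = torch.randn(2, 128)
    print(equivariance_loss(toy, frames, audio).item())
```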

Unsupervised Video Hashing with Multi-granularity Contextualization and Multi-structure Preservation

  • Yanbin Hao
  • Jingru Duan
  • Hao Zhang
  • Bin Zhu
  • Pengyuan Zhou
  • Xiangnan He

Unsupervised video hashing typically aims to learn a compact binary vector to represent complex video content without using manual annotations. Existing unsupervised hashing methods generally suffer from incomplete exploration of various perspective dependencies (e.g., long-range and short-range) and of the data structures that exist in visual content, resulting in less discriminative hash codes. In this paper, we propose a Multi-granularity Contextualized and Multi-Structure preserved Hashing (MCMSH) method, exploring multiple axial contexts for discriminative video representation generation and various structural information for unsupervised learning simultaneously. Specifically, we delicately design three self-gating modules to separately model three granularities of dependencies (i.e., long/middle/short-range dependencies) and densely integrate them into MLP-Mixer for feature contextualization, leading to a novel model, MC-MLP. To facilitate unsupervised learning, we investigate three kinds of data structures, including clusters, local neighborhood similarity structure, and inter/intra-class variations, and design a multi-objective task to train MC-MLP. These data structures show high complementarity in hash code learning. We conduct extensive experiments using three video retrieval benchmark datasets, demonstrating that our MCMSH not only boosts the performance of the backbone MLP-Mixer significantly but also outperforms competing methods notably. Code is available at: https://github.com/haoyanbin918/MCMSH.

DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gestures Synthesis

  • Haiyang Liu
  • Naoya Iwamoto
  • Zihao Zhu
  • Zhengqing Li
  • You Zhou
  • Elif Bozkurt
  • Bo Zheng

Current co-speech gestures synthesis methods struggle with generating diverse motions and typically collapse to single or few frequent motion sequences, which are trained on original data distribution with customized models and strategies. We tackle this problem by temporally clustering motion sequences into content and rhythm segments and then training on content-balanced data distribution. In particular, by clustering motion sequences, we have observed for each rhythm pattern, some motions appear frequently, while others appear less. This imbalance results in the difficulty of generating low frequent occurrence motions and it cannot be easily solved by resampling, due to the inherent many-to-many mapping between content and rhythm. Therefore, we present DisCo, which disentangles motion into implicit content and rhythm features by contrastive loss for adopting different data balance strategies. Besides, to model the inherent mapping between content and rhythm features, we design a diversity-and-inclusion network (DIN), which firstly generates content features candidates and then selects one candidate by learned voting. Experiments on two public datasets, Trinity and S2G-Ellen, justify that DisCo generates more realistic and diverse motions than state-of-the-art methods. Code and data are available at https://pantomatrix.github.io/DisCo/

Adaptively-weighted Integral Space for Fast Multiview Clustering

  • Man-Sheng Chen
  • Tuo Liu
  • Chang-Dong Wang
  • Dong Huang
  • Jian-Huang Lai

Multiview clustering has been extensively studied to take advantage of multi-source information to improve the clustering performance. In general, most of the existing works typically compute an n × n affinity graph by some similarity/distance metrics (e.g. the Euclidean distance) or learned representations, and explore the pairwise correlations across views. But unfortunately, a quadratic or even cubic complexity is often needed, bringing about difficulty in clustering large-scale datasets. Some efforts have been made recently to capture data distribution in multiple views by selecting view-wise anchor representations with k-means, or by direct matrix factorization on the original observations. Despite the significant success, few of them have considered the view-insufficiency issue, implicitly holding the assumption that each individual view is sufficient to recover the cluster structure. Moreover, the latent integral space as well as the shared cluster structure from multiple insufficient views is not able to be simultaneously discovered. In view of this, we propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) method with nearly linear complexity. Specifically, view generation models are designed to reconstruct the view observations from the latent integral space with diverse adaptive contributions. Meanwhile, a centroid representation with orthogonality constraint and a cluster partition are seamlessly constructed to approximate the latent integral space. An alternate minimizing algorithm is developed to solve the optimization problem, which is proved to have linear time complexity w.r.t. the sample size. Extensive experiments conducted on several real-world datasets confirm the superiority of the proposed AIMC method compared with the state-of-the-art methods.

Towards All Weather and Unobstructed Multi-Spectral Image Stitching: Algorithm and Benchmark

  • Zhiying Jiang
  • Zengxi Zhang
  • Xin Fan
  • Risheng Liu

Image stitching is a fundamental task that requires multiple images from different viewpoints to generate a wide field-of-view (FOV) scene. Previous methods are developed on RGB images. However, severe weather and harsh conditions, such as rain, fog, low light, strong light, etc., may introduce evident interference in visible images, leading to distortion and misalignment in the stitched results. To remedy the deficient imaging of optical sensors, we investigate the complementarity between infrared and visible images to improve the perception of scenes in terms of visual information and viewing ranges. Instead of a cascaded fusion-stitching process, where the inaccuracy accumulation caused by image fusion hinders stitching performance, especially through content loss and ghosting effects, we develop a learnable feature adaptive network to investigate a stitch-oriented feature representation and perform information complementation at the feature level. By introducing a pyramidal structure along with global fast correlation regression, the quadrature attention based correspondence is more responsible for feature alignment, and the estimation of sparse offsets can be realized in a coarse-to-fine manner. Furthermore, we propose the first infrared and visible image based multi-spectral image stitching dataset, covering a comprehensive range of scenarios and diverse viewing baselines. Extensive experiments on real-world data demonstrate that our method reconstructs wide-FOV images with more credible structure and complementary information compared with state-of-the-art methods.

A Parameter-free Multi-view Information Bottleneck Clustering Method by Cross-view Weighting

  • Shizhe Hu
  • Ruilin Geng
  • Zhaoxu Cheng
  • Chaoyang Zhang
  • Guoliang Zou
  • Zhengzheng Lou
  • Yangdong Ye

With the fast-growing multi-modal/media data in the Big Data era, multi-view clustering (MVC) has attracted much attention lately. Most MVC methods focus on integrating and utilizing the complementary information among views via a linear sum of the learned view weights and have shown great success in some fields. However, they fail to quantify how much of the complementary information across views is actually utilized to benefit the final clustering. Additionally, most of them contain at least one regularization parameter set without prior knowledge, which puts pressure on parameter tuning and thus makes them impractical. In this paper, we propose a novel parameter-free multi-view information bottleneck (PMIB) clustering method to automatically identify and exploit useful complementary information among views, thus reducing the negative impact from harmful views. Specifically, we first discover the informative view by measuring, with mutual information, the relevant information preserved between the original data and the compact clusters. Then, a new cross-view weight learning scheme is designed to learn how complementary the informative view and the remaining views are. Finally, the quantitative correlations among views are fully exploited to improve the clustering performance without needing any additional parameters or prior knowledge. Experimental results on different kinds of multi-view datasets show the effectiveness of the proposed method.

HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

  • Mengze Li
  • Tianbao Wang
  • Haoyu Zhang
  • Shengyu Zhang
  • Zhou Zhao
  • Wenqiao Zhang
  • Jiaxu Miao
  • Shiliang Pu
  • Fei Wu

Video Object Grounding (VOG) is the problem of associating spatial object regions in a video with a descriptive natural language query. This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby localizing the specific objects accurately. In this paper, we tackle this task with a novel framework called HiErarchical spatio-tempoRal reasOning (HERO) with contrastive action correspondence. We study the VOG task from two aspects that prior works overlooked: (1) Contrastive Action Correspondence-aware Retrieval. Noticing that the fine-grained video semantics (e.g., multiple actions) is not totally aligned with the annotated language query (e.g., a single action), we first introduce weakly-supervised contrastive learning that classifies video frames as action-consistent or action-independent according to the video-caption action semantic correspondence. Such a design builds the fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement. While transformer-based VOG models show their potential in modeling sequential modalities (i.e., video and caption), existing evidence also indicates that the transformer suffers from insensitivity to spatio-temporal locality. Motivated by that, we carefully design hierarchical reasoning layers to decouple fully connected multi-head attention and remove the redundant interfering correlations. Furthermore, our proposed pyramid and shifted alignment mechanisms effectively improve the cross-modal information utilization of neighboring spatial regions and temporal frames. We conducted extensive experiments showing that HERO outperforms existing techniques with significant improvements on two benchmark datasets.

MAVT-FG: Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition

  • Xiaoyu Zhou
  • Xiaotong Song
  • Hao Wu
  • Jingran Zhang
  • Xing Xu

Weakly-supervised fine-grained recognition aims to detect potential differences between subcategories at a more detailed scale without using any manual annotations. While most recent works focus on classical image-based fine-grained recognition that recognizes subcategories at image-level, video-based fine-grained recognition is much more challenging and specifically needed. In this paper, we propose a Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition (MAVT-FG) model which incorporates audio-visual modalities. Specifically, MAVT-FG consists of Audio-Visual Dual-Encoder for feature extraction, Cross-Decoder for Audio-Visual Fusion (DAVF) to exploit inherent cues and correspondences between two modalities, and Search-and-Select Fine-grained Branch (SSFG) to capture the most discriminative regions. Furthermore, we construct a new benchmark: Fine-grained Birds of Audio-Visual (FGB-AV) for audio-visual weakly-supervised fine-grained recognition at video-level. Experimental results show that our method achieves superior performance and outperforms other state-of-the-art methods.

Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization

  • Haichao Shi
  • Xiao-Yu Zhang
  • Changsheng Li
  • Lixing Gong
  • Yong Li
  • Yongjun Bao

Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.
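
To illustrate how a triplet adjacency matrix of this kind might be assembled, the sketch below combines three cues — mutual importance from per-segment attention scores, cosine feature similarity, and similarity of higher-level class posteriors — into a single row-normalized adjacency over pseudo action segments. The inputs and the equal-weight combination are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def triplet_adjacency(seg_feat, seg_logits, attn_score):
    """seg_feat:   (N, D) features of N pseudo action segments.
    seg_logits: (N, C) higher-level (contextual) class logits per segment.
    attn_score: (N,)   attention (importance) score per segment.
    Returns an (N, N) adjacency combining three cues."""
    # Mutual importance: segments that are both important get a strong edge.
    imp = attn_score.unsqueeze(0) * attn_score.unsqueeze(1)
    # Feature similarity in embedding space.
    f = F.normalize(seg_feat, dim=-1)
    feat_sim = f @ f.t()
    # High-level contextual similarity from class posteriors.
    p = F.softmax(seg_logits, dim=-1)
    ctx_sim = p @ p.t()
    adj = (imp + feat_sim + ctx_sim) / 3.0   # assumed equal weighting
    return F.softmax(adj, dim=-1)            # row-normalize for message passing

if __name__ == "__main__":
    N, D, C = 6, 128, 20
    A = triplet_adjacency(torch.randn(N, D), torch.randn(N, C), torch.rand(N))
    print(A.shape, A.sum(dim=-1))   # rows sum to 1
```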

Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation

  • Miaoyu Li
  • Yachao Zhang
  • Yuan Xie
  • Zuodong Gao
  • Cuihua Li
  • Zhizhong Zhang
  • Yanyun Qu

With the emergence of multi-modal datasets where LiDAR and camera are synchronized and calibrated, cross-modal Unsupervised Domain Adaptation (UDA) has attracted increasing attention because it reduces the laborious annotation of target domain samples. To alleviate the distribution gap between source and target domains, existing methods conduct feature alignment using adversarial learning. However, adversarial learning is well known to be highly sensitive to hyperparameters and difficult to train. In this paper, we propose a novel model (Dual-Cross) that integrates Cross-Domain Knowledge Distillation (CDKD) and Cross-Modal Knowledge Distillation (CMKD) to mitigate domain shift. Specifically, we design multi-modal style transfer to convert source images and point clouds to the target style. With these synthetic samples as input, we introduce a target-aware teacher network to learn knowledge of the target domain. Then we perform dual-cross knowledge distillation while the student is learning on the source domain. CDKD constrains teacher and student predictions under the same modality to be consistent. It can transfer target-aware knowledge from the teacher to the student, making the student more adaptive to the target domain. CMKD generates a hybrid-modal prediction from the teacher predictions and constrains it to be consistent with both the 2D and 3D student predictions. It promotes information interaction between the two modalities so that they complement each other. Evaluation results on various domain adaptation settings show that Dual-Cross significantly outperforms both uni-modal and cross-modal state-of-the-art methods.
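
The two distillation constraints — same-modality consistency across domains (CDKD) and a hybrid-modal teacher target for both 2D and 3D students (CMKD) — can be sketched with standard temperature-softened KL-divergence terms. The temperature, the simple averaging used to form the hybrid prediction, and the names are illustrative assumptions rather than the Dual-Cross implementation.

```python
import torch
import torch.nn.functional as F

def kd_kl(student_logits, teacher_logits, tau=2.0):
    """Standard KD loss: KL(teacher || student) on temperature-softened outputs."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

def dual_cross_losses(stu_2d, stu_3d, tea_2d, tea_3d):
    """stu_*/tea_*: (N, C) per-point class logits from student / teacher networks."""
    # Cross-Domain KD: each modality's student matches its own-modality teacher.
    cdkd = kd_kl(stu_2d, tea_2d.detach()) + kd_kl(stu_3d, tea_3d.detach())
    # Cross-Modal KD: a hybrid teacher prediction (here a simple average of the
    # two modalities) supervises both students.
    hybrid = 0.5 * (tea_2d + tea_3d)
    cmkd = kd_kl(stu_2d, hybrid.detach()) + kd_kl(stu_3d, hybrid.detach())
    return cdkd, cmkd

if __name__ == "__main__":
    N, C = 1024, 10
    logits = [torch.randn(N, C) for _ in range(4)]
    print([l.item() for l in dual_cross_losses(*logits)])
```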

AVA-AVD: Audio-visual Speaker Diarization in the Wild

  • Eric Zhongcong Xu
  • Zeyang Song
  • Satoshi Tsutsui
  • Chao Feng
  • Mang Ye
  • Mike Zheng Shou

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set produces significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing these challenges, we design the Audio-Visual Relation Network (AVR-Net), which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust when the ratio of off-screen speakers varies. Our data and code are publicly available at https://github.com/showlab/AVA-AVD.

Image-Signal Correlation Network for Textile Fiber Identification

  • Bo Peng
  • Liren He
  • Yining Qiu
  • Wu Dong
  • Mingmin Chi

Identifying fiber compositions is an important aspect of the textile industry. In recent decades, near-infrared (NIR) spectroscopy has shown its potential for the automatic detection of fiber components. However, for plant fibers such as cotton and linen, the chemical compositions are the same and thus the absorption spectra are very similar, leading to the problem of "different materials with the same spectrum, and the same material with different spectra"; it is difficult to capture effective features to distinguish these fibers using a single mode of NIR signals. To solve this problem, textile experts measure the cross-sectional or longitudinal characteristics of fibers under a microscope to determine fiber contents in a destructive way. In this paper, we construct the first NIR signal-microscope image textile fiber composition dataset (NIRITFC). Based on the NIRITFC dataset, we propose an image-signal correlation network (ISiC-Net) and design image-signal correlation perception and image-signal correlation attention modules, respectively, to effectively integrate the visual features (esp. local texture details of fibers) with the finer absorption spectrum information of the NIR signal to capture the deep abstract features of bimodal data for nondestructive textile fiber identification. To better learn the spectral characteristics of the fiber components, the endmember vectors of the corresponding fibers are generated by embedding encoding, and a reconstruction loss is designed to guide the model to reconstruct the NIR signals of the corresponding fiber components by a nonlinear mapping. The quantitative and qualitative results are significantly improved compared to both single-modal and bimodal approaches, indicating the great potential of combining microscopic images and NIR signals for textile fiber composition identification.

Relation-enhanced Negative Sampling for Multimodal Knowledge Graph Completion

  • Derong Xu
  • Tong Xu
  • Shiwei Wu
  • Jingbo Zhou
  • Enhong Chen

Knowledge Graph Completion (KGC), which aims to infer the missing parts of Knowledge Graphs (KGs), has long been treated as a crucial task to support downstream applications of KGs, especially for multimodal KGs (MKGs), which suffer from incomplete relations due to the insufficient accumulation of multimodal corpora. Though some research attention has been paid to the completion task of MKGs, there is still a lack of negative sampling strategies specially tailored to MKGs. Meanwhile, though effective negative sampling strategies have been widely regarded as a crucial solution for KGC to alleviate the vanishing gradient problem, we realize that there is a unique challenge for negative sampling in MKGs: how to model the effect of KG relations while learning the complementary semantics among multiple modalities as extra context. In this case, traditional negative sampling techniques which only consider structural knowledge may fail to deal with the multimodal KGC task. To that end, in this paper we propose a MultiModal Relation-enhanced Negative Sampling (MMRNS) framework for the multimodal KGC task. Especially, we design a novel knowledge-guided cross-modal attention (KCA) mechanism, which provides bi-directional attention for visual and textual features via integrating relation embeddings. Then, an effective contrastive semantic sampler is devised after consolidating the KCA mechanism with contrastive learning. In this way, a more similar representation of semantic features between positive samples, as well as a more diverse representation between negative samples under different relations, can be learned. Afterwards, a masked gumbel-softmax optimization mechanism is utilized to address the non-differentiability of the sampling process, which provides effective parameter optimization compared with traditional sampling strategies. Extensive experiments on three multimodal KGs demonstrate that our MMRNS framework significantly outperforms state-of-the-art baseline methods, which validates the effectiveness of relation guidance in the multimodal KGC task.
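
The masked gumbel-softmax step — keeping negative-entity selection differentiable while excluding candidates that would form true triples — can be sketched as follows. Only the Gumbel-softmax relaxation itself is standard; how the sampler scores and the mask are produced here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def masked_gumbel_sample(scores, invalid_mask, tau=0.5, hard=True):
    """scores:       (B, K) sampler scores over K candidate negative entities.
    invalid_mask: (B, K) boolean, True where a candidate must not be sampled
                  (e.g., it would form a true triple).
    Returns a (B, K) one-hot-like selection that stays differentiable w.r.t. scores."""
    logits = scores.masked_fill(invalid_mask, float("-inf"))
    # hard=True gives a one-hot sample in the forward pass while gradients flow
    # through the soft relaxation (straight-through estimator).
    return F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)

if __name__ == "__main__":
    B, K = 4, 100
    scores = torch.randn(B, K, requires_grad=True)
    mask = torch.zeros(B, K, dtype=torch.bool)
    mask[:, :10] = True                      # forbid the first 10 candidates
    sel = masked_gumbel_sample(scores, mask)
    print(sel.argmax(dim=-1))                # selected negative indices (all >= 10)
```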

Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution

  • Wuxuan Shi
  • Mang Ye
  • Bo Du

Color-guided depth super-resolution (DSR) is an encouraging paradigm that enhances a low-resolution (LR) depth map guided by an extra high-resolution (HR) RGB image of the same scene. Existing methods usually use interpolation to upscale the depth maps before feeding them into the network and transfer high-frequency information extracted from HR RGB images to guide the reconstruction of depth maps. However, the extracted high-frequency information often contains textures that are not present in depth maps owing to the cross-modality gap, and the noise is further aggravated by interpolation due to the resolution gap between the RGB and depth images. To tackle these challenges, we propose a novel Symmetric Uncertainty-aware Feature Transmission (SUFT) method for color-guided DSR. (1) For the resolution gap, SUFT builds an iterative up-and-down sampling pipeline, which makes depth features and RGB features spatially consistent while suppressing noise amplification and blurring, replacing the common interpolated pre-upsampling. (2) For the cross-modality gap, we propose a novel Symmetric Uncertainty scheme to remove parts of the RGB information that are harmful to the recovery of HR depth maps. Extensive experiments on benchmark datasets and challenging real-world settings suggest that our method achieves superior performance compared to state-of-the-art methods. Our code and models are available at https://github.com/ShiWuxuan/SUFT.
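The iterative up-and-down sampling pipeline mentioned above follows the general spirit of back-projection-style DSR networks. Below is a minimal PyTorch sketch of one such up-then-down block; the module name, channel handling, and the simple additive RGB fusion are assumptions for illustration, not the actual SUFT design.

```python
import torch
import torch.nn as nn

class UpDownBlock(nn.Module):
    """One up-then-down sampling step on LR depth features (illustrative only).

    Strided (de)convolutions replace interpolated pre-upsampling, so the
    network itself learns how to bridge the resolution gap.
    """
    def __init__(self, channels, scale=2):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=scale * 2,
                                     stride=scale, padding=scale // 2)
        self.down = nn.Conv2d(channels, channels, kernel_size=scale * 2,
                              stride=scale, padding=scale // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, lr_feat, rgb_feat_hr):
        hr_feat = self.act(self.up(lr_feat))      # lift depth features to the HR grid
        hr_feat = hr_feat + rgb_feat_hr           # fuse spatially aligned RGB guidance
        residual = self.act(self.down(hr_feat))   # project back to the LR grid
        return hr_feat, lr_feat + residual        # refined HR and LR features
```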

DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition

  • Jiawei Fan
  • Yu Zhao
  • Xie Yu
  • Lihua Ma
  • Junqi Liu
  • Fangqiu Yi
  • Boxun Li

An optimal representation should contain maximum task-relevant information and minimum task-irrelevant information, as revealed by the Information Bottleneck Principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context. However, these approaches still suffer from low generalization. In this paper, we propose a moderate, optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck Principle, for effective and generalized video representation without sacrificing any efficiency of the model. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which benefits the extraction of sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to temporal modeling as well as data augmentation, achieving general improvements on both model-based and data-based approaches; (2) DTR is effective across 7 different datasets, especially motion-centric datasets, i.e., SSv1/SSv2, on which DTR obtains 6%/3.8% absolute gains in top-1 accuracy.
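As a rough illustration of how a temporal regularizer on frame features might be instantiated (a generic sketch, not the paper's DR/TR formulation), one can penalize deviations of frame-to-frame feature differences from a target magnitude, so motion cues are neither collapsed nor blown up:

```python
import torch

def temporal_difference_regularizer(feats, target=0.1):
    """Illustrative temporal regularizer on clip features.

    feats:  (batch, frames, channels) per-frame features
    target: assumed desired magnitude of frame-to-frame change
    """
    diff = feats[:, 1:] - feats[:, :-1]     # (B, T-1, C) temporal differences
    diff_norm = diff.norm(dim=-1)           # magnitude of change per step
    return ((diff_norm - target) ** 2).mean()
```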

Self-Supervised Graph Neural Network for Multi-Source Domain Adaptation

  • Jin Yuan
  • Feng Hou
  • Yangzhou Du
  • Zhongchao Shi
  • Xin Geng
  • Jianping Fan
  • Yong Rui

Domain adaptation (DA) tackles scenarios where the test data does not fully follow the same distribution as the training data, and multi-source domain adaptation (MSDA) is very attractive for real-world applications. By learning from large-scale unlabeled samples, self-supervised learning has become a new trend in deep learning. It is worth noting that self-supervised learning and multi-source domain adaptation share a similar goal: both aim to leverage unlabeled data to learn more expressive representations. Unfortunately, traditional multi-task self-supervised learning faces two challenges: (1) the pretext task may not strongly relate to the downstream task, making it difficult to transfer useful knowledge from the pretext task to the target task; (2) when the pretext and downstream tasks share the same feature extractor and differ only in their prediction heads, inter-task information exchange and knowledge sharing are ineffective. To address these issues, we propose a novel Self-Supervised Graph Neural Network (SSG), where a graph neural network serves as the bridge to enable more effective inter-task information exchange and knowledge sharing. A more expressive representation is learned by adopting a mask-token strategy to mask some domain information. Our extensive experiments demonstrate that the proposed SSG method achieves state-of-the-art results on four multi-source domain adaptation datasets, showing its effectiveness from different aspects.

ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph

  • Ho Yin Au
  • Jie Chen
  • Junkun Jiang
  • Yike Guo

Generating dance that temporally and aesthetically matches the music is a challenging problem, as the following factors need to be considered. First, the aesthetic styles and messages conveyed by the motion and the music should be consistent. Second, the beats of the generated motion should be locally aligned to the musical features. Finally, basic choreomusical rules should be observed, and the generated motion should be diverse. To address these challenges, we propose ChoreoGraph, which choreographs high-quality dance motion for a given piece of music over a Dynamic Graph. A data-driven learning strategy is proposed to evaluate the aesthetic style and rhythmic connections between music and motion in a progressively learned cross-modality embedding space. The motion sequences are beat-aligned with the music segments and then incorporated as nodes of a Dynamic Motion Graph. Compatibility factors such as style and tempo consistency, motion context connection, action completeness, and transition smoothness are comprehensively evaluated to determine node transitions in the graph. We demonstrate that our repertoire-based framework can generate motions that are aesthetically consistent and robustly extensible in diversity. Both quantitative and qualitative experimental results show that our proposed model outperforms other baseline models.

Pixelwise Adaptive Discretization with Uncertainty Sampling for Depth Completion

  • Rui Peng
  • Tao Zhang
  • Bing Li
  • Yitong Wang

Image-guided depth completion is an extensively studied multi-modal task that takes sparse measurements and RGB images as input to recover dense depth maps. While the common practice is to regress the depth value over an unbounded range, some recent methods achieve breakthrough performance by discretizing the regression range into a number of discrete depth values, namely Depth Hypotheses, and casting scalar regression as distribution estimation. However, existing methods employ handcrafted or image-level adaptive discretization strategies, where the generated depth hypotheses are shared across pixels, which cannot adapt to all pixels and is inefficient. In this paper, we are the first to consider the differences between pixels and propose Pixelwise Adaptive Discretization to generate tailored depth hypotheses for each pixel. Meanwhile, we introduce Uncertainty Sampling to generate compact depth hypotheses for easy pixels and loose ones for hard pixels. This per-pixel divide-and-conquer strategy allows the discrete depth hypotheses to concentrate around the ground truth of each pixel as much as possible, which is the core of discretization methods. Extensive experiments on the outdoor KITTI and indoor NYU Depth V2 datasets show that our model, called PADNet, surpasses previous state-of-the-art methods even with limited parameters and computational cost.
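The switch from scalar regression to distribution estimation over depth hypotheses can be summarized by the generic expectation below; the function and shapes are illustrative assumptions, and the hypothesis generation itself (the pixelwise adaptive part of the paper) is not reproduced.

```python
import torch
import torch.nn.functional as F

def depth_from_hypotheses(prob_logits, hypotheses):
    """Distribution-based depth estimation (a generic sketch, not PADNet itself).

    prob_logits: (B, K, H, W) per-pixel scores over K depth hypotheses
    hypotheses:  (B, K, H, W) per-pixel candidate depth values
    Returns the expected depth under the per-pixel softmax distribution.
    """
    prob = F.softmax(prob_logits, dim=1)
    return (prob * hypotheses).sum(dim=1, keepdim=True)  # (B, 1, H, W)
```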

Robust Diversified Graph Contrastive Network for Incomplete Multi-view Clustering

  • Zhe Xue
  • Junping Du
  • Hai Zhu
  • Zhongchao Guan
  • Yunfei Long
  • Yu Zang
  • Meiyu Liang

Incomplete multi-view clustering is a challenging task which aims to partition unlabeled incomplete multi-view data into several clusters. Existing incomplete multi-view clustering methods neglect the diversified correlations inherent in the data and the noise contained in different views. To address these issues, we propose a Robust Diversified Graph Contrastive Network (RDGC) for incomplete multi-view clustering, which integrates multi-view representation learning and diversified graph contrastive regularization into a unified framework. A multi-view unified and specific encoding network is developed to fuse different views into a unified representation, which can flexibly estimate the importance of views for incomplete multi-view data. A robust diversified graph contrastive regularization is proposed, which captures diversified data correlations to improve the discriminating power of the learned representation and reduce the information loss caused by missing views. Moreover, our method can effectively resist the influence of noise and unreliable views by leveraging a robust contrastive learning loss. Extensive experiments conducted on four multi-view clustering datasets demonstrate the superiority of our method over state-of-the-art methods.

Calibrating Class Weights with Multi-Modal Information for Partial Video Domain Adaptation

  • Xiyu Wang
  • Yuecong Xu
  • Jianfei Yang
  • Kezhi Mao

Assuming the source label space subsumes the target one, Partial Video Domain Adaptation (PVDA) is a more general and practical scenario for cross-domain video classification problems. The key challenge of PVDA is to mitigate the negative transfer caused by source-only outlier classes. To tackle this challenge, a crucial step is to aggregate target predictions to assign class weights, up-weighing target classes and down-weighing outlier classes. However, incorrect class weight predictions can mislead the network and lead to negative transfer. Previous works improve class weight accuracy by utilizing temporal features and attention mechanisms, but these methods may fall short in generating accurate class weights when domain shifts are significant, as in most real-world scenarios. To deal with these challenges, we first propose the Multi-modality partial Adversarial Network (MAN), which utilizes multi-scale and multi-modal information to enhance PVDA performance. Based on MAN, we then propose the Multi-modality Cluster-calibrated partial Adversarial Network (MCAN). It utilizes a novel class weight calibration method to alleviate the negative transfer caused by incorrect class weights. Specifically, the calibration method identifies and weighs correct and incorrect predictions using distributional information implied by unsupervised clustering. Extensive experiments are conducted on prevailing PVDA benchmarks, and the proposed MCAN achieves significant improvements over state-of-the-art PVDA methods.
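The class-weighting step common to partial domain adaptation, aggregating target predictions to down-weigh source-only outlier classes, can be sketched as follows; this is the plain uncalibrated scheme with illustrative names, not MCAN's cluster-calibrated version.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_class_weights(model, target_loader, num_classes, device="cuda"):
    """Aggregate softmax predictions on unlabeled target clips.

    Classes present only in the (larger) source label space should receive
    small aggregated probability mass and hence small weights.
    target_loader is assumed to yield (clips, _) batches; labels are ignored.
    """
    weights = torch.zeros(num_classes, device=device)
    count = 0
    for clips, _ in target_loader:
        probs = F.softmax(model(clips.to(device)), dim=1)
        weights += probs.sum(dim=0)
        count += clips.size(0)
    weights /= count
    return weights / weights.max()  # normalize so the largest weight is 1
```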

Cyclical Fusion: Accurate 3D Reconstruction via Cyclical Monotonicity

  • Duo Chen
  • Zixin Tang
  • Yiguang Liu

Dense correspondence estimation is crucial to RGB-D reconstruction systems. However, projective correspondences are highly unreliable due to sensor depth and pose uncertainties. To tackle this challenge, we introduce a geometry-driven fusion framework, Cyclical Fusion. It pushes correspondence finding into 3D space instead of searching for candidates on the 2.5D projective map, and establishes precise correspondences in two phases, from coarse to fine. 1) First, each local surface (represented by a voxel) is characterized by a Gaussian distribution. The Karcher-Frechet barycenter is adapted to conduct a robust approximation of the covariance. The metric between distributions is then computed via the L2-Wasserstein distance, and the corresponding voxel is discovered through the nearest distribution-to-distribution match. 2) Our method utilizes an effective correspondence verification scheme derived from cyclical monotonicity, related to Rockafellar's theorem. The concept of cyclical monotonicity reveals the geometric nature of correspondences, and this substantial constraint prevents correspondences from twisting during the fusion process. Accordingly, precise point-to-point correspondences can be discovered. 3) The advection between correspondences is used to form a smooth manifold under regularization terms. Finally, Cyclical Fusion is integrated into a prototype reconstruction system (utilizing multiple streams: depth, pose, RGB, and infrared). Experimental results on different benchmarks and real-world scanning verify the superior performance of the proposed method. Cyclical Fusion accomplishes authentic reconstructions in cases where the original projective correspondence-based scheme fails (see Fig. 1). Our new techniques make the reconstruction applicable to multimedia content creation and many other applications.
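The L2-Wasserstein distance between two Gaussian voxel distributions has a well-known closed form, which a distribution-to-distribution nearest-neighbor search could build on. The sketch below implements only that standard formula (using NumPy/SciPy); it is not the paper's full matching pipeline.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, cov1, m2, cov2):
    """Squared L2-Wasserstein distance between two Gaussians (closed form).

    W2^2 = ||m1 - m2||^2 + Tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2})
    """
    sqrt_cov2 = sqrtm(cov2)
    cross = sqrtm(sqrt_cov2 @ cov1 @ sqrt_cov2)
    # Numerical sqrtm can return tiny imaginary parts; keep the real part.
    return float(np.sum((m1 - m2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * np.real(cross)))
```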

Keypoint-Guided Modality-Invariant Discriminative Learning for Visible-Infrared Person Re-identification

  • Tengfei Liang
  • Yi Jin
  • Wu Liu
  • Songhe Feng
  • Tao Wang
  • Yidong Li

The visible-infrared person re-identification (VI-ReID) task aims to retrieve images of pedestrians across cameras with different modalities. In this task, the major challenges arise from two aspects: intra-class variations among images of the same identity, and cross-modality discrepancies between visible and infrared images. Existing methods mainly focus on the latter, attempting to alleviate the impact of the modality discrepancy while ignoring the former issue of identity variations, and thus achieve limited discrimination. To address both aspects, we propose a Keypoint-guided Modality-invariant Discriminative Learning (KMDL) method, which can simultaneously adapt to intra-ID variations and bridge the cross-modality gap. By introducing human keypoints, our method further explores the image space, the feature space, and the loss constraints to solve the above issues. Specifically, considering the modality discrepancy in the original images, we first design a Hue Jitter Augmentation (HJA) strategy, introducing hue disturbance at the input stage to alleviate color dependence. To obtain discriminative fine-grained representations for retrieval, we design the Global-Keypoint Graph Module (GKGM) in the feature space, which directly extracts keypoint-aligned features and mines relationships within global and keypoint embeddings. Based on these semantic local embeddings, we further propose the Keypoint-Aware Center (KAC) loss, which effectively adjusts the feature distribution under the supervision of identity and keypoint labels to learn discriminative representations for matching. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate the effectiveness of our KMDL method.
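Hue disturbance of visible images can be realized with standard augmentation tooling; the snippet below is a minimal, illustrative pipeline in the spirit of HJA, with parameter values chosen arbitrarily rather than taken from the paper.

```python
from torchvision import transforms

# Randomly shifting hue weakens the model's reliance on color cues,
# which infrared images do not share.
hue_jitter = transforms.Compose([
    transforms.ColorJitter(hue=0.5),          # hue factor sampled in [-0.5, 0.5]
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```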

Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction

  • Gang Yang
  • Li Zhang
  • Man Zhou
  • Aiping Liu
  • Xun Chen
  • Zhiwei Xiong
  • Feng Wu

Magnetic resonance imaging (MRI) with high resolution (HR) provides more detailed information for accurate diagnosis and quantitative image analysis. Despite significant advances, most existing super-resolution (SR) reconstruction networks for medical images have two flaws: 1) They are designed as black boxes, lacking sufficient interpretability and thus limiting their practical applications. Interpretable neural network models are of significant interest since they enhance the trustworthiness required in clinical practice when dealing with medical images. 2) Most existing SR reconstruction approaches use only a single contrast or a simple multi-contrast fusion mechanism, neglecting the complex relationships between different contrasts that are critical for SR improvement. To deal with these issues, we propose a novel Model-Guided interpretable Deep Unfolding Network (MGDUN) for medical image SR reconstruction. The model-guided image SR reconstruction approach solves a manually designed objective function to reconstruct HR MRI. We show how to unfold the iterative MGDUN algorithm into a model-guided deep unfolding network by taking the MRI observation matrix and an explicit multi-contrast relationship matrix into account during end-to-end optimization. Extensive experiments on the multi-contrast IXI dataset and the BraTS 2019 dataset demonstrate the superiority of our proposed model.
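Deep unfolding generally turns each iteration of a model-based solver into a network stage. The sketch below shows one generic stage, alternating a data-consistency gradient step with a learned prior network; the operators `A`/`At` stand for an assumed observation model and its adjoint, and nothing here reproduces MGDUN's multi-contrast formulation.

```python
import torch
import torch.nn as nn

class UnfoldedSRStage(nn.Module):
    """One stage of a generic model-guided unfolding network (illustrative).

    Each stage mimics a gradient step on ||A x - y||^2 for an observation
    operator A (e.g., blur + downsampling), followed by a learned prior module.
    """
    def __init__(self, channels=1):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))   # learnable step size
        self.prior = nn.Sequential(                    # learned prior / proximal net
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, y, A, At):
        # Data-consistency gradient step: x <- x - eta * A^T (A x - y)
        x = x - self.step * At(A(x) - y)
        # Prior step realized by a small CNN in residual form.
        return x + self.prior(x)
```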

Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

  • Fei Zhao
  • Chunhui Li
  • Zhen Wu
  • Shangyu Xing
  • Xinyu Dai

Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a (text, image) pair. However, dominant works independently model the internal matching relations within a single image-text pair, ignoring the external matching relations between different (text, image) pairs inside the dataset, though such relations are crucial for alleviating image noise in the MNER task. In this paper, we primarily explore two kinds of external matching relations between different (text, image) pairs, i.e., inter-modal relations and intra-modal relations. On this basis, we propose a Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. Specifically, we first construct an inter-modal relation graph and an intra-modal relation graph to gather, from the dataset, the image information most relevant to the current text and image, respectively. Then, multimodal interaction and fusion are leveraged to predict the NER label sequences. Extensive experimental results show that our model consistently outperforms state-of-the-art works on two public datasets. Our code and datasets are available at https://github.com/1429904852/R-GCN.

Multi-directional Knowledge Transfer for Few-Shot Learning

  • Shuo Wang
  • Xinyu Zhang
  • Yanbin Hao
  • Chengbing Wang
  • Xiangnan He

Knowledge transfer-based few-shot learning (FSL) aims at improving the recognition of a novel object under limited training samples by transferring relevant potential knowledge from other data. Most related methods calculate such knowledge to refine the representation of a novel sample or to enrich the supervision of a classifier during the transfer procedure. However, it is easy to introduce new noise during the transfer calculations since: (1) the unbalanced quantity of samples between the known (base) and the novel categories biases the capture of novel-object content, and (2) the semantic gaps between different modalities weaken the knowledge interaction during training.

To reduce the influence of these issues in knowledge transfer-based FSL, this paper proposes a multi-directional knowledge transfer (MDKT) method. Specifically, (1) we use two independent unidirectional knowledge self-transfer strategies to calibrate the distributions of the novel categories from the base categories in the visual and the textual spaces, aiming to yield transferable knowledge of the base categories that describes a novel category. (2) To reduce the influence of semantic gaps, we first use a bidirectional knowledge connection to exchange knowledge between the visual and the textual spaces. Then we adopt an online fusion strategy to enhance the expression of the textual knowledge and improve prediction accuracy on the novel categories by combining knowledge from different modalities. Empirical studies on three FSL benchmark datasets demonstrate the effectiveness of MDKT, which improves recognition accuracy on novel categories under limited samples, especially on 1-shot and 2-shot training tasks.

DetFusion: A Detection-driven Infrared and Visible Image Fusion Network

  • Yiming Sun
  • Bing Cao
  • Pengfei Zhu
  • Qinghua Hu

Infrared and visible image fusion aims to utilize the complementary information between the two modalities to synthesize a new image containing richer information. Most existing works have focused on how to better fuse the pixel-level details from both modalities in terms of contrast and texture, yet ignoring the fact that the significance of image fusion is to better serve downstream tasks. For object detection tasks, object-related information in images is often more valuable than focusing on the pixel-level details of images alone. To fill this gap, we propose a detection-driven infrared and visible image fusion network, termed DetFusion, which utilizes object-related information learned in the object detection networks to guide multimodal image fusion. We cascade the image fusion network with the detection networks of both modalities and use the detection loss of the fused images to provide guidance on task-related information for the optimization of the image fusion network. Considering that the object locations provide a priori information for image fusion, we propose an object-aware content loss that motivates the fusion model to better learn the pixel-level information in infrared and visible images. Moreover, we design a shared attention module to motivate the fusion network to learn object-specific information from the object detection networks. Extensive experiments show that our DetFusion outperforms state-of-the-art methods in maintaining pixel intensity distribution and preserving texture details. More notably, the performance comparison with state-of-the-art image fusion methods in task-driven evaluation also demonstrates the superiority of the proposed method. Our code will be available: https://github.com/SunYM2020/DetFusion.

Sketch Transformer: Asymmetrical Disentanglement Learning from Dynamic Synthesis

  • Cuiqun Chen
  • Mang Ye
  • Meibin Qi
  • Bo Du

Sketch-photo recognition is a cross-modal matching problem whose query sets are sketch images drawn by artists or amateurs. Due to the significant difference between the two modalities, it is challenging to extract discriminative modality-shared feature representations. Existing works focus on exploring modality-invariant features to discover a shared embedding space. However, they discard modality-specific cues, resulting in information loss and diminished discriminatory power of the features. This paper proposes a novel asymmetrical disentanglement and dynamic synthesis learning method in the transformer framework (SketchTrans) to handle the modality discrepancy by combining modality-shared information with modality-specific information. Specifically, an asymmetrical disentanglement scheme is introduced to decompose photo features into sketch-relevant and sketch-irrelevant cues while preserving the original sketch structure. Using the sketch-irrelevant cues, we further translate the sketch modality component into a photo representation through knowledge transfer, obtaining cross-modality representations with information symmetry. Moreover, we propose a dynamically updatable auxiliary sketch (A-sketch) modality generated from the photo modality to guide the asymmetrical disentanglement in a single framework. Under a multi-modality joint learning framework, this auxiliary modality increases the diversity of training samples and narrows the cross-modality gap. We conduct extensive experiments on three fine-grained sketch-based retrieval datasets, i.e., PKU-Sketch, QMUL-ChairV2, and QMUL-ShoeV2, outperforming the state of the art under various metrics.

Rethinking the Metric in Few-shot Learning: From an Adaptive Multi-Distance Perspective

  • Jinxiang Lai
  • Siqian Yang
  • Guannan Jiang
  • Xi Wang
  • Yuxi Li
  • Zihui Jia
  • Xiaochen Chen
  • Jun Liu
  • Bin-Bin Gao
  • Wei Zhang
  • Yuan Xie
  • Chengjie Wang

The few-shot learning problem focuses on recognizing unseen classes given a few labeled images. In recent efforts, more attention has been paid to fine-grained feature embedding, ignoring the relationship among different distance metrics. In this paper, for the first time, we investigate the contributions of different distance metrics and propose an adaptive fusion scheme, bringing significant improvements in few-shot classification. We start from a naive baseline of confidence summation and demonstrate the necessity of exploiting the complementary properties of different distance metrics. Having identified the competition problem among them, we build upon this baseline and propose an Adaptive Metrics Module (AMM) to decouple metric fusion into metric-prediction fusion and metric-loss fusion. The former encourages mutual complementarity, while the latter alleviates metric competition via multi-task collaborative learning. Based on AMM, we design a few-shot classification framework, AMTNet, including the AMM and a Global Adaptive Loss (GAL), to jointly optimize the few-shot task and an auxiliary self-supervised task, making the embedding features more robust. In experiments, the proposed AMM achieves 2% higher performance than a naive metrics fusion module, and our AMTNet outperforms the state of the art on multiple benchmark datasets.
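The confidence-summation baseline the authors start from can be pictured as a weighted sum of logits produced by two common metrics (cosine and negative squared Euclidean); the sketch below is that baseline only, with an assumed scalar weight, not the adaptive AMM.

```python
import torch
import torch.nn.functional as F

def fused_metric_logits(query, prototypes, alpha):
    """Naive fusion of two distance metrics for few-shot classification.

    query:      (Q, D) query embeddings
    prototypes: (N, D) class prototypes
    alpha:      learnable scalar tensor balancing the two metrics
    """
    cos_logits = F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    l2_logits = -torch.cdist(query, prototypes) ** 2  # negative squared Euclidean
    w = torch.sigmoid(alpha)
    return w * cos_logits + (1 - w) * l2_logits
```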

Cross-Modality Domain Adaptation for Freespace Detection: A Simple yet Effective Baseline

  • Yuanbin Wang
  • Leyan Zhu
  • Shaofei Huang
  • Tianrui Hui
  • Xiaojie Li
  • Fei Wang
  • Si Liu

As one of the fundamental functions of an autonomous driving system, freespace detection aims to classify each pixel of the image captured by the camera as drivable or non-drivable. Current freespace detection works rely heavily on large amounts of densely labeled training data for accuracy and robustness, which are time-consuming and laborious to collect and annotate. To the best of our knowledge, ours is the first work to explore unsupervised domain adaptation for freespace detection, alleviating the data limitation problem with synthetic data. We develop a cross-modality domain adaptation framework which exploits both RGB images and surface normal maps generated from depth images. A Collaborative Cross Guidance (CCG) module is proposed to leverage the context information of one modality to guide the other in a cross manner, thus realizing inter-modality intra-domain complementation. To better bridge the domain gap between the source domain (synthetic data) and the target domain (real-world data), we also propose a Selective Feature Alignment (SFA) module which only aligns the features of consistent foreground areas between the two domains, thus realizing inter-domain intra-modality adaptation. Extensive experiments are conducted by adapting three different synthetic datasets to one real-world dataset for freespace detection. Our method performs closely to fully supervised freespace detection methods (93.08% vs. 97.50% F1 score) and outperforms other general unsupervised domain adaptation methods for semantic segmentation by large margins, which shows the promising potential of domain adaptation for freespace detection.

Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection

  • Jin Xie
  • Rao Muhammad Anwer
  • Hisham Cholakkal
  • Jing Nie
  • Jiale Cao
  • Jorma Laaksonen
  • Fahad Shahbaz Khan

Multispectral pedestrian detection that enables continuous (day and night) localization of pedestrians has numerous applications. Existing approaches typically aggregate multispectral features by a simple element-wise operation. However, such a local feature aggregation scheme ignores the rich non-local contextual information. Further, we argue that a local tight correspondence across modalities is desired for multi-modal feature aggregation. To address these issues, we introduce a multispectral pedestrian detection framework that comprises a novel dynamic cross-modal network (DCMNet), which strives to adaptively utilize the local and non-local complementary information between multi-modal features. The proposed DCMNet consists of a local and a non-local feature aggregation module. The local module employs dynamically learned convolutions to capture local relevant information across modalities. On the other hand, the non-local module captures non-local cross-modal information by first projecting features from both modalities into the latent space and then obtaining dynamic latent feature nodes for feature aggregation. Comprehensive experiments are performed on two challenging benchmarks: KAIST and LLVIP. Experiments reveal the benefits of the proposed DCMNet, leading to consistently improved detection performance on diverse detection paradigms and backbones. When using the same backbone, our proposed detector achieves absolute gains of 1.74% and 1.90% over the baseline Cascade RCNN on the KAIST and LLVIP datasets.

Two-Stage Multi-Scale Resolution-Adaptive Network for Low-Resolution Face Recognition

  • Haihan Wang
  • Shangfei Wang
  • Lin Fang

Low-resolution face recognition is challenging due to uncertain input resolutions and the lack of distinguishing details in low-resolution (LR) facial images. Resolution-invariant representations must be learned for optimal performance. Existing methods for this task mainly minimize the distance between the representations of the low-resolution (LR) and corresponding high-resolution (HR) image pairs in a common subspace. However, these works only focus on introducing various distance metrics at the final layer and between HR-LR image pairs. They do not fully utilize the intermediate layers or multi-resolution supervision, yielding only modest performance. In this paper, we propose a novel two-stage multi-scale resolution-adaptive network to learn more robust resolution-invariant representations. In the first stage, the structural patterns and the semantic patterns are distilled from HR images to provide sufficient supervision for LR images. A curriculum learning strategy facilitates the training of HR and LR image matching, smoothly decreasing the resolution of LR images. In the second stage, a multi-resolution contrastive loss is introduced on LR images to enforce intra-class clustering and inter-class separation of the LR representations. By introducing multi-scale supervision and multi-resolution LR representation clustering, our network can produce robust representations despite uncertain input sizes. Experimental results on eight benchmark datasets demonstrate the effectiveness of the proposed method. Code will be released at https://github.com/hhwang98/TMR.

When True Becomes False: Few-Shot Link Prediction beyond Binary Relations through Mining False Positive Entities

  • Xuan Zhang
  • Xun Liang
  • Xiangping Zheng
  • Bo Wu
  • Yuhui Guo

Recently, the link prediction task on Hyper-relational Knowledge Graphs (HKGs), which aims to predict new facts beyond binary relations, has become a hot spot. Although previous models have achieved considerable progress, three challenges remain: i) previous models neglect the existence of False Positive Entities (FPEs), which are true entities in binary triples yet become false when encountering the query statements of HKGs; ii) due to sparse interactions, the models are not capable of coping with long-tail hyper-relations, which are ubiquitous in the real world; iii) the models are generally transductive and have difficulty adapting to new hyper-relations. To tackle the above issues, we first propose the task of few-shot link prediction on HKGs and devise hyper-relation-aware attention networks with a contrastive loss, which effectively encode all entities, including FPEs, and increase the distance between true entities and FPEs through contrastive learning. With few-shot references available, the proposed model then learns the representations of their long-tail hyper-relations and predicts new links by calculating the likelihood between queries and references. Furthermore, our model is inductive and scales to any new hyper-relation effortlessly. Since this is the first trial of few-shot link prediction for HKGs, we also modify existing few-shot learning approaches for binary relational data to work with HKGs as baselines. Experimental results on three real-world datasets show the superiority of our model over various state-of-the-art baselines.

Understanding Political Polarization via Jointly Modeling Users, Connections and Multimodal Contents on Heterogeneous Graphs

  • Hanjia Lyu
  • Jiebo Luo

Understanding political polarization on social platforms is important as public opinions may become increasingly extreme when they are circulated in homogeneous communities, thus potentially causing damage in the real world. Automatically detecting the political ideology of social media users can help better understand political polarization. However, it is challenging due to the scarcity of ideology labels, complexity of multimodal contents, and cost of time-consuming data collection process. Most previous frameworks either focus on unimodal content or do not scale up well. In this study, we adopt a heterogeneous graph neural network to jointly model user characteristics, multimodal post contents as well as user-item relations in a bipartite graph to learn a comprehensive and effective user embedding without requiring ideology labels. We apply our framework to online discussions about economy and public health topics. The learned embeddings are then used to detect political ideology and understand political polarization. Our framework outperforms the unimodal, early/late fusion baselines, and homogeneous GNN frameworks by a margin of at least 9% absolute gain in the area under the receiver operating characteristic on two social media datasets. More importantly, our work does not require a time-consuming data collection process, which allows faster detection and in turn allows the policy makers to conduct analysis and design policies in time to respond to crises. We also show that our framework learns meaningful user embeddings and can help better understand political polarization. Notable differences in user descriptions, topics, images, and levels of retweet/quote activities are observed. Our framework for decoding user-content interaction shows wide applicability in understanding political polarization. Furthermore, it can be extended to user-item bipartite information networks for other applications such as content and product recommendation.

SESSION: Oral Session XI: Understanding Multimedia Content -- Vision and Language

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

  • Yupan Huang
  • Tengchao Lv
  • Lei Cui
  • Yutong Lu
  • Furu Wei

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.

Reducing the Vision and Language Bias for Temporal Sentence Grounding

  • Daizong Liu
  • Xiaoye Qu
  • Wei Hu

Temporal sentence grounding (TSG) is an important yet challenging task in multimedia information retrieval. Although previous TSG methods have achieved decent performance, they tend to capture the selection biases of frequently appearing video-query pairs in the dataset rather than exhibit robust multimodal reasoning abilities, especially for rarely appearing pairs. In this paper, we study this issue of selection bias and accordingly propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both the vision and language modalities, enhancing the model's generalization ability. Specifically, we alleviate the issue from two perspectives: 1) Feature distillation. We build a multi-modal debiasing branch to first capture the vision and language biases, and then apply a bias identification module to explicitly recognize the true negative biases and remove them from the benign multi-modal representations. 2) Contrastive sample generation. We construct two types of negative samples to enforce the model to accurately learn the aligned multi-modal semantics and perform complete semantic reasoning. We apply the proposed model to both commonly and rarely appearing TSG cases and demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets (ActivityNet Caption, TACoS, and Charades-STA).

Face Forgery Detection via Symmetric Transformer

  • Luchuan Song
  • Xiaodan Li
  • Zheng Fang
  • Zhenchao Jin
  • YueFeng Chen
  • Chenliang Xu

Deep learning-based face forgery detection is a novel yet challenging task. Although impressive results have been achieved, there are still some limitations in existing methods. For example, previous methods struggle to maintain consistent predictions for consecutive frames, even when all of those frames are actually forged. We propose a symmetric transformer for channel and spatial feature extraction, motivated by the observation that the channel and spatial features of a robust forgery detector should be consistent in the temporal domain. The symmetric transformer adopts newly designed attention-based strategies that treat channel variance and spatial gradients as vital features, which greatly improves the robustness of deepfake video detection. Moreover, this symmetric structure acts on temporal and spatial features respectively, ensuring the robustness of detection from two different aspects. Our symmetric transformer is an end-to-end optimized network. Experiments are conducted under various settings; the proposed method achieves significant improvements in prediction robustness and performs better than state-of-the-art methods on different datasets.

End-to-End Compound Table Understanding with Multi-Modal Modeling

  • Zaisheng Li
  • Yi Li
  • Qiao Liang
  • Pengfei Li
  • Zhanzhan Cheng
  • Yi Niu
  • Shiliang Pu
  • Xi Li

Tables are a widely used data form in webpages, spreadsheets, and PDFs for organizing and presenting structured data. Although studies on table structure recognition have successfully been used to convert image-based tables into digital structural formats, solving many real problems still relies on further understanding of the table, such as cell relationship extraction. Current datasets related to table understanding are all based on digital formats. To boost research development, we release a new benchmark named ComFinTab with rich annotations that support both table recognition and understanding tasks. Unlike previous datasets containing only basic tables, ComFinTab contains a large ratio of compound tables, which are much more challenging and require methods using multiple information sources. Based on the dataset, we also propose a uniform, concise task form with an evaluation metric to better evaluate a model's performance on the table understanding task in compound tables. Finally, a framework named CTUNet is proposed to integrate the compromised visual, semantic, and position features with a graph attention network, which can solve the table recognition task and the challenging table understanding task as a whole. Experimental results compared with previous advanced table understanding methods demonstrate the effectiveness of our proposed model. Code and dataset are available at https://github.com/hikopensource/DAVAR-Lab-OCR.

Modality Eigen-Encodings Are Keys to Open Modality Informative Containers

  • Yiyuan Zhang
  • Yuqi Ji

Vision-Language fusion relies heavily on precise cross-modal information synergy. Nevertheless, modality divergence makes mutual description with the other modality extremely difficult. Despite various attempts to tap into the semantic unity of vision and language, most existing approaches utilize modality-specific features with high-dimensional tensors as the smallest unit of information, limiting the interactivity of multi-modal fine-grained fusion. Furthermore, in previous works, cross-modal interaction is commonly depicted by the similarity between semantically insufficient global features. Differently, we propose a novel scheme for multi-modal fusion named Vision Language Interaction (VLI). To represent more fine-grained and flexible modality information, we consider high-dimensional features as containers of modality-specific information, while the homogeneous semantic information between heterogeneous modalities is the key stored in the containers. We first construct information containers via multi-scale alignment and then utilize modality eigen-encodings to take out the homogeneous semantics at the vector level. Finally, we iteratively embed the eigen-encodings of one modality into the eigen-encodings of the other modality to perform cross-modal semantic interaction. After this embedding interaction, vision and language information can break the existing representation bottleneck at a level of granularity never achieved in previous work. Extensive experimental results on vision-language tasks validate the effectiveness of VLI. On the three benchmarks of Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Visual Question Answering (VQA), VLI significantly outperforms the existing state-of-the-art methods.

Visual Knowledge Graph for Human Action Reasoning in Videos

  • Yue Ma
  • Yali Wang
  • Yue Wu
  • Ziyu Lyu
  • Siran Chen
  • Xiu Li
  • Yu Qiao

Action recognition has been traditionally treated as a high-level video classification problem. However, such a manner lacks the detailed and semantic understanding of body movement, which is the critical knowledge to explain and infer complex human actions. To fill this gap, we propose to summarize a novel visual knowledge graph from over 15M detailed human annotations, for describing action as the distinct composition of body parts, part movements and interactive objects in videos. Based on it, we design a generic multi-modal Action Knowledge Understanding (AKU) framework, which can progressively infer human actions from body part movements in the videos, with assistance of visual-driven semantic knowledge mining. Finally, we validate AKU on the recent Kinetics-TPS benchmark, which contains body part parsing annotations for detailed understanding of human action in videos. The results show that, our AKU significantly boosts various video backbones with explainable action knowledge in both supervised and few shot settings, and outperforms the recent knowledge-based action recognition framework, e.g., our AKU achieves 83.9% accuracy on Kinetics-TPS while PaStaNet achieves 63.8% accuracy under the same backbone. The codes and models will be released at https://github.com/mayuelala/AKU.

Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog

  • Feilong Chen
  • Duzhen Zhang
  • Xiuyi Chen
  • Jing Shi
  • Shuang Xu
  • Bo XU

Visual dialog requires models to give reasonable answers according to a series of coherent questions and related visual concepts in images. However, most current work either focuses on attention-based fusion or on pre-training with large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). Firstly, AlignVD utilizes the visual and dialog encoders to represent images and dialogs. Then, it explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA). Specifically, UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct alignment. Finally, based on the aligned visual and textual representations, AlignVD gives a reasonable answer to the question via the cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets have demonstrated the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model has won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.

You Can even Annotate Text with Voice: Transcription-only-Supervised Text Spotting

  • Jingqun Tang
  • Su Qiao
  • Benlei Cui
  • Yuhang Ma
  • Sheng Zhang
  • Dimitrios Kanoulas

End-to-end scene text spotting has recently gained great attention in the research community. The majority of existing methods rely heavily on the location annotations of text instances (e.g., word-level boxes, word-level masks, and char-level boxes). We demonstrate that scene text spotting can be accomplished solely via text transcription, significantly reducing the need for costly location annotations. We propose a query-based paradigm to learn implicit location features via the interaction of text queries and image embeddings. These features are then made explicit during the text recognition stage via an attention activation map. Due to the difficulty of training the weakly-supervised model from scratch, we address the issue of model convergence via a circular curriculum learning strategy. Additionally, we propose a coarse-to-fine cross-attention localization mechanism for more precisely locating text instances. Notably, we provide a solution for text spotting via audio annotation, which further reduces the time required for annotation. Moreover, it establishes a link between audio, text, and image modalities in scene text spotting. Using only transcription annotations as supervision on both real and synthetic data, we achieve competitive results on several popular scene text benchmarks. The proposed method offers a reasonable trade-off between model accuracy and annotation time, allowing simplification of large-scale text spotting applications.

Inferential Visual Question Generation

  • Chao Bi
  • Shuhui Wang
  • Zhe Xue
  • Shengbo Chen
  • Qingming Huang

The task of Visual Question Generation (VQG) aims to generate natural language questions for images. Many methods regard it as a reverse Visual Question Answering (VQA) task: they train a data-driven generator on VQA datasets, which makes it hard to obtain questions that can challenge robots and humans. Other methods rely heavily on elaborate but expensive manual preprocessing. To overcome these limitations, we propose a method to generate inferential questions from an image with noisy captions. Our method first introduces a core scene graph generation module, which aligns text features and salient visual features to an initial scene graph. It constructs a special core scene graph by expanding linkage outwards from the high-confidence nodes hop by hop. Next, a question generation module uses the core scene graph as a basis to instantiate the function templates, resulting in questions with varying inferential paths. Experiments show that the visual questions generated by our method are controllable in both content and difficulty, and demonstrate clear inferential properties. In addition, since the salient regions, captions, and function templates can be replaced by human-customized ones, our method has strong scalability and potential for more interactive applications. Finally, we use our method to automatically build a new dataset, InVQA, containing about 120k images and 480k question-answer pairs, to facilitate the development of more versatile VQA models.

A Baseline for Detecting Out-of-Distribution Examples in Image Captioning

  • Gal Shalev
  • Gabi Shalev
  • Joseph Keshet

Image captioning research achieved breakthroughs in recent years by developing neural models that can generate diverse and high-quality descriptions for images drawn from the same distribution as training images. However, when facing out-of-distribution (OOD) images, such as corrupted images, or images containing unknown objects, the models fail in generating relevant captions.

In this paper, we consider the problem of OOD detection in image captioning. We formulate the problem and suggest an evaluation setup for assessing the model's performance on the task. Then, we analyze and show the effectiveness of the caption's likelihood score at detecting and rejecting OOD images, which implies that the relatedness between the input image and the generated caption is encapsulated within the score.
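A caption's likelihood score is typically the (length-normalized) sum of token log-probabilities under the captioning model. The helper below sketches that computation under assumed tensor shapes; the paper's exact normalization and thresholding may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def caption_log_likelihood(logits, caption_ids, pad_id=0):
    """Length-normalized log-likelihood of a generated caption as an OOD score.

    logits:      (T, V) decoder logits for the generated caption
    caption_ids: (T,)   the generated token ids
    Low scores suggest the caption (and hence the image) is poorly explained
    by the captioner, flagging a likely out-of-distribution input.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(1, caption_ids.unsqueeze(1)).squeeze(1)
    valid = caption_ids != pad_id
    return token_ll[valid].mean().item()
```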

Proxy Probing Decoder for Weakly Supervised Object Localization: A Baseline Investigation

  • Jingyuan Xu
  • Hongtao Xie
  • Chuanbin Liu
  • Yongdong Zhang

Weakly supervised object localization (WSOL) aims to localize objects using only image-level category labels. Existing methods generally fine-tune the models with manually selected training epochs and subjective loss functions to mitigate the partial activation problem of classification-based models. However, such a fine-tuning scheme causes the model to degrade, e.g., it affects the classification performance and generalization capabilities of the pre-trained model. In this paper, we propose a novel method named Proxy Probing Decoder (PPD) to meet these challenges, which utilizes the segmentation property of the self-attention maps in a self-supervised vision transformer and bypasses model fine-tuning with a novel proxy probing decoder. Specifically, we utilize the self-supervised vision transformer to capture long-range dependencies and avoid partial activation. Then we simply adopt a proxy consisting of a series of decoding layers to transform the feature representations into a heatmap of the objects' foreground and conduct localization. The backbone parameters are frozen during training while the proxy is used to decode the features and localize the object. In this way, the vision transformer model maintains its feature representation capabilities and only the proxy needs to adapt to the task. Without bells and whistles, our framework achieves 55.0% Top-1 Loc on the ILSVRC2012 dataset and 78.8% Top-1 Loc on the CUB-200-2011 dataset, which surpasses the state of the art by a large margin and provides a simple baseline. Codes and models will be available on GitHub.
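The frozen-backbone-plus-trainable-proxy idea can be sketched as follows; the decoder depth, token interface, and single-channel foreground head are assumptions for illustration and do not mirror the exact PPD architecture.

```python
import torch.nn as nn

class ProxyDecoder(nn.Module):
    """A small trainable head probing a frozen self-supervised ViT (illustrative).

    Only the decoder is optimized, so the backbone's representations (and its
    classification/generalization behavior) are left untouched.
    """
    def __init__(self, backbone, dim=768, num_layers=2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                  # freeze the pre-trained ViT
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)                # per-token foreground score

    def forward(self, images):
        tokens = self.backbone(images)               # assumed (B, N, dim) patch tokens
        return self.head(self.decoder(tokens))       # (B, N, 1) foreground heatmap
```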

Target-Driven Structured Transformer Planner for Vision-Language Navigation

  • Yusheng Zhao
  • Jinyu Chen
  • Chen Gao
  • Wenguan Wang
  • Lirong Yang
  • Haibing Ren
  • Huaxia Xia
  • Si Liu

Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

  • Xingchen Li
  • Long Chen
  • Wenbo Ma
  • Yi Yang
  • Jun Xiao

Recently, increasing efforts have been focused on Weakly Supervised Scene Graph Generation (WSSGG). The mainstream solution for WSSGG typically follows the same pipeline: they first align text entities in the weak image-level supervisions (e.g., unlocalized relation triplets or captions) with image regions, and then train SGG models in a fully-supervised manner with aligned instance-level "pseudo" labels. However, we argue that most existing WSSGG works only focus on object-consistency, which means the grounded regions should have the same object category label as text entities. While they neglect another basic requirement for an ideal alignment: interaction-consistency, which means the grounded region pairs should have the same interactions (i.e., visual relations) as text entity pairs. Hence, in this paper, we propose to enhance a simple grounding module with both object-aware and interaction-aware knowledge to acquire more reliable pseudo labels. To better leverage these two types of knowledge, we regard them as two teachers and fuse their generated targets to guide the training process of our grounding module. Specifically, we design two different strategies to adaptively assign weights to different teachers by assessing their reliability on each training sample. Extensive experiments have demonstrated that our method consistently improves WSSGG performance on various kinds of weak supervision.

Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition

  • Mingkun Yang
  • Minghui Liao
  • Pu Lu
  • Jing Wang
  • Shenggao Zhu
  • Hualin Luo
  • Qi Tian
  • Xiang Bai

Existing text recognition methods usually need large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between the synthetic data and real data, which limits the performance of text recognition models. Recent self-supervised text recognition methods have attempted to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of text images. Inspired by the observation that humans learn to recognize texts through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, which imitates the reading behavior of humans. Meanwhile, masked image modeling is introduced for text recognition for the first time to learn the context generation of text images, which is similar to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with obvious performance gains.

Hierarchical Walking Transformer for Object Re-Identification

  • Xudong Tian
  • Jun Liu
  • Zhizhong Zhang
  • Chengjie Wang
  • Yanyun Qu
  • Yuan Xie
  • Lizhuang Ma

Recently, transformer purely based on attention mechanism has been applied to a wide range of tasks and achieved impressive performance. Though extensive efforts have been made, there are still drawbacks to the transformer architecture which hinder its further applications: (i) the quadratic complexity brought by attention mechanism; (ii) barely incorporated inductive bias.

In this paper, we present a new hierarchical walking attention, which provides a scalable, flexible, and interpretable sparsification strategy to reduce the complexity from quadratic to linear while evidently boosting performance. Specifically, we learn a hierarchical structure by splitting an image with different receptive fields. We associate each high-level region with a supernode and inject supervision with prior knowledge into this node. The supernode then acts as an indicator to decide whether the area should be skipped, so that massive unnecessary dot-product terms in attention can be avoided. Two sparsification phases are finally introduced, allowing the transformer to achieve strictly linear complexity. Extensive experiments are conducted to demonstrate the superior performance and efficiency against state-of-the-art methods. Significantly, our method sharply reduces the inference time and the total number of tokens by 28% and 94%, respectively, and brings a 2.6% Rank-1 improvement on MSMT17.

Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation

  • Siying Wu
  • Xueyang Fu
  • Feng Wu
  • Zheng-Jun Zha

Vision-and-Language Navigation requires an agent to navigate to a target location by progressively grounding and following the relevant instruction, conditioned on its memory and current observation. Existing works utilize the cross-modal transformer to pass messages between the visual modality and the textual modality. However, they are still limited in mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress achieved by large-scale pre-training methods, in this paper, we propose CSAP, a new method of Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. It is designed to learn the alignment from trajectory-instruction pairs through two novel tasks: trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Specifically, the trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information to reconstruct the masked fragment. The contrastive semantic-alignment modeling is designed to align the visual representation with the corresponding phrase embeddings. Experimental results on the benchmark dataset demonstrate that a transformer-based navigation agent pre-trained with our proposed CSAP outperforms existing methods on both SR and SPL scores.

RONF: Reliable Outlier Synthesis under Noisy Feature Space for Out-of-Distribution Detection

  • Rundong He
  • Zhongyi Han
  • Xiankai Lu
  • Yilong Yin

Out-of-distribution (OOD) detection is fundamental to guaranteeing the reliability of multimedia applications during deployment in the open world. However, due to the lack of supervision signals from OOD data, current models easily output overconfident predictions on OOD data during the inference phase. Several previous methods rely on large-scale auxiliary OOD datasets for model regularization. However, obtaining suitable and clean large-scale auxiliary OOD datasets is usually challenging. In this paper, we present Reliable Outlier synthesis under Noisy Feature space (RONF), which synthesizes reliable virtual outliers in a noisy feature space to provide supervision signals for model regularization. Specifically, RONF first introduces a novel virtual outlier synthesis strategy, Boundary Feature Mixup (BFM), which mixes up samples from the low-likelihood region of the class-conditional distribution in the feature space. However, the feature space is noisy due to spurious features, which causes unreliable outlier synthesis. To mitigate this problem, RONF then introduces Optimal Parameter Learning (OPL) to obtain desirable features and remove spurious features. In addition, RONF proposes a provable and effective scoring function called Energy with Energy Discrepancy (EED) for the uncertainty measurement of OOD data. Extensive studies on several representative datasets of multimedia applications show that RONF remarkably outperforms the state-of-the-art methods.
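
The following is a minimal, hypothetical sketch of mixing low-likelihood ("boundary") features of a class-conditional Gaussian to synthesize virtual outliers, in the spirit of BFM. The Gaussian fit, the 10% boundary quantile, and the Beta mixing coefficient are all assumptions rather than the paper's exact procedure.

```python
import torch

def boundary_feature_mixup(feats, quantile=0.1, alpha=0.4):
    # feats: (N, D) penultimate-layer features of a single class.
    mean = feats.mean(dim=0)
    cov = torch.cov(feats.t()) + 1e-3 * torch.eye(feats.size(1))
    dist = torch.distributions.MultivariateNormal(mean, cov)
    logp = dist.log_prob(feats)                              # (N,) class-conditional density
    k = max(2, int(quantile * feats.size(0)))
    boundary = feats[logp.topk(k, largest=False).indices]    # lowest-density features
    lam = torch.distributions.Beta(alpha, alpha).sample((k, 1))
    perm = torch.randperm(k)
    # Mix boundary features pairwise to obtain virtual outliers for regularization.
    return lam * boundary + (1 - lam) * boundary[perm]

outliers = boundary_feature_mixup(torch.randn(200, 16))
```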

ConceptBeam: Concept Driven Target Speech Extraction

  • Yasunori Ohishi
  • Marc Delcroix
  • Tsubasa Ochiai
  • Shoko Araki
  • Daiki Takeuchi
  • Daisuke Niizumi
  • Akisato Kimura
  • Noboru Harada
  • Kunio Kashino

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.

Query-driven Generative Network for Document Information Extraction in the Wild

  • Haoyu Cao
  • Xin Li
  • Jiefeng Ma
  • Deqiang Jiang
  • Antai Guo
  • Yiqing Hu
  • Hao Liu
  • Yinsong Liu
  • Bo Ren

This paper focuses on the problem of Document Information Extraction (DIE) in the wild, which has rarely been explored before. In contrast to existing studies mainly tailored to documents in known templates with predefined layouts and keys under ideal input without OCR errors involved, we aim to build a more practical DIE paradigm for real-world scenarios where input document images may contain unknown layouts and keys together with problematic OCR results. To achieve this goal, we propose a novel architecture, termed Query-driven Generative Network (QGN), which is equipped with two consecutive modules, i.e., a Layout Context-aware Module (LCM) and a Structured Generation Module (SGM). Given a document image with unseen layouts and fields, the former LCM yields value prefix candidates serving as query prompts for the SGM to generate the final key-value pairs even with OCR noise. To further investigate the potential of our method, we create a new large-scale dataset, named LArge-scale STructured Documents (LastDoc4000), containing 4,000 documents with 1,511 layouts and 3,500 different keys. In experiments, we demonstrate that our QGN consistently achieves the best F1-score on the new LastDoc4000 dataset, with up to a 30.32% absolute improvement. A more comprehensive experimental analysis and experiments on other public benchmarks also verify the effectiveness and robustness of our proposed method for the wild DIE task.

SPTS: Single-Point Text Spotting

  • Dezhi Peng
  • Xinyu Wang
  • Yuliang Liu
  • Jiaxin Zhang
  • Mingxin Huang
  • Songxuan Lai
  • Jing Li
  • Shenggao Zhu
  • Dahua Lin
  • Chunhua Shen
  • Xiang Bai
  • Lianwen Jin

Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost single-point annotation for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, and achieves state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that points can be annotated much more easily, or even generated automatically, than bounding boxes that require precise positions. We believe that such a pioneering attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible. The code is available at https://github.com/shannanyinxiang/SPTS.

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

  • Yiyang Ma
  • Huan Yang
  • Bei Liu
  • Jianlong Fu
  • Jiaying Liu

AI illustrator aims to automatically design visually appealing images for books to provoke rich thoughts and emotions. To achieve this goal, we propose a framework for translating raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the complexity of the semantics of raw descriptions, which may be hard to visualize (e.g., "gloomy" or "Asian"), posing challenges for existing methods. To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) to leverage two powerful pre-trained models, including CLIP and StyleGAN. Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN which takes Image Embeddings as inputs and is trained with combined semantic consistency losses. To bridge the gap between realistic images and illustration designs, we further adopt a stylization model as post-processing in our framework for better visual effects. Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training. Furthermore, we have built a benchmark that consists of 200 descriptions from literature books or online resources. We conduct a user study to demonstrate our superiority over competing methods of text-to-image translation with complicated semantics.

Purifier: Plug-and-play Backdoor Mitigation for Pre-trained Models Via Anomaly Activation Suppression

  • Xiaoyu Zhang
  • Yulin Jin
  • Tao Wang
  • Jian Lou
  • Xiaofeng Chen

Pre-trained models have been widely adopted in deep learning development, benefiting the fine-tuning of downstream user-specific tasks with enormous computation savings. However, backdoor attacks pose a severe security threat to the subsequent models built upon compromised pre-trained models, calling for effective countermeasures to mitigate the backdoor threat before deploying the victim models to safety-critical applications. This paper proposes Purifier: a novel backdoor mitigation framework for pre-trained models via suppressing anomaly activation. Purifier is motivated by the observation that, for backdoor triggers, anomaly activation patterns exist across different perspectives (e.g., channel-wise, cube-wise, and feature-wise), featuring different degrees of granularity. More importantly, choosing to suppress at the right granularity is vital to robustness and accuracy. To this end, Purifier is capable of defending against diverse types of backdoor triggers without any prior knowledge of the backdoor attacks, while featuring a convenient and flexible, plug-and-play characteristic during deployment. Extensive experimental results against a series of state-of-the-art mainstream attacks show that Purifier performs better than the state-of-the-art methods in terms of both defense effectiveness and model inference accuracy on clean examples. Our code and Appendix can be found at https://github.com/RUIYUN-ML/Purifier.
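
As a rough, hypothetical illustration of plug-and-play anomaly activation suppression, the sketch below clamps channel-wise activations to a per-channel ceiling estimated from clean data via a forward hook. The percentile threshold and hook-based design are assumptions; Purifier's actual suppression granularities and rules differ.

```python
import torch
import torch.nn as nn

def attach_suppressor(layer, clean_acts, pct=0.99):
    # clean_acts: (N, C, H, W) activations of `layer` collected from clean inputs.
    C = clean_acts.size(1)
    ceiling = torch.quantile(clean_acts.permute(1, 0, 2, 3).reshape(C, -1), pct, dim=1)

    def hook(module, inputs, output):
        # Cap abnormally large channel responses (potential trigger activations).
        return torch.minimum(output, ceiling.view(1, -1, 1, 1).to(output.device))

    return layer.register_forward_hook(hook)

# Usage on a toy pre-trained convolutional block.
conv = nn.Conv2d(3, 8, 3, padding=1)
clean_acts = conv(torch.randn(32, 3, 16, 16)).detach()
handle = attach_suppressor(conv, clean_acts)
suspect_out = conv(torch.randn(4, 3, 16, 16))   # activations now capped channel-wise
handle.remove()
```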

C3CMR: Cross-Modality Cross-Instance Contrastive Learning for Cross-Media Retrieval

  • Junsheng Wang
  • Tiantian Gong
  • Zhixiong Zeng
  • Changchang Sun
  • Yan Yan

Cross-modal retrieval is an essential area of representation learning, which aims to retrieve instances with the same semantics from different modalities. In practice, a key challenge for cross-modal retrieval is to narrow the heterogeneity gap between different modalities and obtain modality-invariant and discriminative features. Typically, existing approaches for this task mainly learn inter-modal invariance and focus on how to combine pair-level and class-level losses, which cannot effectively and adequately learn discriminative features. To address these issues, in this paper, we propose a novel Cross-Modality Cross-Instance Contrastive Learning for Cross-Media Retrieval (C3CMR) method. Specifically, to fully exploit intra-modal similarities, we introduce intra-modal contrastive learning to enhance the discriminative power of the unimodal features. Besides, we design a supervised inter-modal contrastive learning scheme to take full advantage of the label semantic associations. In this way, cross-semantic associations and inter-modal invariance can be further learned. Moreover, to address the locally suboptimal semantic similarity caused by mining only pairwise and triplet sample relationships, we propose cross-instance contrastive learning to mine the similarities among multiple instances. Comprehensive experimental results on four widely-used benchmark datasets demonstrate the superiority of our proposed method over several state-of-the-art cross-modal retrieval methods.
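
A minimal sketch of what a supervised inter-modal contrastive term could look like, where image and text embeddings sharing a label are treated as positives. The temperature and the label-based positive mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def supervised_cross_modal_contrastive(img, txt, labels, temperature=0.07):
    # img, txt: (B, D) embeddings of paired images and texts; labels: (B,) class ids.
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature                        # (B, B) image-to-text
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # same-label positives
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(pos * log_prob).sum(dim=1).div(pos.sum(dim=1)).mean()

loss = supervised_cross_modal_contrastive(
    torch.randn(8, 128), torch.randn(8, 128), torch.randint(0, 3, (8,)))
```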

Progressive Attribute Embedding for Accurate Cross-modality Person Re-ID

  • Aihua Zheng
  • Peng Pan
  • Hongchao Li
  • Chenglong Li
  • Bin Luo
  • Chang Tan
  • Ruoran Jia

Attributes are important information to bridge the appearance gap across modalities, but they have not been well explored in cross-modality person ReID. This paper proposes a progressive attribute embedding module (PAE) to effectively fuse fine-grained semantic attribute information and global structural visual information. Through a novel cascade scheme, we use attribute information to learn the relationship between person images in different modalities, which significantly relieves the modality heterogeneity. Meanwhile, by embedding attribute information to guide more discriminative image feature generation, it simultaneously reduces the inter-class similarity and the intra-class discrepancy. In addition, we propose an attribute-based auxiliary learning strategy (AAL) to supervise the network to learn modality-invariant and identity-specific local features via joint attribute and identity classification losses. The PAE and AAL are jointly optimized in an end-to-end framework, namely the progressive attribute embedding network (PAENet). One can plug PAE and AAL into current mainstream models, as we do in five cross-modality person ReID frameworks to further boost their performance. Extensive experiments on public datasets demonstrate the effectiveness of the proposed method against state-of-the-art cross-modality person ReID methods.

Class Discriminative Adversarial Learning for Unsupervised Domain Adaptation

  • Lihua Zhou
  • Mao Ye
  • Xiatian Zhu
  • Shuaifeng Li
  • Yiguang Liu

As a state-of-the-art family of Unsupervised Domain Adaptation (UDA), bi-classifier adversarial learning methods are formulated in an adversarial (minimax) learning framework with a single feature extractor and two classifiers. Model training alternates between two steps: (I) constraining the learning of the two classifiers to maximize the prediction discrepancy of unlabeled target domain data, and (II) constraining the learning of the feature extractor to minimize this discrepancy. Despite being an elegant formulation, this approach has a fundamental limitation: Maximizing and minimizing the classifier discrepancy is not class discriminative for the target domain, finally leading to a suboptimal adapted model. To solve this problem, we propose a novel Class Discriminative Adversarial Learning (CDAL) method characterized by discovering class discrimination knowledge and leveraging this knowledge to discriminatively regulate the classifier discrepancy constraints on-the-fly. This is realized by introducing an evaluation criterion for judging each classifier's capability and each target domain sample's feature reorientation via objective loss reformulation. Extensive experiments on three standard benchmarks show that our CDAL method yields new state-of-the-art performance. Our code is made available at https://github.com/buerzlh/CDAL.

Background Layout Generation and Object Knowledge Transfer for Text-to-Image Generation

  • Zhuowei Chen
  • Zhendong Mao
  • Shancheng Fang
  • Bo Hu

Text-to-Image generation (T2I) aims to generate realistic and semantically consistent images according to natural language descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images shows two major limitations: 1) the background (e.g., fence, lake) of a generated image with a complicated, real-world scene tends to be unrealistic; 2) the object (e.g., elephant, zebra) in the generated image often presents a highly distorted shape or missing key parts. To address these limitations, we propose a two-stage T2I approach, where the first stage redesigns the text-to-layout process to incorporate the background layout with the existing object layout, and the second stage transfers object knowledge from an existing class-to-image model to the layout-to-image process to improve object fidelity. Specifically, a transformer-based architecture is introduced as the layout generator to learn the mapping from text to the layout of objects and background, and a Text-attended Layout-aware feature Normalization (TL-Norm) is proposed to adaptively transfer the object knowledge to the image generation. Benefiting from the background layout and transferred object knowledge, the proposed approach significantly surpasses previous state-of-the-art methods on the image quality metric and achieves superior image-text alignment performance.

Towards Further Comprehension on Referring Expression with Rationale

  • Rengang Li
  • Baoyu Fan
  • Xiaochuan Li
  • Runze Zhang
  • Zhenhua Guo
  • Kun Zhao
  • Yaqian Zhao
  • Weifeng Gong
  • Endong Wang

Referring Expression Comprehension (REC) is an important research branch in visual grounding, where the goal is to localize a relevant object in the image, given an expression in the form of text that exactly describes a specific object. However, existing REC tasks aim at text content filtering and image object locating, and are evaluated based on the precision of the detection boxes. This may allow models to bypass genuine multimodal comprehension and still achieve good performance. In this paper, we work on how to enable an artificial agent to understand referring expressions further and propose a more comprehensive task, called Further Comprehension on Referring Expression (FREC). In this task, we mainly focus on three sub-tasks: 1) correcting the erroneous text expression based on visual information; 2) generating the rationale of this input expression; 3) localizing the proper object based on the corrected expression. Accordingly, we build a new dataset named Further-RefCOCOs based on the RefCOCO, RefCOCO+, and RefCOCOg benchmark datasets for this new task and make it publicly available. We then design a novel end-to-end pipeline to tackle these sub-tasks simultaneously. The experimental results demonstrate the validity of the proposed pipeline. We believe this work will motivate more researchers to explore this direction and promote the development of visual grounding.

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

  • Mengqi Huang
  • Zhendong Mao
  • Penghui Wang
  • Quan Wang
  • Yongdong Zhang

Text-to-image generation aims at generating realistic images which are semantically consistent with the given text. Previous works mainly adopt a multi-stage architecture by stacking generator-discriminator pairs to engage in multiple adversarial training rounds, where the text semantics used to provide generation guidance remain static across all stages. This work argues that text features at each stage should be adaptively re-composed conditioned on the status of the historical stage (i.e., the historical stage's text and image features) to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We thereby propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) to re-compose each stage's text features under a novel single adversarial multi-stage architecture. Specifically, we design (1) a Dynamic Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, and then dynamically selects the words required to be re-composed at each stage and re-composes them by dynamically enhancing or suppressing the semantics of different granularity subspaces; and (2) a Single Adversarial Multi-stage Architecture (SAMA), which extends the previous structure by eliminating the complicated multiple adversarial training requirements, therefore allowing more stages of text-image interaction and finally facilitating the DSE module. We conduct comprehensive experiments and show that DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks, i.e., CUB-200 and MSCOCO, respectively.

Synthesizing Counterfactual Samples for Effective Image-Text Matching

  • Hao Wei
  • Shuhui Wang
  • Xinzhe Han
  • Zhe Xue
  • Bin Ma
  • Xiaoming Wei
  • Xiaolin Wei

Image-text matching is a fundamental research topic bridging vision and language. Recent works use hard negative mining to capture the multiple correspondences between the visual and textual domains. Unfortunately, the truly informative negative samples are quite sparse in the training data and are hard to obtain from a randomly sampled mini-batch. Motivated by causal inference, we aim to overcome this shortcoming by carefully analyzing the analogy between hard negative mining and causal effect optimization. Further, we propose the Counterfactual Matching (CFM) framework for more effective image-text correspondence mining. CFM contains three major components, i.e., Gradient-Guided Feature Selection for automatic causal factor identification, Self-Exploration for causal factor completeness, and Self-Adjustment for counterfactual sample synthesis. Compared with traditional hard negative mining, our method largely alleviates the over-fitting phenomenon and effectively captures the fine-grained correlations between the image and text modalities. We evaluate our CFM in combination with three state-of-the-art image-text matching architectures. Quantitative and qualitative experiments conducted on two publicly available datasets demonstrate its strong generality and effectiveness. Code is available at: https://github.com/weihao20/cfm.
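
As a loose, hypothetical illustration of gradient-guided counterfactual synthesis, the sketch below treats the feature dimensions with the largest gradient magnitude w.r.t. the matching score as causal factors and swaps them with another sample's values to form a hard negative. The top-k ratio, the cosine score, and the swap source are assumptions, not CFM's exact components.

```python
import torch
import torch.nn.functional as F

def counterfactual_negative(img_feat, txt_feat, donor_feat, ratio=0.2):
    # img_feat, txt_feat, donor_feat: (D,) features; donor comes from another image.
    img = img_feat.clone().requires_grad_(True)
    score = F.cosine_similarity(img, txt_feat, dim=0)   # matching score
    score.backward()
    k = max(1, int(ratio * img_feat.numel()))
    causal = img.grad.abs().topk(k).indices             # most influential dimensions
    negative = img_feat.clone()
    negative[causal] = donor_feat[causal]                # overwrite the causal factors
    return negative.detach()

neg = counterfactual_negative(torch.randn(256), torch.randn(256), torch.randn(256))
```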

Fine-tuning with Multi-modal Entity Prompts for News Image Captioning

  • Jingjing Zhang
  • Shancheng Fang
  • Zhendong Mao
  • Zhiwei Zhang
  • Yongdong Zhang

News Image Captioning aims to generate descriptions for images embedded in news articles, which involve plentiful real-world concepts, especially named entities. However, existing methods are limited by entity-level templates: crafting the template is not only labor-intensive but also error-prone due to local entity-awareness, which only constrains the prediction at each language-model decoding step and corrupts entity relationships. To overcome this problem, we investigate a concise and flexible paradigm to achieve global entity-awareness by introducing a prompting mechanism while fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode the article and image simultaneously. Moreover, leveraging the BART architecture, we can easily adopt an end-to-end fashion. Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation that focuses on entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods on general caption metrics but also achieves significant performance in precision and recall of various named entities.

Rethinking the Reference-based Distinctive Image Captioning

  • Yangjun Mao
  • Long Chen
  • Zhihong Jiang
  • Dong Zhang
  • Zhimeng Zhang
  • Jian Shao
  • Jun Xiao

Distinctive Image Captioning (DIC) --- generating distinctive captions that describe the unique details of a target image --- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to ensure that the generated captions can tell apart the target and reference images. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: these reference images only resemble the target image at the scene level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains the objects "towel'' and "toilet'' while all reference images are without them, then a simple caption "A bathroom with a towel and a toilet'' is distinctive enough to tell apart the target and reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at the object/attribute level (vs. scene level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.

A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

  • Alex Falcon
  • Giuseppe Serra
  • Oswald Lanz

Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples by creating new training samples through semantics-preserving techniques, such as color space or geometric transformations on images. Yet, these techniques are usually applied on raw data, leading to more resource-demanding solutions and also requiring the shareability of the raw data, which may not always be possible, e.g., due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieve considerable improvements over a baseline method and improved state-of-the-art performance, and additionally conduct multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
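
A minimal sketch of feature-space multimodal mixing under simple assumptions: new (video, caption) feature pairs are formed by interpolating a sample with its most caption-similar neighbor. The Beta(0.4, 0.4) coefficient and the similarity criterion are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def mix_similar_pairs(video_feats, text_feats, alpha=0.4):
    # video_feats, text_feats: (B, D) pre-extracted features of paired samples.
    sim = F.normalize(text_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    sim.fill_diagonal_(-1.0)                       # never mix a sample with itself
    partner = sim.argmax(dim=1)                    # most caption-similar other sample
    lam = torch.distributions.Beta(alpha, alpha).sample((video_feats.size(0), 1))
    new_video = lam * video_feats + (1 - lam) * video_feats[partner]
    new_text = lam * text_feats + (1 - lam) * text_feats[partner]
    return new_video, new_text                     # extra positive pairs for training

v_aug, t_aug = mix_similar_pairs(torch.randn(16, 512), torch.randn(16, 512))
```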

SESSION: Poster Session XI: Understanding Multimedia Content -- Vision and Language

MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning

  • Zejun Li
  • Zhihao Fan
  • Huaixiao Tou
  • Jingjing Chen
  • Zhongyu Wei
  • Xuanjing Huang

Previous vision-language pre-training models mainly construct multi-modal inputs with tokens and objects (pixels), followed by cross-modality interaction between them. We argue that an input of only tokens and object features limits high-level semantic alignment such as phrase-to-region grounding. Meanwhile, multi-level alignments are inherently consistent and able to facilitate representation learning synergistically. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease learning from multi-modal multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, while the second enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language model tasks, we introduce a masked concept recovering task in the first stage to enhance the concept representation learning, and two more tasks in the second stage to explicitly encourage multi-level alignments across modalities. Our model achieves state-of-the-art results on several vision and language tasks.

Combining Vision and Language Representations for Patch-based Identification of Lexico-Semantic Relations

  • Prince Jha
  • Gaël Dias
  • Alexis Lechervy
  • Jose G. Moreno
  • Anubhav Jangra
  • Sebastião Pais
  • Sriparna Saha

Although a wide range of applications have been proposed in the field of multimodal natural language processing, very few works have been tackling multimodal relational lexical semantics. In this paper, we propose the first attempt to identify lexico-semantic relations with visual clues, which embody linguistic phenomena such as synonymy, co-hyponymy or hypernymy. While traditional methods take advantage of the paradigmatic approach or/and the distributional hypothesis, we hypothesize that visual information can supplement the textual information, relying on the apperceptum subcomponent of the semiotic textology linguistic theory. For that purpose, we automatically extend two gold-standard datasets with visual information, and develop different fusion techniques to combine textual and visual modalities following the patch-based strategy. Experimental results over the multimodal datasets show that the visual information can supplement the missing semantics of textual encodings with reliable performance improvements.

Multi-Attention Network for Compressed Video Referring Object Segmentation

  • Weidong Chen
  • Dexiang Hong
  • Yuankai Qi
  • Zhenjun Han
  • Shuhui Wang
  • Laiyun Qing
  • Qingming Huang
  • Guorong Li

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer first models the correlation between linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive promising experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.

Cross-modal Co-occurrence Attributes Alignments for Person Search by Language

  • Kai Niu
  • Linjiang Huang
  • Yan Huang
  • Peng Wang
  • Liang Wang
  • Yanning Zhang

Person search by language refers to retrieving the pedestrian images of interest based on a free-form natural language description, which has important applications in smart video surveillance. Although great efforts have been made to align images with sentences, the challenge of reporting bias, i.e., attributes being only partially matched across modalities, still incurs large noise and seriously influences accurate retrieval. To address this challenge, we propose a novel cross-modal matching method named Cross-modal Co-occurrence Attributes Alignments (C2A2), which can better deal with noise and obtain significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries relying on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather pixels of the image and words of the sentence under the guidance of the learned attribute dictionaries, to adaptively constitute more discriminative co-occurrence attributes in both modalities. The re-gathered co-occurrence attributes are carefully captured by imposing explicit cross-modal one-to-one alignments which consider relations across modalities, better alleviating the noise from non-corresponding attributes. The whole C2A2 method can be trained end-to-end without any pre-processing, i.e., requiring negligible additional computational overhead. It significantly outperforms existing solutions and achieves new state-of-the-art retrieval performance on two large-scale benchmarks, the CUHK-PEDES and RSTPReid datasets.

RefCrowd: Grounding the Target in Crowd with Referring Expressions

  • Heqian Qiu
  • Hongliang Li
  • Taijin Zhao
  • Lanxiao Wang
  • Qingbo Wu
  • Fanman Meng

Crowd understanding has aroused widespread interest in the vision domain due to its important practical significance. Unfortunately, there has been no effort to explore crowd understanding in the multi-modal domain that bridges natural language and computer vision. Referring expression comprehension (REF) is a representative multi-modal task. Current REF studies focus more on grounding the target object among multiple distinctive categories in general scenarios, and are difficult to apply to complex real-world crowd understanding. To fill this gap, we propose a new challenging dataset, called RefCrowd, which targets grounding the target person in a crowd with referring expressions. It requires not only sufficiently mining natural language information, but also carefully focusing on subtle differences between the target and a crowd of persons with similar appearance, so as to realize fine-grained mapping from language to vision. Furthermore, we propose a Fine-grained Multi-modal Attribute Contrastive Network (FMAC) to deal with REF in crowd understanding. It first decomposes the intricate visual and language features into attribute-aware multi-modal features, and then captures discriminative yet robust fine-grained attribute features to effectively distinguish these subtle differences between similar persons. The proposed method outperforms existing state-of-the-art (SoTA) methods on our RefCrowd dataset and existing REF datasets. In addition, we implement an end-to-end REF toolbox for deeper research in the multi-modal domain. Our dataset and code are available at: https://qiuheqian.github.io/datasets/refcrowd/.

Unified Normalization for Accelerating and Stabilizing Transformers

  • Qiming Yang
  • Kai Zhang
  • Chaoxiang Lan
  • Zhi Yang
  • Zheyang Li
  • Wenming Tan
  • Jun Xiao
  • Shiliang Pu

Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost robustness. However, LN requires on-the-fly statistics calculation at inference as well as division and square root operations, leading to inefficiency on hardware. Moreover, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations while achieving performance on par with LN. UN strives to boost performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training, whose effectiveness is theoretically proved and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN by conducting extensive experiments on language and vision tasks. Besides, we evaluate the efficiency of our method on GPUs: Transformers equipped with UN enjoy about a 31% inference speedup and nearly 18% memory reduction. Code will be released at https://github.com/hikvision-research/Unified-Normalization.
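
To make the efficiency argument concrete, the sketch below shows why a normalization with fixed (offline) statistics can be folded into an adjacent linear projection at inference time. The shapes, eps, and toy running statistics are assumptions; this is not the UN implementation.

```python
import torch
import torch.nn as nn

def fuse_norm_into_linear(linear, mean, var, gamma, beta, eps=1e-5):
    # Inference-time: norm(x) = (x - mean) / sqrt(var + eps) * gamma + beta,
    # so linear(norm(x)) = x @ (W * scale).T + (W @ shift + b): one fused Linear.
    scale = gamma / torch.sqrt(var + eps)                 # (D_in,)
    shift = beta - mean * scale                           # (D_in,)
    fused = nn.Linear(linear.in_features, linear.out_features)
    fused.weight.data = linear.weight.data * scale        # scale each input column
    fused.bias.data = linear.weight.data @ shift + linear.bias.data
    return fused

D = 64
linear = nn.Linear(D, 32)
mean, var = torch.randn(D), torch.rand(D) + 0.5
gamma, beta = torch.ones(D), torch.zeros(D)
x = torch.randn(4, D)
ref = linear((x - mean) / torch.sqrt(var + 1e-5) * gamma + beta)
out = fuse_norm_into_linear(linear, mean, var, gamma, beta)(x)
assert torch.allclose(ref, out, atol=1e-4)   # fused layer matches norm + linear
```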

Enhancing Semi-Supervised Learning with Cross-Modal Knowledge

  • Hui Zhu
  • Yongchun Lu
  • Hongbin Wang
  • Xunyi Zhou
  • Qin Ma
  • Yanhong Liu
  • Ning Jiang
  • Xin Wei
  • Linchengxi Zeng
  • Xiaofang Zhao

Semi-supervised learning (SSL), which leverages a small number of labeled data that rely on expert knowledge and a large number of easily accessible unlabeled data, has made rapid progress recently. However, in existing SSL approaches the information comes from a single modality and the corresponding labels are one-hot, which can easily lead to deficient supervision, omission of information and unsatisfactory results, especially when more categories and fewer labeled samples are involved. In this paper, we propose a novel method to further enhance SSL by introducing semantic modal knowledge, which contains the word embeddings of class labels and the semantic hierarchy structure among classes. The former helps retain more potential information and almost quantitatively reflects the similarities and differences between categories. The latter encourages the model to construct the classification boundary from simple to complex, and thus improves the generalization ability of the model. Comprehensive experiments and ablation studies are conducted on commonly-used datasets to demonstrate the effectiveness of our method.
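
A small, hypothetical sketch of the first idea: replacing one-hot targets with soft targets derived from class-name word embeddings, so that inter-class similarity is reflected in the supervision. The temperature and mixing weight are assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def semantic_soft_targets(labels, class_embeds, temperature=0.1, mix=0.3):
    # labels: (B,) class ids; class_embeds: (C, D) word embeddings of class names.
    sim = F.normalize(class_embeds, dim=-1) @ F.normalize(class_embeds, dim=-1).t()
    soft = F.softmax(sim[labels] / temperature, dim=-1)       # (B, C) similarity-based
    one_hot = F.one_hot(labels, class_embeds.size(0)).float()
    return (1 - mix) * one_hot + mix * soft                   # blended supervision

targets = semantic_soft_targets(torch.randint(0, 10, (8,)), torch.randn(10, 300))
# Train with: loss = -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```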

Dynamic Spatio-Temporal Modular Network for Video Question Answering

  • Zi Qian
  • Xin Wang
  • Xuguang Duan
  • Hong Chen
  • Wenwu Zhu

Video Question Answering (VideoQA) aims to understand given videos and questions comprehensively by generating correct answers. However, existing methods usually rely on end-to-end black-box deep neural networks to infer the answers, which significantly differs from human logic reasoning, thus lacking the ability to explain. Besides, the performances of existing methods tend to drop when answering compositional questions involving realistic scenarios. To tackle these challenges, we propose a Dynamic Spatio-Temporal Modular Network (DSTN) model, which utilizes a spatio-temporal modular network to simulate the compositional reasoning procedure of human beings. Concretely, we divide the task of answering a given question into a set of sub-tasks focusing on certain key concepts in questions and videos such as objects, actions, temporal orders, etc. Each sub-task can be solved with a separately designed module, e.g., spatial attention module, temporal attention module, logic module, and answer module. Then we dynamically assemble different modules assigned with different sub-tasks to generate a tree-structured spatio-temporal modular neural network for human-like reasoning before producing the final answer for the question. We carry out extensive experiments on the AGQA dataset to demonstrate our proposed DSTN model can significantly outperform several baseline methods in various settings. Moreover, we evaluate intermediate results and visualize each reasoning step to verify the rationality of different modules and the explainability of the proposed DSTN model.

Micro-video Tagging via Jointly Modeling Social Influence and Tag Relation

  • Xiao Wang
  • Tian Gan
  • Yinwei Wei
  • Jianlong Wu
  • Dai Meng
  • Liqiang Nie

The last decade has witnessed the proliferation of micro-videos on various user-generated content platforms. According to our statistics, around 85.7% of micro-videos lack annotation. In this paper, we focus on annotating micro-videos with tags. Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relations. Meanwhile, existing tag relation construction methods suffer from either deficient performance or low tag coverage. To jointly model social influence and tag relations, we formulate micro-video tagging as a link prediction problem in a constructed heterogeneous network. Specifically, the tag relation (represented by a tag ontology) is constructed in a semi-supervised manner. Then, we combine the tag relation, video-tag annotations, and user follow relations to build the network. Afterward, better video and tag representations are derived through Behavior Spread modeling and visual and linguistic knowledge aggregation. Finally, the semantic similarity between each micro-video and all candidate tags is calculated in this video-tag network. Extensive experiments on industrial datasets of three verticals verify the superiority of our model compared with several state-of-the-art baselines.

MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

  • Qiang Zhou
  • Chaohui Yu
  • Hao Luo
  • Zhibin Wang
  • Hao Li

Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), which requires the target model to recover the masked part of the input image. Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, the visualizations show that the learned representations are less separable, especially compared to those based on contrastive learning pre-training. This inspires us to consider whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to utilize different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses.

Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo only needs 100 epochs of pre-training to achieve 82.53% top-1 finetuning accuracy on Imagenet-1K, which outperforms the state-of-the-art self-supervised learning counterparts.
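
A minimal sketch of the two reconstruction targets under stated assumptions: the student matches the frozen contrastive teacher's patch-level and image-level features for a masked input via cosine losses. The equal loss weighting and the exact formulation are assumptions, not MimCo's released code.

```python
import torch
import torch.nn.functional as F

def mimco_losses(student_patch, student_cls, teacher_patch, teacher_cls, mask):
    # *_patch: (B, N, D) patch features; *_cls: (B, D) image-level features;
    # mask: (B, N) with 1 for masked patches. The frozen teacher sees the full image.
    patch_sim = F.cosine_similarity(student_patch, teacher_patch, dim=-1)   # (B, N)
    patch_loss = ((1 - patch_sim) * mask).sum() / mask.sum().clamp(min=1)
    image_loss = (1 - F.cosine_similarity(student_cls, teacher_cls, dim=-1)).mean()
    return patch_loss + image_loss

B, N, D = 4, 196, 384
loss = mimco_losses(torch.randn(B, N, D), torch.randn(B, D),
                    torch.randn(B, N, D).detach(), torch.randn(B, D).detach(),
                    (torch.rand(B, N) < 0.75).float())
```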

LS-GAN: Iterative Language-based Image Manipulation via Long and Short Term Consistency Reasoning

  • Gaoxiang Cong
  • Liang Li
  • Zhenhuan Liu
  • Yunbin Tu
  • Weijun Qin
  • Shenyuan Zhang
  • Chengang Yan
  • Wenyu Wang
  • Bin Jiang

Iterative language-based image manipulation aims to edit images step by step according to user's linguistic instructions. The existing methods mostly focus on aligning the attributes and appearance of new-added visual elements with current instruction. However, they fail to maintain consistency between instructions and images as iterative rounds increase. To address this issue, we propose a novel Long and Short term consistency reasoning Generative Adversarial Network (LS-GAN), which enhances the awareness of previous objects with current instruction and better maintains the consistency with the user's intent under the continuous iterations. Specifically, we first design a Context-aware Phrase Encoder (CPE) to learn the user's intention by extracting different phrase-level information about the instruction. Further, we introduce a Long and Short term Consistency Reasoning (LSCR) mechanism. The long-term reasoning improves the model on semantic understanding and positional reasoning, while short-term reasoning ensures the ability to construct visual scenes based on linguistic instructions. Extensive results show that LS-GAN improves the generation quality in terms of both object identity and position, and achieves the state-of-the-art performance on two public datasets.

Multimodal Hate Speech Detection via Cross-Domain Knowledge Transfer

  • Chuanpeng Yang
  • Fuqing Zhu
  • Guihua Liu
  • Jizhong Han
  • Songiln Hu

Nowadays, the diffusion of hate speech combining text and images on social networks has overtaken text-only diffusion, raising a pressing need for multimodal hate speech detection. Current research on this task mainly focuses on the construction of multimodal models without considering the influence of the unbalanced and widely distributed samples of the various attacks in hate speech. In this situation, introducing enhanced knowledge is necessary for understanding the attack category of hate speech comprehensively. Due to the high correlation between the hate speech detection and sarcasm detection tasks, this paper makes an initial attempt at common knowledge transfer between the two tasks, where hate speech detection and sarcasm detection are defined as the primary and auxiliary tasks, respectively. A scalable cross-domain knowledge transfer (CDKT) framework is proposed, where mainstream vision-language transformers can be flexibly employed as the backbone. Three modules are included, bridging the semantic, definition and domain gaps between the primary and auxiliary tasks simultaneously. Specifically, the semantic adaptation module models the irrelevant parts between image and text in the primary and auxiliary tasks and disentangles them from the text representation to align the visual and word tokens. The definition adaptation module assigns different weights to the training samples of the auxiliary task by measuring the correlation between samples of the auxiliary and primary tasks. The domain adaptation module minimizes the feature distribution gap of samples in the two tasks. Extensive experiments show that the proposed CDKT provides a stable improvement over baselines and produces competitive performance compared with existing multimodal hate speech detection methods.

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

  • Zhiyuan Ma
  • Jianjun Li
  • Guohui Li
  • Kaiyan Huang

With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention and many remarkable progresses have been achieved. The success of VLP largely benefits from the information complementation and enhancement between different modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL) to promote image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart, which ignores the natural asymmetry between different modalities and requires a large-scale image-text corpus to achieve arduous progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer to perform anchor point masking and swapped feature filling for constructing a hybrid cross-modal associative prompt. Afterwards, we exploit a unified semantic encoder to learn their cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn potential associative mappings between modalities at anchor points, within which we develop a fresh self-supervised associative mapping classification task to boost CMAL's performance. Experimental results verify the effectiveness of CMAL, showing that it achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks, with a significantly smaller corpus. Notably, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

  • Xujie Zhang
  • Yu Sha
  • Michael C. Kampffmeyer
  • Zhenyu Xie
  • Zequn Jie
  • Chengwen Huang
  • Jianqing Peng
  • Xiaodan Liang

Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning textual and garment information is not sufficient to capture the semantics of the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage. Contrary to prior approaches that also rely on two-stage paradigms, ARMANI introduces textual tokens into the codebook, making it possible for the model to utilize fine-grained semantic information to generate more realistic images. Further, by introducing a cross-modal Transformer, ARMANI is versatile and can accomplish image synthesis from various control signals, such as pure text, sketch images, and partial images. Extensive experiments conducted on our newly collected cross-modal fashion dataset demonstrate that ARMANI generates photo-realistic images in diverse synthesis tasks and outperforms existing state-of-the-art cross-modal image synthesis approaches. Our code is available at https://github.com/Harvey594/ARMANI.

Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization

  • Daizong Liu
  • Wei Hu

This paper addresses the problem of natural language video localization (NLVL). Almost all existing works follow the "only look once" framework that exploits a single model to directly capture the complex cross- and self-modal relations among video-query pairs and retrieve the relevant segment. However, we argue that these methods have overlooked two indispensable characteristics of an ideal localization method: 1) Frame-differentiable: considering the imbalance of positive/negative video frames, it is effective to highlight positive frames and weaken negative ones during the localization. 2) Boundary-precise: to predict the exact segment boundary, the model should capture more fine-grained differences between consecutive frames since their variations are often smooth. To this end, inspired by how humans perceive and localize a segment, we propose a two-step human-like framework called Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module. The SL module first refers to the query semantic and selects the best matched frame from the video while filtering out irrelevant frames. Then, the BP module constructs an initial segment based on this frame, and dynamically updates it by exploring its adjacent frames until no frame shares the same activity semantic. Experimental results on three challenging benchmarks show that our SLP is superior to the state-of-the-art methods and localizes more precise segment boundaries.

Distance Matters in Human-Object Interaction Detection

  • Guangzhi Wang
  • Yangyang Guo
  • Yongkang Wong
  • Mohan Kankanhalli

Human-Object Interaction (HOI) detection has received considerable attention in the context of scene understanding. Despite the growing progress, we observe that existing methods often perform unsatisfactorily on distant interactions, where the leading causes are two-fold: 1) Distant interactions are by nature more difficult to recognize than close ones. A natural scene often involves multiple humans and objects with intricate spatial relations, making interaction recognition for distant human-object pairs largely affected by complex visual context. 2) The insufficient number of distant interactions in datasets results in under-fitting on these instances. To address these problems, we propose a novel two-stage method for better handling distant interactions in HOI detection. One essential component in our method is a novel Far Near Distance Attention module. It enables information propagation between humans and objects, whereby the spatial distance is skillfully taken into consideration. Besides, we devise a novel Distance-Aware loss function which leads the model to focus more on distant yet rare interactions. We conduct extensive experiments on the HICO-DET and V-COCO datasets. The results show that the proposed method surpasses existing methods significantly, leading to new state-of-the-art results.

Token Embeddings Alignment for Cross-Modal Retrieval

  • Chen-Wei Xie
  • Jianmin Wu
  • Yun Zheng
  • Pan Pan
  • Xian-Sheng Hua

Cross-modal retrieval has achieved significant progress in recent years with the help of token embedding interaction methods. Most existing methods first extract an embedding for each token of the input image and text, then feed the token-level embeddings into a multi-modal transformer to learn a joint representation, which can be used to predict the matching score between the input image and text. However, these methods do not explicitly supervise the alignment between visual and textual tokens. In this paper, we propose a novel Token Embeddings AlignMent (TEAM) block, which first explicitly aligns visual tokens and textual tokens, then produces token-level matching scores to measure the fine-grained similarity between the input image and text. TEAM achieves new state-of-the-art performance on commonly used cross-modal retrieval benchmarks. Moreover, TEAM is interpretable and we provide visualization experiments to show how it works. Finally, we construct a new billion-scale vision-language pre-training dataset in Chinese, which is the largest Chinese vision-language pre-training dataset so far. After pre-training on this dataset, our framework also achieves state-of-the-art performance on Chinese cross-modal retrieval benchmarks.
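
As a rough sketch of explicit token-level alignment, the snippet below aligns each textual token to its most similar visual token and averages the alignments into a matching score (a late-interaction form). This illustrates the general idea under stated assumptions, not TEAM's exact block.

```python
import torch
import torch.nn.functional as F

def token_alignment_score(visual_tokens, text_tokens, text_mask):
    # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D); text_mask: (B, Nt).
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    sim = torch.einsum("btd,bvd->btv", t, v)          # (B, Nt, Nv) token similarities
    best = sim.max(dim=-1).values                     # best visual match per word
    return (best * text_mask).sum(dim=1) / text_mask.sum(dim=1).clamp(min=1)

scores = token_alignment_score(torch.randn(2, 49, 256), torch.randn(2, 12, 256),
                               torch.ones(2, 12))
```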

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA

  • Zan-Xia Jin
  • Mike Zheng Shou
  • Fang Zhou
  • Satoshi Tsutsui
  • Jingyan Qin
  • Xu-Cheng Yin

Text-based Visual Question Answering (Text-VQA) is a question-answering task to understand scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, the text from OCR systems often includes spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method to alleviate OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. To be specific, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between the OCR tokens and words in a dictionary. Second, by assuming that the majority of characters in misspelled OCR tokens are still correct, a multimodal transformer is proposed and fine-tuned to predict the answer using character-based word embeddings. Specifically, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even with misspelled OCR tokens. A variety of experimental evaluations show that our method outperforms the state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.
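
The Levenshtein distance mentioned above can be used to pair a misspelled OCR token with its closest dictionary word, as in the toy sketch below (a plain edit-distance implementation; the paper's augmentation pipeline is more involved).

```python
def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_dictionary_word(ocr_token, vocab, max_dist=2):
    """Pair a (possibly misspelled) OCR token with its closest vocabulary word."""
    best_word, best_dist = None, max_dist + 1
    for word in vocab:
        d = levenshtein(ocr_token.lower(), word.lower())
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word  # None if nothing is within max_dist

print(nearest_dictionary_word("peosi", ["pepsi", "coke", "sprite"]))  # -> "pepsi"
```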

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

  • Xinyu Huang
  • Youcai Zhang
  • Ying Cheng
  • Weiwei Tian
  • Ruiwei Zhao
  • Rui Feng
  • Yuejie Zhang
  • Yaqian Li
  • Yandong Guo
  • Xiaobo Zhang

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods propose to adopt an off-the-shelf object detector to utilize additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting the model capacity. Inspired by the observation that the texts incorporate incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost the performance on multiple downstream datasets with a small extra computational cost.

CLOP: Video-and-Language Pre-Training with Knowledge Regularizations

  • Guohao Li
  • Hu Yang
  • Feng He
  • Zhifan Feng
  • Yajuan Lyu
  • Hua Wu
  • Haifeng Wang

Video-and-language pre-training has shown promising results for learning generalizable representations. Most existing approaches model video and text in an implicit manner, without considering explicit structural representations of the multi-modal content. We denote such representations as structural knowledge, which expresses rich semantics at multiple granularities. Related works have proposed object-aware approaches that inject similar knowledge as inputs. However, existing methods usually fail to effectively utilize such knowledge as regularizations to shape a superior cross-modal representation space. To this end, we propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations. Our method has two key designs: 1) a simple yet effective Structural Knowledge Prediction (SKP) task to pull together the latent representations of similar videos; and 2) a novel Knowledge-guided sampling approach for Contrastive Learning (KCL) to push apart cross-modal hard negative samples. We evaluate our method on four text-video retrieval tasks and one multi-choice QA task. The experiments show clear improvements, outperforming prior works by a substantial margin. Besides, we provide ablations and insights into how our methods affect the latent representation space, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.

Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks

  • Yudong Li
  • Xianxu Hou
  • Zhe Zhao
  • Linlin Shen
  • Xuefeng Yang
  • Kimmo Yan

Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected in the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g. text-guided face synthesis, face captioning and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters and objectives. While text and facial images are tokenized into sequences, the annotation labels of faces for different tasks are also converted to natural language for unified representation. We collect a set of 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks, according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than SOTA approaches. On age estimation and multi-attribute classification, our model reaches competitive performance with models specially designed and trained for these particular tasks. In practice, our model is much easier to deploy to different facial analysis related tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.

TxVAD: Improved Video Action Detection by Transformers

  • Zhenyu Wu
  • Zhou Ren
  • Yi Wu
  • Zhangyang Wang
  • Gang Hua

Video action detection aims to localize persons in both space and time from video sequences and recognize their actions. Most existing methods are composed of many specialized components, e.g., pretrained person/object detectors, region proposal networks (RPN), memory banks, and so on. This paper proposes a conceptually simple paradigm for video action detection using Transformers, which effectively removes the need for specialized components and achieves superior performance. Our proposed Transformer-based Video Action Detector (TxVAD) utilizes two Transformers to capture scene context information and long-range spatio-temporal context information, for person localization and action classification, respectively. Through extensive experiments on four public datasets, AVA, AVA-Kinetics, JHMDB-21, and UCF101-24, we show that our conceptually simple paradigm achieves state-of-the-art performance for the video action detection task, without using pre-trained person/object detectors, RPNs, or memory banks.

Relational Representation Learning in Visually-Rich Documents

  • Xin Li
  • Yan Zheng
  • Yiqing Hu
  • Haoyu Cao
  • Yunfei Wu
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren

Relational understanding is critical for a number of visually-rich document (VRD) understanding tasks. Through multi-modal pre-training, recent studies provide comprehensive contextual representations and exploit them as prior knowledge for downstream tasks. In spite of their impressive results, we observe that the widespread relational hints (e.g., the relation of key/value fields on receipts) built upon contextual knowledge have not been excavated yet. To mitigate this gap, we propose DocReL, a Document Relational Representation Learning framework. The major challenge of DocReL lies in the variety of relations. From the simplest pairwise relation to complex global structure, it is infeasible to conduct supervised training because the definition of relation varies, and even conflicts, across tasks. To deal with the unpredictable definition of relations, we propose a novel contrastive learning task named Relational Consistency Modeling (RCM), which harnesses the fact that existing relations should be consistent in differently augmented positive views. RCM provides relational representations that are more compatible with the needs of downstream tasks, even without any knowledge about the exact definition of relation. DocReL achieves better performance on a wide variety of VRD relational understanding tasks, including table structure recognition, key information extraction and reading order detection.

Unified Multimodal Model with Unlikelihood Training for Visual Dialog

  • Zihao Wang
  • Junli Wang
  • Changjun Jiang

The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit the useful knowledge from negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation by multi-task learning, our model extends ViLBERT from only supporting answer discrimination to supporting both answer discrimination and answer generation seamlessly via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks to implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we exploit unlikelihood training on negative instances to make the model less likely to generate incorrect answers. Then, to utilize the dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than just for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score), and it also yields comparable discriminative results with the state-of-the-art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
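
For readers unfamiliar with unlikelihood training, the sketch below shows the standard token-level unlikelihood term that penalizes probability mass placed on tokens of an incorrect answer; the exact weighting used by UniMM-UL may differ.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(neg_logits, neg_tokens):
    """Token-level unlikelihood term.

    neg_logits: (T, V) decoder logits obtained by teacher-forcing an
                *incorrect* answer of length T.
    neg_tokens: (T,)   token ids of that incorrect answer.
    Minimizing this term maximizes log(1 - p(incorrect token)).
    """
    probs = F.softmax(neg_logits, dim=-1)
    p_neg = probs.gather(1, neg_tokens.unsqueeze(1)).squeeze(1)   # (T,)
    return -torch.log1p(-p_neg.clamp(max=1.0 - 1e-6)).mean()

# The full objective combines the usual likelihood loss on correct answers
# with this term on incorrect ones, e.g.
#   loss = F.cross_entropy(pos_logits, pos_tokens) \
#        + lam * unlikelihood_loss(neg_logits, neg_tokens)
```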

Tackling Instance-Dependent Label Noise with Dynamic Distribution Calibration

  • Manyi Zhang
  • Yuxin Ren
  • Zihao Wang
  • Chun Yuan

Instance-dependent label noise is realistic but rather challenging, where the label-corruption process depends on instances directly. It causes a severe distribution shift between the distributions of training and test data, which impairs the generalization of trained models. Prior works put great effort into tackling this issue, but they often rely heavily on strong assumptions or remain heuristic without theoretical guarantees. In this paper, to address the distribution shift in learning with instance-dependent label noise, a dynamic distribution-calibration strategy is adopted. Specifically, we hypothesize that, before training data are corrupted by label noise, each class conforms to a multivariate Gaussian distribution at the feature level. Label noise produces outliers that shift the Gaussian distribution. During training, to calibrate the shifted distribution, we propose two methods based on the mean and covariance of the multivariate Gaussian distribution, respectively. The mean-based method works in a recursive dimension-reduction manner for robust mean estimation, which is theoretically guaranteed to train a high-quality model against label noise. The covariance-based method works in a distribution-disturbance manner, which is experimentally verified to improve model robustness. We demonstrate the utility and effectiveness of our methods on datasets with synthetic label noise and real-world unknown noise.

On Leveraging Variational Graph Embeddings for Open World Compositional Zero-Shot Learning

  • Muhammad Umer Anwaar
  • Zhihui Pan
  • Martin Kleinsteuber

Humans are able to identify and categorize novel compositions of known concepts. The task in Compositional Zero-Shot Learning (CZSL) is to learn compositions of primitive concepts, i.e. objects and states, in such a way that even their novel compositions can be zero-shot classified. In this work, we do not assume any prior knowledge on the feasibility of novel compositions, i.e. the open-world setting, where infeasible compositions dominate the search space. We propose a Compositional Variational Graph Autoencoder (CVGAE) approach for learning the variational embeddings of the primitive concepts (nodes) as well as the feasibility of their compositions (via edges). Such modelling makes CVGAE scalable to real-world application scenarios. This is in contrast to the SOTA method, CGE, which is computationally very expensive: e.g., for the benchmark C-GQA dataset, CGE requires 3.94×10^5 nodes, whereas CVGAE requires only 1323 nodes. We learn a mapping of the graph and image embeddings onto a common embedding space. CVGAE adopts a deep metric learning approach and learns a similarity metric in this space via a bi-directional contrastive loss between projected graph and image embeddings. We validate the effectiveness of our approach on three benchmark datasets. We also demonstrate via an image retrieval task that the representations learnt by CVGAE are better suited for compositional generalization.

Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval

  • Feifei Zhang
  • Ming Yan
  • Ji Zhang
  • Changsheng Xu

Composed Query Based Image Retrieval (CQBIR) aims at searching for images relevant to a composed query, i.e., a reference image together with a modifier text. Compared with conventional image retrieval, which takes a single image or text to retrieve desired images, CQBIR encounters more challenges as it requires not only effective semantic correspondence between the heterogeneous query and target, but also synergistic understanding of the composed query. To establish a robust CQBIR model, four critical types of relational information can be included, i.e., cross-modal, intra-sample, inter-sample, and cross-sample relationships. Pioneering studies mainly exploit only part of this information, making it hard for the different relationships to enhance and complement each other. In this paper, we propose a comprehensive relationship reasoning network that fully explores the four types of information for CQBIR, which mainly includes two key designs. First, we introduce a memory-augmented cross-modal attention module, in which the representation of the composed query is augmented by considering the cross-modal relationship between the reference image and the modification text. Second, we design a multi-scale matching strategy to optimize our network, aiming at harnessing information from the intra-sample, inter-sample, and cross-sample relationships. To the best of our knowledge, this is the first work to fully explore the four types of relationships in a unified deep model for CQBIR. Comprehensive experimental results on five standard benchmarks demonstrate that the proposed method performs favorably against state-of-the-art models.

Image Understanding by Captioning with Differentiable Architecture Search

  • Ramtin Hosseini
  • Pengtao Xie

In deep learning applications, image understanding is a crucial task, where several techniques such as image captioning and visual question answering have been widely studied to improve and evaluate the performance of deep neural networks (DNNs) in this area. In image captioning, models have encoder-decoder architectures, where the encoders take the input images, produce embeddings, and feed them into the decoders to generate textual descriptions. Manually designing a proper image captioning encoder-decoder architecture is a difficult challenge due to the complexity of recognizing the critical objects of the input images and their relationships to generate caption descriptions. To address this issue, we propose a three-level optimization method that employs differentiable architecture search strategies to automatically seek the most suitable architecture for image captioning. Our optimization framework involves three stages, which are performed end-to-end. In the first stage, an image captioning model learns and updates the weights of its encoder and decoder to create image captions. In the second stage, the trained encoder-decoder generates a pseudo image captioning dataset from unlabeled images, and the predictive model trains on the generated dataset to update its weights. Finally, the trained model validates its performance on the validation set and updates the encoder-decoder architecture by minimizing the validation loss. Experiments and studies on the COCO image captioning dataset demonstrate that our method performs significantly better than the baselines and can achieve state-of-the-art results in image understanding tasks.

Atrous Pyramid Transformer with Spectral Convolution for Image Inpainting

  • Muqi Huang
  • Lefei Zhang

Owing to their natural ability to model long-range dependencies, transformers make it possible to reconstruct the damaged areas of an image using information from the uncorrupted regions globally. In this paper, we propose a two-stage framework based on a novel atrous pyramid transformer (APT) for image inpainting that recovers the structure and texture of an image progressively. Specifically, the patches of APT blocks are embedded in an atrous pyramid manner to explicitly enhance both inter- and intra-window correlations and restore the high-level semantic structures of images more precisely, which serves as a guide map for the second phase. Subsequently, a dual spectral transform convolution (DSTC) module is further designed to work together with APT to infer the low-level features of the generated areas. The DSTC module decouples the image signal into high and low frequencies to capture texture information with a global view. Experiments on the CelebA-HQ, Paris StreetView, and Places2 datasets demonstrate the superiority of the proposed approach.

QuadTreeCapsule: QuadTree Capsules for Deep Regression Tracking

  • Ding Ma
  • Xiangqian Wu

Benefiting from the capability of capturing part-to-whole relationships, Capsule Networks have been successful in many vision tasks. However, their high computational complexity poses a significant obstacle to applying them to visual tracking, which requires fast inference. In this paper, we introduce the idea of QuadTree Capsules, which exploits the part-to-whole relationships endowed by the Capsule Network while significantly reducing the computational complexity. We build capsule pyramids and select meaningful relationships in a coarse-to-fine manner, dubbed QuadTreeCapsule. Specifically, the top K capsules with the highest activation values are selected, and routing is only calculated within the relevant regions corresponding to these top K capsules with a novel symmetric guided routing algorithm. Additionally, considering the importance of temporal relationships, a multi-spectral pose matrix attention mechanism is developed for more accurate spatio-temporal capsule assignments between two sets of capsules. Moreover, during online inference, we shift part of the spatio-temporal capsules along the temporal dimension, facilitating information exchange among neighboring frames. Extensive experiments demonstrate the effectiveness of our methodology, which achieves state-of-the-art results compared with other tracking methods on eight widely-used benchmarks. Our tracker runs at approximately 43 fps on GPU.

End-to-End 3D Face Reconstruction with Expressions and Specular Albedos from Single In-the-wild Images

  • Qixin Deng
  • Binh H. Le
  • Aobo Jin
  • Zhigang Deng

Recovering 3D face models from in-the-wild face images has numerous potential applications. However, properly modeling complex lighting effects in reality, including specular lighting, shadows, and occlusions, from a single in-the-wild face image is still considered a wide-open research challenge. In this paper, we propose a convolutional neural network based framework to regress the face model from a single image in the wild. The output face model includes dense 3D shape, head pose, expression, diffuse albedo, specular albedo, and the corresponding lighting conditions. Our approach uses novel hybrid loss functions to disentangle face shape identities, expressions, poses, albedos, and lighting. Besides a carefully-designed ablation study, we also conduct direct comparison experiments to show that our method can outperform state-of-the-art methods both quantitatively and qualitatively.

Heterogeneous Learning for Scene Graph Generation

  • Yunqing He
  • Tongwei Ren
  • Jinhui Tang
  • Gangshan Wu

The Scene Graph Generation (SGG) task aims to construct a graph structure that expresses objects and their relationships in a scene at a holistic level. Because current SGG methods neglect the heterogeneity of the feature spaces of objects and relations, their feature representations become strongly coupled, which results in large intra-class variation and inter-class ambiguity. In order to explicitly emphasize this heterogeneity in SGG, we propose a plug-and-play Heterogeneous Learning Branch (HLB), which enhances the independent representation capability of relation features. The HLB actively obscures the interconnection between the object and relation feature spaces via gradient reversal, with the assistance of a link prediction module as an information barrier and an Auto Encoder for information preservation. To validate the effectiveness of HLB, we apply it to typical SGG methods in which the feature spaces are either homogeneous or semi-heterogeneous, and conduct evaluation on the VG-150 dataset. The experimental results demonstrate that HLB significantly improves the performance of all these methods under the common evaluation criteria for the SGG task.

Equivariant and Invariant Grounding for Video Question Answering

  • Yicong Li
  • Xiang Wang
  • Junbin Xiao
  • Tat-Seng Chua

Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between visual scenes in the video and linguistic semantics in the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. Such black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?". Only a few works present visual explanations in a post-hoc fashion, which emulates the target model's answering process via an additional method.

Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene to yield answers, while rolling out the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, the equivariant grounding encourages the answering to be sensitive to the semantic changes in the causal scene and question; in contrast, the invariant grounding enforces the answering to be insensitive to the changes in the environment scene. By imposing them on the answering process, EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets justify the superiority of EIGV in terms of accuracy and visual interpretability over the leading baselines.

Align and Adapt: A Two-stage Adaptation Framework for Unsupervised Domain Adaptation

  • Yan Yu
  • Yuchen Zhai
  • Yin Zhang

Unsupervised domain adaptation aims to transfer knowledge from a labeled but heterogeneous source domain to an unlabeled target domain, reducing labeling effort. Early advances in domain adaptation focused on invariant representation learning (IRL) methods to align domain distributions. Recent studies further utilize semi-supervised learning (SSL) methods to regularize domain-invariant representations based on the cluster assumption, making the category boundary clearer. However, the misalignment in IRL methods might be intensified by SSL methods if the target instances are closer to the wrong source centroid, resulting in incompatibility between these techniques. In this paper, we hypothesize that this phenomenon derives from the distraction of the source domain, and we further give a novel two-stage adaptation framework to adapt the model toward the target domain. In addition, we propose DCAN to reduce the misalignment in IRL methods in the first stage, and we propose PCST to encode the semantic structure of unlabeled target data in the second stage. Extensive experiments demonstrate that our method outperforms current state-of-the-art methods on four benchmarks (Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017).

Detach and Attach: Stylized Image Captioning without Paired Stylized Dataset

  • Yutong Tan
  • Zheng Lin
  • Peng Fu
  • Mingyu Zheng
  • Lanrui Wang
  • Yanan Cao
  • Weipinng Wang

Stylized Image Captioning aims to generate captions with accurate image content and stylized elements simultaneously. However, large-scale paired image and stylized-caption data are costly to collect and usually unavailable, so it is challenging to generate stylized captions without a paired stylized dataset. Previous work on controlling the style of generated captions in an unsupervised way can be divided into two categories: implicit and explicit. The former mainly relies on a well-trained language model to capture style knowledge, which is limited to a single style and struggles to handle multi-style tasks. The latter therefore uses extra style constraints, such as outlined style labels or stylized words extracted from stylized sentences, to control the style rather than a trained style-specific language model. However, certain styles, such as humor and romance, are implied in the whole sentence rather than in some words of a sentence. To address the problems above, we propose a two-step method based on the Transformer: we first detach style representations from a large-scale stylized text-only corpus to provide more holistic style supervision, and then attach the style representations to image content to generate stylized captions. We learn a shared image-text space to narrow the gap between the image and text modalities for better attachment. Due to the trade-off between semantics and style, we explore three injection methods of style representations to balance the two requirements of image content preservation and stylization. Experiments show that our method outperforms state-of-the-art systems in overall performance, especially on implied styles.

PixelSeg: Pixel-by-Pixel Stochastic Semantic Segmentation for Ambiguous Medical Images

  • Wei Zhang
  • Xiaohong Zhang
  • Sheng Huang
  • Yuting Lu
  • Kun Wang

Semantic segmentation tasks often have multiple output hypotheses for a single input image. Particularly in medical images, these ambiguities arise from unclear object boundaries or differences in physicians' annotation. Learning the distribution of annotations and automatically giving multiple plausible predictions is useful to assist physicians in their decision-making. In this paper, we propose a semantic segmentation framework, PixelSeg, for modelling aleatoric uncertainty in segmentation maps and generating multiple plausible hypotheses. Unlike existing works, PixelSeg accomplishes the semantic segmentation task by sampling the segmentation maps pixel by pixel, which is achieved by the PixelCNN layers used to capture the conditional distribution between pixels. We propose (1) a hierarchical architecture to model high-resolution segmentation maps more flexibly, (2) a fast autoregressive sampling algorithm to improve sampling efficiency by 96.2, and (3) a resampling module to further improve predictions' quality and diversity. In addition, we demonstrate the great advantages of PixelSeg in the novel area of interactive uncertainty segmentation, which is beyond the capabilities of existing models. Extensive experiments and state-of-the-art results on the LIDC-IDRI and BraTS 2017 datasets demonstrate the effectiveness of our proposed model.

A Probabilistic Model for Controlling Diversity and Accuracy of Ambiguous Medical Image Segmentation

  • Wei Zhang
  • Xiaohong Zhang
  • Sheng Huang
  • Yuting Lu
  • Kun Wang

Medical image segmentation tasks often have more than one plausible annotation for a given input image due to its inherent ambiguity. Generating multiple plausible predictions for a single image is of interest for critical medical applications. Many methods estimate the distribution of the annotation space by developing probabilistic models to generate multiple hypotheses. However, these methods aim to improve the diversity of predictions at the expense of the more important accuracy. In this paper, we propose a novel probabilistic segmentation model, called Joint Probabilistic U-net, which achieves flexible control over the two abstract notions of diversity and accuracy. Specifically, we (i) model the joint distribution of images and annotations to learn a latent space, which is used to decouple diversity and accuracy, and (ii) transform the Gaussian distribution in the latent space into a complex distribution to improve the model's expressiveness. In addition, we explore two strategies for preventing latent space collapse, which are effective in improving the model's performance on datasets with limited annotation. We demonstrate the effectiveness of the proposed model on two medical image datasets, i.e. LIDC-IDRI and ISBI 2016, and achieve state-of-the-art results on several metrics.

Crossmodal Few-shot 3D Point Cloud Semantic Segmentation

  • Ziyu Zhao
  • Zhenyao Wu
  • Xinyi Wu
  • Canyu Zhang
  • Song Wang

Recently, few-shot 3D point cloud semantic segmentation methods have been introduced to mitigate the limitations of existing fully supervised approaches, i.e., heavy dependence on labeled 3D data and poor capacity to generalize to new categories. However, these few-shot learning methods need one or a few labeled samples as support for testing. In practice, such data labeling usually requires manual annotation of large-scale points in 3D space, which can be very difficult and laborious. To address this problem, in this paper we introduce a novel crossmodal few-shot learning approach for 3D point cloud semantic segmentation. In this approach, the point cloud to be segmented is taken as the query while one or a few labeled 2D RGB images are taken as support to guide the segmentation of the query. This way, we only need to annotate a few 2D support images for the categories of interest. Specifically, we first convert the 2D support images into 3D point cloud format based on both appearance and the estimated depth information. We then introduce a co-embedding network for extracting the features of support and query, both in 3D point cloud format, to fill their domain gap. Finally, we compute the prototypes of the support and employ cosine similarity between the prototypes and the query features for final segmentation. Experimental results on two widely-used benchmarks show that, with one or a few labeled 2D images as support, our proposed method achieves competitive results against existing few-shot 3D point cloud semantic segmentation methods.
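
The final prototype-matching step admits a very compact sketch, shown below under the assumption that every class has at least one support point; the real pipeline additionally handles the 2D-to-3D conversion and the co-embedding network.

```python
import torch
import torch.nn.functional as F

def prototype_segment(query_feats, support_feats, support_labels, num_classes):
    """Toy prototype-based labeling of query points.

    query_feats:    (Nq, D) features of the query point cloud.
    support_feats:  (Ns, D) features of the (image-derived) support points.
    support_labels: (Ns,)   integer class labels of the support points.
    Assumes every class id in [0, num_classes) has at least one support point.
    """
    # One prototype per class: the mean of its support features.
    protos = torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                                   # (C, D)
    # Cosine similarity between query features and prototypes.
    sim = F.normalize(query_feats, dim=-1) @ F.normalize(protos, dim=-1).T
    return sim.argmax(dim=-1)                            # (Nq,) predicted labels
```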

VQ-DcTr: Vector-Quantized Autoencoder With Dual-channel Transformer Points Splitting for 3D Point Cloud Completion

  • Ben Fei
  • Weidong Yang
  • Wen-Ming Chen
  • Lipeng Ma

Existing point cloud completion methods mainly utilize a global shape representation to recover the missing regions of the 3D shape from the partial point cloud. However, these methods learn global shape representations with continuous features, which conflicts with the inherently discrete nature of point clouds and hardly yields a high-quality structure for points. To address this challenge, we concentrate on discrete representations, which are potentially a more natural fit for the modalities of the point cloud. Therefore, we propose to employ a Vector Quantization (VQ) Auto-Encoder and a Dual-channel Transformer for point cloud completion (VQ-DcTr). VQ-DcTr uses discrete global features and exploits them in a well-structured generation process. Specifically, the vector quantization auto-encoder is integrated to learn a discrete latent representation along with the inductive biases inherent in the transformer-based auto-encoder. Using the decoded seeds from the auto-encoder, the dual-channel transformer leverages point-wise and channel-wise attention to learn the splitting patterns of the previous Dual-channel Transformer Points Splitting (DCTPS) layer to perform the points splitting in the current DCTPS layer. In this way, we can obtain a locally compact and structured point cloud by capturing the structural characteristics of the 3D shape in local patches. Extensive experiments on all standard benchmarks demonstrate that VQ-DcTr outperforms the state-of-the-art point cloud completion methods through qualitative and quantitative analysis.
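
The vector quantization step itself is standard: each continuous latent is snapped to its nearest codebook entry with a straight-through gradient, as in the illustrative sketch below (shapes and the omission of codebook/commitment losses are simplifying assumptions).

```python
import torch

def vector_quantize(latents, codebook):
    """Toy VQ step with a straight-through estimator.

    latents:  (N, D) continuous encoder outputs.
    codebook: (K, D) learnable discrete embedding vectors.
    Returns quantized latents of shape (N, D) and the chosen indices.
    """
    # Squared L2 distance from every latent to every codebook entry.
    dists = (latents.pow(2).sum(1, keepdim=True)
             - 2 * latents @ codebook.T
             + codebook.pow(2).sum(1))                    # (N, K)
    indices = dists.argmin(dim=1)                         # (N,)
    quantized = codebook[indices]                         # (N, D)
    # Straight-through: gradients flow to the encoder as if the
    # quantization step were the identity.
    quantized = latents + (quantized - latents).detach()
    return quantized, indices
```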

Fine-grained Action Recognition with Robust Motion Representation Decoupling and Concentration

  • Baoli Sun
  • Xinchen Ye
  • Tiantian Yan
  • Zhihui Wang
  • Haojie Li
  • Zhiyong Wang

Fine-grained action recognition is a challenging task that requires identifying discriminative and subtle motion variations among fine-grained action classes. Existing methods typically focus on spatio-temporal feature extraction and long-temporal modeling to characterize complex spatio-temporal patterns of fine-grained actions. However, the learned spatio-temporal features without explicit motion modeling may emphasize visual appearance more than motion, which could compromise the learning of effective motion features required for fine-grained temporal reasoning. Therefore, how to decouple robust motion representations from the spatio-temporal features and further effectively leverage them to enhance the learning of discriminative features remains less explored, yet it is crucial for fine-grained action recognition. In this paper, we propose a motion representation decoupling and concentration network (MDCNet) to address these two key issues. First, we devise a motion representation decoupling (MRD) module to disentangle the spatio-temporal representation into appearance and motion features through contrastive learning from video and segment views. Next, in the proposed motion representation concentration (MRC) module, the decoupled motion representations are further leveraged to learn a universal motion prototype shared across all the instances of each action class. Finally, we project the decoupled motion features onto all the motion prototypes through semantic relations to obtain the concentrated action-relevant features for each action class, which can effectively characterize the temporal distinctions of fine-grained actions for improved recognition performance. Comprehensive experimental results on four widely used action recognition benchmarks, i.e., FineGym, Diving48, Kinetics400 and Something-Something, clearly demonstrate the superiority of our proposed method in comparison with other state-of-the-art ones.

Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval

  • Sheng Fang
  • Shuhui Wang
  • Junbao Zhuo
  • Qingming Huang
  • Bin Ma
  • Xiaoming Wei
  • Xiaolin Wei

Due to the rapid growth of online video data, video-text retrieval techniques, which aim to search for the most relevant video given a natural language caption and vice versa, are urgently needed. The major challenge of this task is how to identify the true fine-grained semantic correspondence between videos and texts using only document-level correspondence. To deal with this issue, we propose a simple yet effective two-stream framework that takes concept information into account and introduces a new branch of semantic-level matching. We further propose a concept propagation mechanism for mining the latent semantics in videos and achieving enriched representations. The concept propagation is achieved by building a commonsense graph distilled from ConceptNet with concepts extracted from videos and captions. The original concepts of videos are detected by pretrained detectors as the initial concept representations. By conducting attentional graph reasoning on the commonsense graph with the guidance of external knowledge, we can extend new concepts in a detector-free manner to further enrich the video representations. In addition, a propagated BCE loss is designed to supervise the concept propagation procedure. Common space learning is then constructed for cross-modal matching. We conduct extensive experiments on various baseline models and several benchmark datasets. Promising experimental results demonstrate the effectiveness and generalization ability of our method.

Domain Generalization via Frequency-domain-based Feature Disentanglement and Interaction

  • Jingye Wang
  • Ruoyi Du
  • Dongliang Chang
  • Kongming Liang
  • Zhanyu Ma

Adaptation to out-of-distribution data is a meta-challenge for all statistical learning algorithms that strongly rely on the i.i.d. assumption. It leads to unavoidable labor costs and confidence crises in realistic applications. For that reason, domain generalization aims at mining domain-irrelevant knowledge from multiple source domains that can generalize to unseen target domains. In this paper, by leveraging the frequency domain of an image, we work with two key observations: (i) the high-frequency information of an image depicts object edge structure, which preserves high-level semantic information of the object and is naturally consistent across different domains, and (ii) the low-frequency component retains object smooth structure, and this information is susceptible to domain shifts. Motivated by the above observations, we introduce (i) an encoder-decoder structure to disentangle high- and low-frequency features of an image, (ii) an information interaction mechanism to ensure the helpful knowledge from the two parts can cooperate effectively, and (iii) a novel data augmentation technique that works in the frequency domain to encourage robust frequency-wise feature disentangling. The proposed method obtains state-of-the-art performance on three widely used domain generalization benchmarks (Digit-DG, Office-Home, and PACS).
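
The high/low-frequency split that motivates the method can be illustrated with a radial mask in the Fourier domain, as in the sketch below (the cutoff radius and the use of a hard mask are illustrative assumptions, not the paper's exact design).

```python
import numpy as np

def frequency_split(image, radius=16):
    """Split a grayscale image into low- and high-frequency parts.

    image: (H, W) float array.  `radius` is the cutoff of a circular
    low-pass mask in the centered Fourier spectrum (a hypothetical value).
    """
    H, W = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:H, :W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    low_mask = (dist <= radius).astype(np.float64)

    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * (1 - low_mask))).real
    return low, high   # low ~ smooth structure, high ~ edges/semantics
```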

Immunofluorescence Capillary Imaging Segmentation: Cases Study

  • Runpeng Hou
  • Ziyuan Ye
  • Chengyu Yang
  • Linhao Fu
  • Chao Liu
  • Quanying Liu

Nonunion is one of the challenges faced by orthopedics clinics due to the technical difficulty and high cost of photographing interosseous capillaries. Segmenting vessels and filling capillaries are critical to understanding the obstacles encountered in capillary growth. However, existing datasets for blood vessel segmentation mainly focus on the large blood vessels of the body, and the lack of labeled capillary image datasets greatly limits the methodological development and applications of vessel segmentation and capillary filling. Here, we present a benchmark dataset, named IFCIS-155, consisting of 155 2D capillary images with segmentation boundaries and vessel fillings annotated by biomedical experts, and 19 large-scale, high-resolution 3D capillary images. To obtain better images of interosseous capillaries, we leverage state-of-the-art immunofluorescence imaging techniques to highlight the rich vascular morphology of interosseous capillaries. We conduct comprehensive experiments to verify the effectiveness of the dataset and the benchmark deep learning models (e.g. UNet/UNet++ and the modified UNet/UNet++). Our work offers a benchmark dataset for training deep learning models for capillary image segmentation and provides a potential tool for future capillary research. The IFCIS-155 dataset and code are publicly available at https://github.com/ncclabsustech/IFCIS-55.

Imitated Detectors: Stealing Knowledge of Black-box Object Detectors

  • Siyuan Liang
  • Aishan Liu
  • Jiawei Liang
  • Longkang Li
  • Yang Bai
  • Xiaochun Cao

Deep neural networks have shown great potential in many practical applications, yet their knowledge is at risk of being stolen via exposed services (e.g., APIs). In contrast to the commonly-studied classification model extraction, there exist no studies on the more challenging object detection task, owing to the difficulty of collecting problem-domain data sufficiently and efficiently. In this paper, we for the first time reveal that black-box victim object detectors can be easily replicated without knowing the model structure or training data. In particular, we treat this as black-box knowledge distillation and propose a teacher-student framework named Imitated Detector to transfer the knowledge of the victim model to the imitated model. To accelerate problem-domain data construction, we extend the problem-domain dataset by generating synthetic images, where we apply a text-to-image generation process and provide short text inputs consisting of object categories and natural scenes; to promote the feedback information, we aim to fully mine the latent knowledge of the victim model by introducing an iterative adversarial attack strategy, where we feed victim models with transferable adversarial examples, making the victim provide diversified predictions with more information. Extensive experiments on multiple datasets in different settings demonstrate that our approach achieves the highest model extraction accuracy and outperforms other model stealing methods by large margins in the problem-domain dataset. Our code can be found at https://github.com/LiangSiyuan21/Imitated-Detectors.

Boosting Single-Frame 3D Object Detection by Simulating Multi-Frame Point Clouds

  • Wu Zheng
  • Li Jiang
  • Fanbin Lu
  • Yangyang Ye
  • Chi-Wing Fu

To boost a detector for single-frame 3D object detection, we present a new approach to train it to simulate features and responses following a detector trained on multi-frame point clouds. Our approach needs multi-frame point clouds only when training the single-frame detector, and once trained, it can detect objects with only single-frame point clouds as inputs during the inference. For this purpose, we design a novel Simulated Multi-Frame Single-Stage object Detector (SMF-SSD) framework: multi-view dense object fusion to densify ground-truth objects to generate a multi-frame point cloud; self-attention voxel distillation to facilitate one-to-many knowledge transfer from multi- to single-frame voxels; multi-scale BEV feature distillation to transfer knowledge in low-level spatial and high-level semantic BEV features; and adaptive response distillation to activate single-frame responses of high confidence and accurate localization. Experimental results on the Waymo test set show that our SMF-SSD consistently outperforms all state-of-the-art single-frame 3D object detectors for all object classes of difficulty levels 1 and 2 in terms of both mAP and mAPH.

Towards Complex Document Understanding By Discrete Reasoning

  • Fengbin Zhu
  • Wenqiang Lei
  • Fuli Feng
  • Chao Wang
  • Haozhou Zhang
  • Tat-Seng Chua

Document Visual Question Answering (VQA) aims to answer questions over visually-rich documents. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured table(s) and unstructured text as well as 16,558 question-answer pairs. The documents are sampled from financial reports and contain lots of numbers, which means discrete reasoning capability is required to answer the questions. Based on TAT-DQA, we further develop a novel model named MHST that takes into account information in multiple modalities to intelligently address different types of questions with corresponding strategies, i.e., extraction or reasoning. The experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that our TAT-DQA dataset will facilitate research on the understanding of visually-rich documents, especially for scenarios that require discrete reasoning. We also hope the proposed model will inspire researchers to design more advanced Document VQA models in the future.

RPPformer-Flow: Relative Position Guided Point Transformer for Scene Flow Estimation

  • Hanlin Li
  • Guanting Dong
  • Yueyi Zhang
  • Xiaoyan Sun
  • Zhiwei Xiong

Estimating scene flow for point clouds is one of the key problems in 3D scene understanding and autonomous driving. Recently the point transformer architecture has become a popular and successful solution for 3D computer vision tasks, e.g., point cloud object detection and completion, but its application to scene flow estimation is rarely explored. In this work, we provide a full transformer based solution for scene flow estimation. We first introduce a novel relative position guided point attention mechanism. Then to relax the memory consumption in practice, we provide an efficient implementation of our proposed point attention layer via matrix factorization and nearest neighbor sampling. Finally, we build a pyramid transformer, named RPPformer-Flow, to estimate the scene flow between two consecutive point clouds in a coarse-to-fine manner. We evaluate our RPPformer-Flow on the FlyingThings3D and KITTI Scene Flow 2015 benchmarks. Experimental results show that our method outperforms previous state-of-the-art methods with large margins.

mmLayout: Multi-grained MultiModal Transformer for Document Understanding

  • Wenjin Wang
  • Zhengjie Huang
  • Bin Luo
  • Qianglong Chen
  • Qiming Peng
  • Yinxu Pan
  • Weichong Yin
  • Shikun Feng
  • Yu Sun
  • Dianhai Yu
  • Yin Zhang

Recent efforts on multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks by incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. First, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained elements and, after further processing, fused back into the fine-grained representations for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.

Boosting Video-Text Retrieval with Explicit High-Level Semantics

  • Haoran Wang
  • Di Xu
  • Dongliang He
  • Fu Li
  • Zhong Ji
  • Jungong Han
  • Errui Ding

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant video (text) given a query (video). Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking the awareness of homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e. discrete semantics and holistic semantics. Specifically, for visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts including occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, meanwhile both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves the superior performance over state-of-the-art methods on three benchmark datasets, including MSR-VTT, MSVD and DiDeMo.

Rethinking the Mechanism of the Pattern Pruning and the Circle Importance Hypothesis

  • Hengyi Zhou
  • Longjun Liu
  • Haonan Zhang
  • Nanning Zheng

Network pruning is an effective and widely-used model compression technique. Pattern pruning is a pruning approach along a new sparsity dimension whose compression ability has been proven in some prior works. However, a detailed study on "patterns" and pattern pruning is still lacking. In this paper, we analyze the mechanism behind pattern pruning. Our analysis reveals that the effectiveness of pattern pruning should be attributed to finding the less important weights even before training. Then, motivated by the fact that the retinal ganglion cells in the biological visual system have approximately concentric receptive fields, we further investigate and propose the Circle Importance Hypothesis to guide the design of efficient patterns. We also design two series of special efficient patterns: circle patterns and semicircle patterns. Moreover, inspired by the neural architecture search technique, we propose a novel one-shot gradient-based pattern pruning algorithm. Besides, we also expand depthwise convolutions with our circle patterns, which improves the accuracy of networks with little extra memory cost. Extensive experiments are performed to validate our hypotheses and the effectiveness of the proposed methods. For example, we reduce the FLOPs of ResNet-56 by 44.0% while improving its accuracy to 94.38% on CIFAR-10, and we reduce the FLOPs of ResNet-18 by 41.0% with only a 1.11% accuracy drop on ImageNet.
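
To make the notion of a "circle pattern" concrete, the sketch below applies a hypothetical circle-like mask to 3x3 convolution kernels; the paper's actual pattern families and selection procedure may differ.

```python
import torch

# A hypothetical "circle" pattern for 3x3 kernels: keep the center and its
# 4-connected neighbours, prune the four corner weights.
CIRCLE_PATTERN = torch.tensor([[0., 1., 0.],
                               [1., 1., 1.],
                               [0., 1., 0.]])

def apply_pattern(conv_weight, pattern=CIRCLE_PATTERN):
    """Zero out kernel entries outside the pattern.

    conv_weight: (out_c, in_c, 3, 3) weight tensor of a conv layer.
    """
    return conv_weight * pattern.to(conv_weight.device)

# During pattern pruning the mask is typically re-applied after every
# optimizer step so that pruned positions stay zero throughout training.
```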

A Region-based Document VQA

  • Xinya Wu
  • Duo Zheng
  • Ruonan Wang
  • Jiashen Sun
  • Minzhen Hu
  • Fangxiang Feng
  • Xiaojie Wang
  • Huixing Jiang
  • Fan Yang

Practical Document Visual Question Answering (DocVQA) needs not only to recognize and extract document contents, but also to reason over them to answer questions. However, previous DocVQA data mainly focus on in-line questions, whose answers can be directly extracted after locating keywords in the documents, which requires less reasoning. This paper therefore builds a large-scale dataset named Region-based Document VQA (RDVQA), which includes more practical questions for DocVQA. We then propose a novel Reason-over-In-region-Question-answering (ReIQ) model for addressing these problems. It is a pre-training-based model, where a Spatial-Token Pre-trained Model (STPM) is employed as the backbone. Two novel pre-training tasks, Masked Text Box Regression and Shuffled Triplet Reconstruction, are proposed to learn the entailment relationship between text blocks and tokens as well as contextual information, respectively. Moreover, a DocVQA State Tracking Module (DocST) is also proposed to track the DocVQA state in the fine-tuning stage. Experimental results show that our model improves the performance on RDVQA significantly, although more work remains to be done for practical DocVQA, as shown on RDVQA.

CyclicShift: A Data Augmentation Method For Enriching Data Patterns

  • Hui Lu
  • Xuan Cheng
  • Wentao Xia
  • Pan Deng
  • MingHui Liu
  • Tianshu Xie
  • XiaoMin Wang
  • Ming Liu

In this paper, we propose a simple yet effective data augmentation strategy, dubbed CyclicShift, to enrich data patterns. The idea is to shift the image in a certain direction and then circularly refill the resulting out-of-frame part on the other side. Compared with previous related methods, Translation and Shuffle, our proposed method avoids losing pixels of the original image and preserves its semantic information as much as possible. Visually and empirically, we show that our method indeed brings new data patterns and thereby improves the generalization ability as well as the performance of models. Extensive experiments demonstrate our method's effectiveness in image classification and fine-grained recognition over multiple datasets and various network architectures. Furthermore, our method can also be superimposed on other data augmentation methods in a very simple way. CyclicMix, the simultaneous use of CyclicShift and CutMix, hits a new high in most cases. Our code is open-source and available at https://github.com/dejavunHui/CyclicShift.
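
The core operation is just a circular shift, which NumPy's roll implements directly; a minimal sketch follows (the shift amounts are arbitrary examples).

```python
import numpy as np

def cyclic_shift(image, dx, dy):
    """Shift an (H, W, C) image by (dy, dx) pixels with circular wrap-around,
    so no pixels of the original image are lost."""
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

# Example: shift a dummy image a quarter of its width to the right and a
# quarter of its height down; the strip pushed out of frame reappears on
# the opposite sides instead of being discarded.
img = np.random.rand(224, 224, 3)
aug = cyclic_shift(img, dx=56, dy=56)
```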

Counterexample Contrastive Learning for Spurious Correlation Elimination

  • Jinqiang Wang
  • Rui Hu
  • Chaoquan Jiang
  • Rui Hu
  • Jitao Sang

Biased datasets lead models to learn bias features that are highly correlated with labels, which deteriorates performance, especially when the test data deviate from the training distribution. Most existing solutions resort to introducing additional data to explicitly balance the dataset, e.g., counterfactually generating augmented data. In this paper, we argue that there actually exist valuable samples within the original dataset that can potentially help the model circumvent spurious correlations. We call the observed samples whose bias-task correspondence is inconsistent with that of the majority samples counterexamples. By analyzing when and how counterexamples assist in circumventing spurious correlations, we propose Counterexample Contrastive Learning (CounterCL) to exploit the limited observed counterexamples to regulate feature representation. Specifically, CounterCL pulls counterexamples close to samples of the same class with different bias features, and at the same time pushes them away from samples of different classes with the same bias features. Quantitative and qualitative experiments validate the effectiveness of the method and demonstrate its compatibility with other debiasing solutions.
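
One plausible instantiation of this pull/push behaviour is an InfoNCE-style term over a single counterexample anchor, sketched below; the positive/negative definitions follow the abstract, while the temperature and exact loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def counterexample_contrastive(anchor, feats, labels, biases,
                               anchor_label, anchor_bias, tau=0.1):
    """Toy contrastive term for one counterexample anchor.

    anchor:        (D,)  feature of a counterexample.
    feats:         (N, D) features of other samples in the batch.
    labels/biases: (N,)  task labels and bias-attribute labels.
    Positives: same task label, different bias.  Negatives: different task
    label, same bias (the spuriously similar samples).
    """
    sim = F.normalize(feats, dim=-1) @ F.normalize(anchor, dim=-1) / tau  # (N,)
    pos = (labels == anchor_label) & (biases != anchor_bias)
    neg = (labels != anchor_label) & (biases == anchor_bias)
    if pos.sum() == 0 or neg.sum() == 0:
        return sim.new_zeros(())
    logits = torch.cat([sim[pos], sim[neg]])
    log_den = torch.logsumexp(logits, dim=0)
    # Maximize similarity to positives relative to the spurious negatives.
    return -(sim[pos] - log_den).mean()
```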

MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation

  • Tao Jin
  • Zhou Zhao
  • Meng Zhang
  • Xingshan Zeng

One of the challenging factors in the real application of sign language translation (SLT) is inter-signer variation. Under the assumption that the pre-trained translation model cannot cover all signers, the adaptation capability for unseen signers is of great concern. In this paper, we take a completely different perspective on SLT, called signer-adaptive SLT, which mainly considers the transferability of SLT systems. To tackle this challenging problem, we propose MC-SLT, a novel meta-learning framework that can exploit additional new-signer data via a support set and output a signer-adaptive model via a few-gradient-step update. Considering the varying degrees of style discrepancy of different words performed by multiple signers, we further devise diversity-aware meta-adaptive weights for the token-wise cross-entropy losses. Besides, to improve training robustness, we adopt a self-guided curriculum learning scheme that first captures the global curricula from each signer to avoid falling into a bad local optimum early, and then learns the curricula of individualities to improve the model's adaptability for learning signer-specific knowledge. We reconstruct the existing standard SLT datasets for the signer-adaptive setting and establish a new benchmark for subsequent research.

Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval

  • Yang Qin
  • Dezhong Peng
  • Xi Peng
  • Xu Wang
  • Peng Hu

Cross-modal retrieval has been a compelling topic in the multimodal community. Recently, to mitigate the high cost of data collection, co-occurring pairs (e.g., image and text) have been collected from the Internet to form large-scale cross-modal datasets, e.g., Conceptual Captions. However, this unavoidably introduces noise (i.e., mismatched pairs) into the training data, dubbed noisy correspondence. Unquestionably, such noise makes the supervision information unreliable/uncertain and remarkably degrades performance. Besides, most existing methods focus training on hard negatives, which amplifies the unreliability of the noise. To address these issues, we propose a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning. CEL captures and learns the uncertainty brought by noise to improve the robustness and reliability of cross-modal retrieval. Specifically, the bidirectional evidence based on cross-modal similarity is first modeled and parameterized into a Dirichlet distribution, which not only provides accurate uncertainty estimation but also imparts resilience against noisy correspondence. To address the amplification problem, RDH smoothly increases the hardness of the negatives focused on, thus achieving higher robustness under heavy noise. Extensive experiments are conducted on three image-text benchmark datasets, i.e., Flickr30K, MS-COCO, and Conceptual Captions, to verify the effectiveness and efficiency of the proposed method. The code is available at https://github.com/QinYang79/DECL.
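
The Dirichlet parameterization mentioned above follows standard evidential-learning bookkeeping; below is a sketch of how non-negative evidence yields a belief and an uncertainty score, not DECL's full bidirectional formulation:

import torch

def dirichlet_uncertainty(evidence: torch.Tensor):
    # evidence: non-negative tensor of shape (..., K); alpha_k = e_k + 1.
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)     # S = sum_k alpha_k
    belief = evidence / strength                   # b_k = e_k / S
    uncertainty = evidence.shape[-1] / strength    # u = K / S
    return belief, uncertainty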

CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling

  • Hongyu Gao
  • Chao Zhu
  • Mengyin Liu
  • Weibo Gu
  • Hongfa Wang
  • Wei Liu
  • Xu-cheng Yin

Image-text retrieval is an essential task in information retrieval, in which models with Vision-and-Language Pretraining (VLP) achieve substantially higher accuracy than those without it. Among different VLP approaches, single-stream models achieve the best overall retrieval accuracy but have slower inference. Recently, researchers have introduced the two-stage retrieval setting commonly used in information retrieval to single-stream VLP models for a better accuracy/efficiency trade-off. However, retrieval accuracy and efficiency are still unsatisfactory, mainly due to the limitations of the patch-based visual unimodal encoder in these VLP models. The unimodal encoders are trained on pure visual data, so the visual features they extract are difficult to align with the textual features, and it is also difficult for the multi-modal encoder to understand the visual information. Under these circumstances, we propose an accurate and efficient two-stage image-text retrieval model via Contrastive Alignment and visual Contexts modeling (CAliC). In the first stage of the proposed model, the visual unimodal encoder is pretrained with cross-modal contrastive learning to extract easily aligned visual features, which improves retrieval accuracy and inference speed. In the second stage, we introduce a new visual contexts modeling task during pretraining to help the multi-modal encoder better understand the visual information and produce more accurate predictions. Extensive experimental evaluation validates the effectiveness of our proposed approach, which achieves higher retrieval accuracy while keeping faster inference, and outperforms existing state-of-the-art retrieval methods on image-text retrieval over the Flickr30K and COCO benchmarks.
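
The two-stage retrieval setting referred to above can be pictured as a cheap recall pass followed by an expensive re-ranking pass; the sketch below is generic, and dual_score / cross_score are hypothetical scoring callables rather than CAliC's actual components:

def two_stage_retrieval(query, gallery, dual_score, cross_score, k=20):
    # Stage 1: score every gallery item with a fast dual-encoder and keep the top-k.
    coarse = sorted(range(len(gallery)),
                    key=lambda i: dual_score(query, gallery[i]),
                    reverse=True)[:k]
    # Stage 2: re-rank the shortlist with a slower single-stream cross-encoder.
    return sorted(coarse, key=lambda i: cross_score(query, gallery[i]), reverse=True)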

Correspondence Matters for Video Referring Expression Comprehension

  • Meng Cao
  • Ji Jiang
  • Long Chen
  • Yuexian Zou

We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent objects described in a sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects. To this end, we propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances dense associations in both the inter-frame and cross-modal manners. Firstly, we build inter-frame correlations for all instances present within the frames. Specifically, we compute the inter-frame patch-wise cosine similarity to estimate the dense alignment and then perform inter-frame contrastive learning to map aligned patches close in feature space. Secondly, we propose to build fine-grained patch-word alignment to associate each patch with certain words. Due to the lack of such detailed annotations, we also predict the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play functions and are applicable to any video REC architecture. For example, building on top of Co-grounding, we boost performance by an absolute 1.48% on Accu.@0.5 for the VID-Sentence dataset.
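
The inter-frame dense alignment rests on patch-wise cosine similarity; below is a minimal sketch of that similarity/assignment step (the feature shapes and softmax assignment are assumptions, and the contrastive losses built on top are omitted):

import torch
import torch.nn.functional as F

def patchwise_alignment(patches_a: torch.Tensor, patches_b: torch.Tensor):
    # patches_a: (N_a, D), patches_b: (N_b, D) patch features of two frames.
    a = F.normalize(patches_a, dim=-1)
    b = F.normalize(patches_b, dim=-1)
    sim = a @ b.t()                   # (N_a, N_b) cosine similarities
    assignment = sim.softmax(dim=-1)  # soft correspondence for each patch in frame A
    return sim, assignment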

Point to Rectangle Matching for Image Text Retrieval

  • Zheng Wang
  • Zhenwei Gao
  • Xing Xu
  • Yadan Luo
  • Yang Yang
  • Heng Tao Shen

The difficulty of image-text retrieval is further exacerbated by the phenomenon of one-to-many correspondence, where a given query may map to multiple semantic manifestations in the other modality. However, the prevailing methods adopt a deterministic embedding strategy to retrieve the most similar candidate, encoding the representations of different modalities as single points in a vector space. Despite the noticeable progress of such methods, we argue that a deterministic point mapping is insufficient to represent the potential set of retrieval results under one-to-many correspondence. As a remedy, we propose a Point to Rectangle Matching (abbreviated as P2RM) mechanism, which is in essence a geometric representation learning method for image-text retrieval. Our intuitive insight is that the representations of different modalities can be extended to rectangles, so that a set of points inside such a rectangle embedding can be semantically related to many candidate correspondences. Thus our P2RM method can essentially address one-to-many correspondence. Besides, we design a novel semantic similarity measurement method, from the perspective of distance, for our rectangle embeddings. Under an evaluation metric for multiple matches, extensive experiments and ablation studies on two commonly used benchmarks demonstrate our effectiveness and superiority in tackling the multiplicity of image-text retrieval.
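
One plausible way to realize a distance-based similarity between a point embedding and a rectangle embedding is an axis-aligned box distance; this is a hedged sketch, not necessarily the measurement used in the paper:

import torch

def point_to_rectangle_distance(point, center, half_extent):
    # Zero if the point lies inside the axis-aligned rectangle defined by
    # (center, half_extent); otherwise the Euclidean distance to its boundary.
    outside = (point - center).abs() - half_extent
    return outside.clamp(min=0).norm(dim=-1)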

Shifting Perspective to See Difference: A Novel Multi-view Method for Skeleton based Action Recognition

  • Ruijie Hou
  • Yanran Li
  • Ningyu Zhang
  • Yulin Zhou
  • Xiaosong Yang
  • Zhao Wang

Skeleton-based human action recognition is a longstanding challenge due to its complex dynamics, in which fine-grained details play a vital role in classification. Existing work largely focuses on designing incremental neural networks with more complicated adjacency matrices to capture the details of joint relationships. However, such models still have difficulty distinguishing actions that have broadly similar motion patterns but belong to different categories. Interestingly, we found that the subtle differences in motion patterns can be significantly amplified, and become easy for an observer to distinguish, when viewed from specific directions; this property has not been fully explored before. Drastically different from previous work, we boost performance by proposing a conceptually simple yet effective multi-view strategy that recognizes actions from a collection of dynamic view features. Specifically, we design a novel Skeleton-Anchor Proposal (SAP) module which contains a multi-head structure to learn a set of views. For feature learning of different views, we introduce a novel Angle Representation to transform the actions under different views and feed the transformations into the baseline model. Our module works seamlessly with existing action classification models. Incorporated with baseline models, our SAP module exhibits clear performance gains on many challenging benchmarks. Moreover, comprehensive experiments show that our model consistently outperforms the state-of-the-art and remains effective and robust, especially when dealing with corrupted data. Related code will be available at https://github.com/ideal-idea/SAP
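
The effect of "shifting perspective" can be illustrated by rotating the skeleton about the vertical axis before feeding it to a classifier; the toy transform below is only a stand-in for the learned view directions and Angle Representation in SAP:

import math
import torch

def rotate_view(joints: torch.Tensor, yaw: float) -> torch.Tensor:
    # joints: (..., 3) 3D skeleton coordinates; rotate about the y (vertical) axis.
    c, s = math.cos(yaw), math.sin(yaw)
    rotation = torch.tensor([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]], dtype=joints.dtype)
    return joints @ rotation.t()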

Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models

  • Yi Zhang
  • Junyang Wang
  • Jitao Sang

Vision-Language Pre-training (VLP) models have achieved state-of-the-art performance in numerous cross-modal tasks. Since they are optimized to capture the statistical properties of intra- and inter-modality data, there remains a risk of also learning the social biases present in the data. In this work, we (1) introduce a counterfactual-based bias measurement, CounterBias, to quantify the social bias in VLP models by comparing the [MASK]ed prediction probabilities of factual and counterfactual samples; (2) construct a novel VL-Bias dataset including 24K image-text pairs for measuring gender bias in VLP models, from which we observe that significant gender bias is prevalent in VLP models; and (3) propose a VLP debiasing method, FairVLP, which minimizes the difference in the [MASK]ed prediction probabilities between factual and counterfactual image-text pairs. Although CounterBias and FairVLP focus on social bias, they are generalizable and can serve as tools that provide new insights to probe and regularize other knowledge in VLP models.
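
Conceptually, the counterfactual measurement compares [MASK]ed prediction probabilities across a factual/counterfactual pair; in the sketch below, masked_prob is a hypothetical wrapper around a VLP model's masked-language-modeling head, not an API from the paper:

def counterfactual_bias_score(masked_prob, image, factual_text, counterfactual_text, target_word):
    # Probability of predicting the target word (e.g., a gendered term) at the
    # [MASK] position for the factual pair vs. its counterfactual counterpart;
    # a larger gap indicates a stronger bias signal.
    p_factual = masked_prob(image, factual_text, target_word)
    p_counterfactual = masked_prob(image, counterfactual_text, target_word)
    return p_factual - p_counterfactual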

Towards Adversarial Attack on Vision-Language Pre-training Models

  • Jiaming Zhang
  • Qi Yi
  • Jitao Sang

While vision-language pre-training (VLP) models have shown revolutionary improvements on various vision-language (V+L) tasks, their adversarial robustness remains largely unexplored. This paper studies adversarial attacks on popular VLP models and V+L tasks. First, we analyze the performance of adversarial attacks under different settings. By examining the influence of different perturbed objects and attack targets, we draw key observations that guide both the design of strong multimodal adversarial attacks and the construction of robust VLP models. Second, we propose a novel multimodal attack method on VLP models, called Collaborative Multimodal Adversarial Attack (Co-Attack), which collectively carries out attacks on the image modality and the text modality. Experimental results demonstrate that the proposed method achieves improved attack performance on different V+L downstream tasks and VLP models. These observations and the new attack method provide new understanding of the adversarial robustness of VLP models, contributing to their safe and reliable deployment in more real-world scenarios.

TPSNet: Reverse Thinking of Thin Plate Splines for Arbitrary Shape Scene Text Representation

  • Wei Wang
  • Yu Zhou
  • Jiahao Lv
  • Dayan Wu
  • Guoqing Zhao
  • Ning Jiang
  • Weiping Wang

The research focus of scene text detection and recognition has shifted to arbitrary-shape text in recent years, for which text shape representation is a fundamental problem. In our opinion, an ideal representation should be compact, complete, efficient, and reusable for subsequent recognition. However, previous representations fall short in one or more of these aspects. The Thin-Plate-Spline (TPS) transformation has achieved great success in scene text recognition. Inspired by this, we reverse its usage and adopt TPS as an exquisite representation for arbitrary-shape text. The TPS representation is compact, complete, and efficient. With the predicted TPS parameters, the detected text region can be directly rectified to a near-horizontal one to assist subsequent recognition. To further exploit the potential of the TPS representation, we propose a Border Alignment Loss. Based on these designs, we implement the text detector TPSNet, which can be conveniently extended to a text spotter. Extensive evaluation and ablation on several public benchmarks demonstrate the effectiveness and superiority of the proposed method for text representation and spotting. In particular, TPSNet improves the detection F-measure by 4.4% (78.4% vs. 74.0%) on the ArT dataset and the end-to-end spotting F-measure by 5.0% (78.5% vs. 73.5%) on Total-Text, which are large margins with no bells and whistles. The source code will be available.

Efficient Modeling of Future Context for Image Captioning

  • Zhengcong Fei

Existing approaches to image captioning usually generate the sentence word-by-word from left to right, conditioned only on local context, i.e., the given image and the previously generated words. Many studies have attempted to make use of global information during decoding, e.g., through iterative refinement, yet it remains under-explored how to incorporate future context both effectively and efficiently. To address this issue, and inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations via a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency with no extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer interchange from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution-calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.
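
The teacher-student step on "unconfident words" can be sketched as a confidence-gated distribution-matching loss between the autoregressive student and the non-autoregressive teacher; the threshold and KL form below are assumptions, not the paper's exact calibration objective:

import torch

def confidence_gated_distillation(student_logits, teacher_logits, threshold=0.5):
    # student_logits / teacher_logits: (B, T, V) per-position vocabulary logits.
    s_prob = student_logits.softmax(dim=-1)
    t_prob = teacher_logits.softmax(dim=-1)
    unconfident = (s_prob.max(dim=-1).values < threshold).float()   # (B, T) gate
    kl = (t_prob * (t_prob.clamp_min(1e-8).log()
                    - s_prob.clamp_min(1e-8).log())).sum(dim=-1)    # (B, T)
    return (kl * unconfident).sum() / unconfident.sum().clamp(min=1.0)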

Relative Pose Estimation for Multi-Camera Systems from Point Correspondences with Scale Ratio

  • Banglei Guan
  • Ji Zhao

The use of multi-camera systems is becoming more common in self-driving cars, micro aerial vehicles, and augmented reality headsets. To perform 3D geometric tasks, the accuracy and efficiency of relative pose estimation algorithms are very important for multi-camera systems, and the problem is attracting significant research attention. The point coordinates of point correspondences (PCs) obtained from feature matching strategies have been widely used for relative pose estimation. This paper exploits, in addition to the point coordinates, the known scale ratios that are intrinsically provided by scale-invariant feature detectors (e.g., SIFT). The two-view geometry of the scale ratio associated with the extracted features is derived for multi-camera systems. Thanks to the constraints provided by the scale ratio across two views, the number of PCs needed for relative pose estimation is reduced from 6 to 3. Requiring fewer PCs makes RANSAC-like randomized robust estimation significantly faster. For different point-correspondence layouts, four minimal solvers are proposed for typical two-camera rigs. Extensive experiments demonstrate that our solvers are more accurate than state-of-the-art ones and outperform them in processing time.

Towards Open-Ended Text-to-Face Generation, Combination and Manipulation

  • Jun Peng
  • Han Pan
  • Yiyi Zhou
  • Jing He
  • Xiaoshuai Sun
  • Yan Wang
  • Yongjian Wu
  • Rongrong Ji

Text-to-face (T2F) generation is an emerging research hotspot in multimedia, and its main challenge lies in the high-fidelity requirement of the generated portraits. Many existing works resort to exploring the latent space of a pre-trained generator, e.g., StyleGAN, which has obvious shortcomings in efficiency and generalization ability. In this paper, we propose a generative network for open-ended text-to-face generation, termed OpenFaceGAN. Differing from existing StyleGAN-based methods, OpenFaceGAN constructs an effective multi-modal latent space that directly converts a natural language description into a face. This mapping paradigm fits the real data distribution well and makes the model capable of open-ended and even zero-shot T2F generation. Our method improves the inference speed by an order of magnitude, e.g., 294 times faster than TediGAN. Based on OpenFaceGAN, we further explore text-guided face manipulation (editing). In particular, we propose a parameterized module, OpenEditor, to automatically disentangle the target latent code and update the original style information. OpenEditor also makes OpenFaceGAN directly applicable to most manipulation instructions without example-dependent searches or optimizations, greatly improving the efficiency of face manipulation. We conduct extensive experiments on two benchmark datasets, namely Multi-Modal CelebA-HQ and Face2Text-v1.0. The experimental results not only show the superior performance of OpenFaceGAN over existing T2F methods in both image quality and image-text matching but also confirm its outstanding ability in zero-shot generation. Code will be released at: https://github.com/pengjunn/OpenFace

Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval

  • Dongqing Wu
  • Huihui Li
  • Cang Gu
  • Lei Guo
  • Hang Liu

In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences; region features also lose details of the objects in the image. Fortunately, these disadvantages of region features are precisely the advantages of grid features. In this paper, we propose a novel framework which fuses region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling their relationships with the joint graph, information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art results and pushes the performance of image-text retrieval to a new height.
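
A minimal sketch of gating between region and grid features of the same dimensionality; the actual Cross-attention Gated Fusion module applies cross-attention before the gate, which is omitted here:

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Learns an element-wise gate that mixes region and grid features.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_feats: torch.Tensor, grid_feats: torch.Tensor):
        # Both inputs are assumed to share shape (..., dim), e.g., after pooling.
        gate = torch.sigmoid(self.gate(torch.cat([region_feats, grid_feats], dim=-1)))
        return gate * region_feats + (1.0 - gate) * grid_feats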

A Numerical DEs Perspective on Unfolded Linearized ADMM Networks for Inverse Problems

  • Weixin An
  • Yingjie Yue
  • Yuanyuan Liu
  • Fanhua Shang
  • Hongying Liu

Many research works show that continuous-time Differential Equations (DEs) allow for a better understanding of traditional Alternating Direction Method of Multipliers (ADMM) algorithms, and many unfolded algorithms directly inherit the traditional iterations to build deep networks. Although these obtain faster convergence rates and superior practical performance, an appropriate explanation of the unfolded network architectures is lacking. We therefore explore the connection between existing unfolded Linearized ADMM (LADMM) and numerical DEs, and propose efficient unfolded network design schemes. First, we present an unfolded Euler LADMM scheme as a by-product, which originates from the Euler method for solving first-order DEs. Then, inspired by the trapezoidal method in numerical DEs, we design a new, more effective network scheme, called the unfolded Trapezoid LADMM scheme. Moreover, we show that the Trapezoid LADMM scheme has higher precision than the Euler LADMM scheme. To the best of our knowledge, this is the first work to explore the connection between unfolded ADMMs and numerical DEs with theoretical guarantees. Finally, we instantiate our Euler LADMM and Trapezoid LADMM schemes as ELADMM and TLADMM with proximal operators, and as ELADMM-Net and TLADMM-Net with convolutional neural networks. Extensive experiments show that our algorithms are competitive with state-of-the-art methods.
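
The precision gap between the two schemes mirrors the classical one between the explicit Euler method (first-order accurate) and the trapezoidal rule (second-order, implicit); below is a generic numerical sketch for an ODE x' = f(x), where solve(g) is a hypothetical fixed-point or Newton solver returning x* with x* = g(x*):

def euler_step(f, x, h):
    # Explicit Euler: x_{k+1} = x_k + h * f(x_k)   (local error O(h^2)).
    return x + h * f(x)

def trapezoid_step(f, x, h, solve):
    # Trapezoidal rule: x_{k+1} = x_k + (h/2) * (f(x_k) + f(x_{k+1}))
    # (implicit; local error O(h^3)); solved for x_{k+1} by `solve`.
    return solve(lambda x_next: x + 0.5 * h * (f(x) + f(x_next)))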

UDoc-GAN: Unpaired Document Illumination Correction with Background Light Prior