MM '21: Proceedings of the 29th ACM International Conference on Multimedia

SESSION: Keynote Talks I&II

Video Coding for Machine

  • Wen Gao

Video coding systems began with TV broadcasting services over satellite and cable
networks with limited bandwidth and were later used for surveillance video and internet
video. They target a higher compression ratio with lower quality loss, under the
trade-off of the RDO (rate-distortion optimization) model, as judged by human experts. In
other words, current video coding standards are good for people, for human visual perception,
but are not designed for machine intelligence. However, more and more industry applications
today require video coding for machine (VCM), which compresses images and video
for machine use, such as object detection and/or tracking, image classification, event analysis,
and so on. These applications target a higher compression ratio with higher recognition accuracy,
under the trade-off of the RAO (rate-accuracy optimization) model, as judged by the system. In
this case, video coding needs to perform feature compression, which preserves and transmits
the most critical information for computer vision and pattern recognition rather than for
human visual perception. Video coding for humans and video coding for machines are therefore
quite different, even though the two systems will coexist for a long time. In
this talk, I will introduce the history of VCM, review some early work on pattern analysis
in the compressed data domain, describe efforts by the ISO/IEC MPEG group on MPEG-7 CDVS
(compact descriptors for visual search) and CDVA (compact descriptors for video analysis),
cover ongoing projects in the AVS and MPEG working groups, present the key techniques
and challenges of VCM, and give an overview of its future.

Semantic Media Conversion: Possibilities and Limits

  • H. V. Jagadish

With recent amazing progress in machine intelligence, it is becoming increasingly
easy to "convert" information reliably from one medium to another. For example, there
is already a regular annual conference on "Text as Data". We will soon have similar
facility to deal with images, videos, music, and so on. Let's call this semantic media conversion.

In this talk, I will outline some possibilities with high quality semantic media conversion.
In particular, it becomes possible to convert all media into alphanumeric data, nicely
organized in structured tables with limited loss of information. Multimedia data,
so converted, becomes easy to use, to aggregate, and to analyze, leading to new Data
Science opportunities.

But this ease of analysis also leads to questions of appropriateness.

  • We shouldn't necessarily do everything that we have the ability to do.
  • What are our values?
  • How do we apply them in practice?
  • What limits do we apply to semantic media conversion and the analysis enabled by it?

SESSION: Session 1: Deep Learning for Multimedia-I

Image Re-composition via Regional Content-Style Decoupling

  • Rong Zhang
  • Wei Li
  • Yiqun Zhang
  • Hong Zhang
  • Jinhui Yu
  • Ruigang Yang
  • Weiwei Xu

Typical image composition harmonizes regions from different images into a single plausible
image. We extend the idea of image composition by introducing the content-style decomposition
and combination to form the concept of image re-composition. In other words, our image
re-composition could arbitrarily combine those contents and styles decomposed from
different images to generate more diverse images in a unified framework. In the decomposition
stage, we incorporate the whitening normalization to obtain a more thorough content-style
decoupling, which substantially improves the re-composition results. Moreover, to
handle the variation of structure and texture of different objects in an image, we
design the network to support regional feature representation and achieve region-aware
content-style decomposition. Regarding the composition stage, we propose a cycle consistency
loss to constrain the network to preserve the content and style information during
the composition. Our method can produce diverse re-composition results, including
content-content, content-style and style-style. Our experimental results demonstrate
a large improvement over the current state-of-the-art methods.

Deep Clustering based on Bi-Space Association Learning

  • Hao Huang
  • Shinjae Yoo
  • Chenxiao Xu

Clustering is the task of instance grouping so that similar ones are grouped into
the same cluster, while dissimilar ones are in different clusters. However, such similarity
is a local concept in regard to different clusters and their relevant feature space.
This work aims to discover clusters by exploring feature association and instance
similarity concurrently. We propose a deep clustering framework that can localize
the search for relevant features appertaining to different clusters. In turn, this
allows for measuring instance similarity that exists in multiple, possibly overlapping,
feature subsets, which contributes to more accurate clustering of instances. Additionally,
the relevant features of each cluster lend interpretability to the clustering results.
Experiments on text and image datasets show that our method outperforms existing state-of-the-art

Feature Stylization and Domain-aware Contrastive Learning for Domain Generalization

  • Seogkyu Jeon
  • Kibeom Hong
  • Pilhyeon Lee
  • Jewook Lee
  • Hyeran Byun

Domain generalization aims to enhance the model robustness against domain shift without
accessing the target domain. Since the available source domains for training are limited,
recent approaches focus on generating samples of novel domains. Nevertheless, they
either struggle with the optimization problem when synthesizing abundant domains or
cause the distortion of class semantics. To this end, we propose a novel domain
generalization framework where feature statistics are utilized for stylizing original
features to ones with novel domain properties. To preserve class information during
stylization, we first decompose features into high and low frequency components. Afterward,
we stylize the low frequency components with the novel domain styles sampled from
the manipulated statistics, while preserving the shape cues in high frequency ones.
As the final step, we re-merge both the components to synthesize novel domain features.
To enhance domain robustness, we utilize the stylized features to maintain the model
consistency in terms of features as well as outputs. We achieve the feature consistency
with the proposed domain-aware supervised contrastive loss, which ensures domain invariance
while increasing class discriminability. Experimental results demonstrate the effectiveness
of the proposed feature stylization and the domain-aware contrastive loss. Through
quantitative comparisons, we verify the lead of our method over existing state-of-the-art
methods on two benchmarks, PACS and Office-Home.
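
The decompose, stylize and re-merge steps described above can be sketched on a one-dimensional feature vector. This is a minimal illustration under stated assumptions, not the authors' implementation: the moving-average low-pass filter, the mean/standard-deviation re-normalization, and the names `moving_average` and `stylize` are hypothetical choices that do not come from the abstract.

```python
import statistics


def moving_average(x, k=3):
    """Low-pass filter (assumed): average each value with its neighbours, edges clamped."""
    n = len(x)
    half = k // 2
    out = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def stylize(feature, new_mean, new_std, k=3, eps=1e-8):
    """Restyle only the low-frequency part of `feature`, then re-merge.

    new_mean / new_std stand in for statistics sampled for a novel domain.
    """
    low = moving_average(feature, k)                 # low-frequency component
    high = [f - l for f, l in zip(feature, low)]     # high-frequency residual (kept as is)
    mu = statistics.fmean(low)
    sigma = statistics.pstdev(low)
    # Normalize the low-frequency part, then apply the new statistics.
    styled_low = [(l - mu) / (sigma + eps) * new_std + new_mean for l in low]
    merged = [s + h for s, h in zip(styled_low, high)]
    return merged, styled_low, high


feature = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
merged, styled_low, high = stylize(feature, new_mean=10.0, new_std=2.0)
```

In this sketch the high-frequency residual passes through untouched, so `merged` differs from the input only in its low-frequency statistics.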

HDA-Net: Horizontal Deformable Attention Network for Stereo Matching

  • Qi Zhang
  • Xuesong Zhang
  • Baoping Li
  • Yuzhong Chen
  • Anlong Ming

Stereo matching is a fundamental and challenging task which has various applications
in autonomous driving, dense reconstruction and other depth related tasks. Contextual
information with discriminative features is crucial for accurate stereo matching in
the ill-posed regions (textureless, occlusion, etc.). In this paper, we propose an
efficient horizontal attention module to adaptively capture the global correspondence
clues. Compared with the popular non-local attention, our horizontal attention is
more effective for stereo matching with better performance and lower consumption of
computation and memory. We further introduce a deformable module to refine the contextual
information in the disparity discontinuous areas such as the boundary of objects.
A learning-based method is adopted to construct the cost volume by concatenating the
features of two branches. To offer an explicit similarity measure that guides the learning-based
volume toward a more reasonable unimodal matching cost distribution, we additionally
combine the learning-based volume with the improved zero-centered group-wise correlation
volume. Finally, we regularize the 4D joint cost volume by a 3D CNN module and generate
the final output by disparity regression. The experimental results show that our proposed
HDA-Net achieves the state-of-the-art performance on the Scene Flow dataset and obtains
competitive performance on the KITTI datasets compared with the relevant networks.
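
Disparity regression is often implemented as a soft argmin over the regularized cost volume. The abstract does not give the exact form, so the sketch below assumes that common formulation for a single pixel; the function name and the convention that lower cost means a better match are assumptions.

```python
import math


def disparity_regression(costs):
    """Expected disparity under softmax(-cost) for one pixel.

    costs[d] is the (assumed) matching cost of disparity d; lower is better.
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    logits = [-c for c in costs]
    peak = max(logits)  # subtract the peak before exponentiating for stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# A unimodal, symmetric cost profile centred on disparity 2.
estimate = disparity_regression([5.0, 5.0, 0.0, 5.0, 5.0])
```

Because the output is a weighted average over all candidate disparities, it is only reliable when the cost distribution is unimodal, which is the property the combined volume above is meant to encourage.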

MBRS: Enhancing Robustness of DNN-based Watermarking by Mini-Batch of Real and Simulated
JPEG Compression

  • Zhaoyang Jia
  • Han Fang
  • Weiming Zhang

Owing to the powerful feature extraction ability of deep learning architectures,
deep-learning-based watermarking algorithms have recently been widely studied. The basic framework
of such algorithms is an auto-encoder-like end-to-end architecture with an encoder,
a noise layer and a decoder. The key to guaranteeing robustness is adversarial training
with a differentiable noise layer. However, we found that none of the existing frameworks
can ensure robustness against JPEG compression, which is non-differentiable
but is an essential image processing operation. To address this limitation,
we propose a novel end-to-end training architecture, which utilizes a Mini-Batch of
Real and Simulated JPEG compression (MBRS) to enhance JPEG robustness. Specifically,
for different mini-batches, we randomly choose one of real JPEG, simulated JPEG and a
noise-free layer as the noise layer. In addition, we suggest utilizing Squeeze-and-Excitation
blocks, which can learn better features in the embedding and extraction stages, and propose
a "message processor" to expand the message in a more appropriate way. Meanwhile, to
improve robustness against crop attacks, we introduce an additive diffusion block
into the network. Extensive experimental results demonstrate the superior
performance of the proposed scheme compared with state-of-the-art algorithms.
Under JPEG compression with quality factor $Q=50$, our models achieve a bit error
rate of less than 0.01% for extracted messages, with PSNR larger than 36 dB for the encoded
images, which shows well-enhanced robustness against JPEG attacks. Besides, under
many other distortions such as Gaussian filter, crop, cropout and dropout, the proposed
framework also obtains strong robustness. The code implemented by PyTorch is available
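
The per-mini-batch choice of noise layer can be sketched as follows. This is only an illustration of the selection step, not the released implementation: the three layer functions are simplified stand-ins operating on plain numbers (hard quantization for real JPEG, a cubic rounding approximation for simulated JPEG), and every name here is hypothetical.

```python
import random


def identity_layer(batch):
    """Noise-free layer: pass the encoded values through unchanged."""
    return list(batch)


def real_jpeg_layer(batch, step=8):
    """Stand-in for real JPEG: hard, non-differentiable quantization."""
    return [round(x / step) * step for x in batch]


def simulated_jpeg_layer(batch, step=8):
    """Stand-in for simulated JPEG: rounding replaced by a smooth cubic approximation."""
    out = []
    for x in batch:
        r = round(x / step)
        out.append((r + (x / step - r) ** 3) * step)
    return out


NOISE_LAYERS = {
    "identity": identity_layer,
    "real_jpeg": real_jpeg_layer,
    "simulated_jpeg": simulated_jpeg_layer,
}


def pick_noise_layer(rng):
    """Draw one noise layer uniformly at random for the current mini-batch."""
    name = rng.choice(sorted(NOISE_LAYERS))
    return name, NOISE_LAYERS[name]


rng = random.Random(0)
for batch in ([3.0, 13.0], [20.0, 41.0]):
    name, layer = pick_noise_layer(rng)
    distorted = layer(batch)  # the decoder would be trained on `distorted`
```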

From Synthetic to Real: Image Dehazing Collaborating with Unlabeled Real Data

  • Ye Liu
  • Lei Zhu
  • Shunda Pei
  • Huazhu Fu
  • Jing Qin
  • Qing Zhang
  • Liang Wan
  • Wei Feng

Single image dehazing is a challenging task, for which the domain shift between synthetic
training data and real-world testing images usually leads to degradation of existing
methods. To address this issue, we propose a novel image dehazing framework collaborating
with unlabeled real data. First, we develop a disentangled image dehazing network
(DID-Net), which disentangles the feature representations into three component maps,
i.e. the latent haze-free image, the transmission map, and the global atmospheric
light estimate, respecting the physical model of a haze process. Our DID-Net predicts
the three component maps by progressively integrating features across scales, and
refines each map by passing it through an independent refinement network. Then a disentangled-consistency
mean-teacher network (DMT-Net) is employed to exploit unlabeled real data for
boosting single image dehazing. Specifically, we encourage the coarse predictions
and refinements of each disentangled component to be consistent between the student
and teacher networks by using a consistency loss on unlabeled real data. We compare our method
with 13 state-of-the-art dehazing methods on a newly collected dataset (Haze4K) and
two widely used dehazing datasets (i.e., SOTS and HazeRD), as well as on real-world
hazy images. Experimental results demonstrate that our method achieves clear quantitative
and qualitative improvements over the existing methods.

SESSION: Session 2: Deep Learning for Multimedia-II

Video Semantic Segmentation via Sparse Temporal Transformer

  • Jiangtong Li
  • Wentao Wang
  • Junjie Chen
  • Li Niu
  • Jianlou Si
  • Chen Qian
  • Liqing Zhang

Currently, video semantic segmentation mainly faces two challenges: 1) the demand
for temporal consistency; 2) the balance between segmentation accuracy and inference
efficiency. For the first challenge, existing methods usually use optical flow to
capture the temporal relation in consecutive frames and maintain the temporal consistency,
but the low inference speed of optical flow limits real-time applications.
For the second challenge, flow based key frame warping is one mainstream solution.
However, the unbalanced inference latency of flow-based key frame warping makes it
unsatisfactory for real-time applications. Considering the segmentation accuracy and
inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to bridge
temporal relation among video frames adaptively, which is also equipped with query
selection and key selection. The key selection and query selection strategies are
separately applied to filter out temporal and spatial redundancy in our temporal transformer.
Specifically, our STT can reduce the time complexity of temporal transformer by a
large margin without harming the segmentation accuracy and temporal consistency. Experiments
on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves
the state-of-the-art segmentation accuracy and temporal consistency with comparable
inference speed.

Diverse Image Inpainting with Bidirectional and Autoregressive Transformers

  • Yingchen Yu
  • Fangneng Zhan
  • Rongliang Wu
  • Jianxiong Pan
  • Kaiwen Cui
  • Shijian Lu
  • Feiying Ma
  • Xuansong Xie
  • Chunyan Miao

Image inpainting is an underdetermined inverse problem, which naturally allows diverse
contents to fill up the missing or corrupted regions realistically. Prevalent approaches
using convolutional neural networks (CNNs) can synthesize visually pleasant contents,
but CNNs suffer from limited receptive fields for capturing global features. With
image-level attention, transformers can model long-range dependencies and generate
diverse contents with autoregressive modeling of pixel-sequence distributions. However,
the unidirectional attention in autoregressive transformers is suboptimal as corrupted
image regions may have arbitrary shapes with contexts from any direction. We propose
BAT-Fill, an innovative image inpainting framework that introduces a novel bidirectional
autoregressive transformer (BAT) for image inpainting. BAT utilizes the transformers
to learn autoregressive distributions, which naturally allows the diverse generation
of missing contents. In addition, it incorporates a masked language model like BERT,
which enables bidirectional modeling of the contextual information of missing regions
for better image completion. Extensive experiments over multiple datasets show that
BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively
and quantitatively.

SSFlow: Style-guided Neural Spline Flows for Face Image Manipulation

  • Hanbang Liang
  • Xianxu Hou
  • Linlin Shen

Significant progress has been made in high-resolution and photo-realistic image generation
by Generative Adversarial Networks (GANs). However, the generation process still
lacks control, which is crucial for semantic face editing. Furthermore, it remains
challenging to edit target attributes and preserve the identity at the same time.
In this paper, we propose SSFlow to achieve identity-preserved semantic face manipulation
in StyleGAN latent space based on conditional Neural Spline Flows. To further improve
the performance of Neural Spline Flows on such tasks, we also propose a Constractive
Squash component and a Blockwise 1 x 1 Convolution layer. Moreover, unlike other conditional
flow-based approaches that require facial attribute labels during inference, our method
can achieve label-free manipulation in a more flexible way. As a result, our methods
are able to perform well-disentangled edits along various attributes, and generalize
well for both real and artistic face image manipulation. Qualitative and quantitative
evaluations show the advantages of our method for semantic face manipulation over
state-of-the-art approaches.

Constrained Graphic Layout Generation via Latent Optimization

  • Kotaro Kikuchi
  • Edgar Simo-Serra
  • Mayu Otani
  • Kota Yamaguchi

In graphic design, humans commonly arrange various elements visually according
to their design intent and semantics. For example, a title text almost always appears
on top of other elements in a document. In this work, we generate graphic layouts
that can flexibly incorporate such design semantics, either specified implicitly or
explicitly by a user. We optimize using the latent space of an off-the-shelf layout
generation model, allowing our approach to be complementary to and used with existing
layout generation models. Our approach builds on a generative layout model based on
a Transformer architecture, and formulates the layout generation as a constrained
optimization problem where design constraints are used for element alignment, overlap
avoidance, or any other user-specified relationship. We show in the experiments that
our approach is capable of generating realistic layouts in both constrained and unconstrained
generation tasks with a single model. The code is available at

Transfer Vision Patterns for Multi-Task Pixel Learning

  • Xiaoya Zhang
  • Ling Zhou
  • Yong Li
  • Zhen Cui
  • Jin Xie
  • Jian Yang

Multi-task pixel perception is one of the most important topics in the field of machine
intelligence. Inspired by the observation of cross-task interdependencies of visual
patterns, we propose a multi-task vision pattern transformation (VPT) method to adaptively
correlate and transfer cross-task visual patterns by leveraging the powerful transformer
mechanism. To better transfer visual patterns, specifically, we build two types of
pattern transformation based on the statistical prior that the affinity relations across
tasks are correlated. One aims to transfer feature patterns for the integration of
different task features; the other aims to exchange structure patterns for mining
and leveraging the latent interaction cues. These two types of transformations are
encapsulated into two VPT units, which provide universal matching interfaces for multi-task
learning, complement each other to guide the transmission of feature/structure patterns,
and finally realize an adaptive selection of important patterns across tasks. Extensive
experiments on the joint learning of semantic segmentation, depth prediction and surface
normal estimation demonstrate that our proposed method is more effective than the
baselines and achieves state-of-the-art performance in three pixel-level visual

Object-aware Long-short-range Spatial Alignment for Few-Shot Fine-Grained Image Classification

  • Yike Wu
  • Bo Zhang
  • Gang Yu
  • Weixi Zhang
  • Bin Wang
  • Tao Chen
  • Jiayuan Fan

The goal of few-shot fine-grained image classification is to recognize rarely seen
fine-grained objects in the query set, given only a few samples of this class in the
support set. Previous works focus on learning discriminative image features from a
limited number of training samples for distinguishing various fine-grained classes,
but ignore one important fact: spatial alignment of the discriminative semantic
features between the query image with arbitrary changes and the support image is
also critical for computing the semantic similarity between each support-query pair.
In this work, we propose an object-aware long-short-range spatial alignment approach,
which is composed of a foreground object feature enhancement (FOE) module, a long-range
semantic correspondence (LSC) module and a short-range spatial manipulation (SSM)
module. The FOE is developed to weaken background disturbance and encourage higher
foreground object response. To address the problem of long-range object feature misalignment
between support-query image pairs, the LSC is proposed to learn the transferable long-range
semantic correspondence by a designed feature similarity metric. Further, the SSM
module is developed to refine the transformed support feature after the long-range
step to align short-range misaligned features (or local details) with the query features.
Extensive experiments have been conducted on four benchmark datasets, and the results
show superior performance over most state-of-the-art methods under both 1-shot and
5-shot classification scenarios.

SESSION: Session 3: Brave New Idea

Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN

  • Yunan Zhu
  • Haichuan Ma
  • Jialun Peng
  • Dong Liu
  • Zhiwei Xiong

Generative adversarial networks (GANs) have been extensively used for training networks
that perform image generation. After training, the discriminator in a GAN is typically no longer
used. We propose to recycle the trained discriminator for another use: no-reference
image quality assessment (NR-IQA). Our motivation is twofold. First, in Wasserstein
GAN (WGAN), the discriminator is designed to calculate the distance between the distribution
of generated images and that of real images; thus, the trained discriminator may encode
the distribution of real-world images. Second, NR-IQA often needs to leverage the
distribution of real-world images for assessing image quality. We then conjecture
that using the trained discriminator for NR-IQA may help get rid of any human-labeled
quality opinion scores and lead to a new opinion-unaware (OU) method. To validate
our conjecture, we start from a restricted NR-IQA problem, that is IQA for artificially
super-resolved images. We train super-resolution (SR) WGAN with two kinds of discriminators:
one is to directly evaluate the entire image, and the other is to work on small patches.
For the latter kind, we obtain patch-wise quality scores, and then have the flexibility
to fuse the scores, e.g., by weighted average. Moreover, we directly extend the trained
discriminators for authentically distorted images that have different kinds of distortions.
Our experimental results demonstrate that the proposed method is comparable to the
state-of-the-art OU NR-IQA methods on SR images and is even better than them on authentically
distorted images. Our method provides a more interpretable approach to NR-IQA. Our
code and models are available at
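
The patch-wise fusion mentioned above (weighted averaging of per-patch scores) can be illustrated with a short sketch. The function name, the example scores and the example weights are hypothetical; the abstract does not say how the weights are chosen.

```python
def fuse_patch_scores(scores, weights=None):
    """Fuse per-patch quality scores into one image-level score by weighted average."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if weights is None:
        weights = [1.0] * len(scores)  # plain average when no weights are given
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical discriminator outputs for four patches, with the first patch
# weighted more heavily (for example, because it is more salient).
image_score = fuse_patch_scores([0.8, 0.4, 0.6, 0.2], weights=[3.0, 1.0, 1.0, 1.0])
```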

Learning Kinematic Formulas from Multiple View Videos

  • Liangchen Song
  • Sheng Liu
  • Celong Liu
  • Zhong Li
  • Yuqi Ding
  • Yi Xu
  • Junsong Yuan

Given a set of multiple view videos that record the motion trajectory of an object,
we propose to find the object's kinematic formulas with neural rendering techniques.
For example, if the input multiple view videos record the free fall motion of an object
with different initial speeds $v$, the network aims to learn its kinematics: $\Delta = vt - \frac{1}{2}gt^2$,
where $\Delta$, $g$ and $t$ are displacement, gravitational acceleration and time. To
achieve this goal, we design a novel framework consisting of a motion network and
a differentiable renderer. For the differentiable renderer, we employ Neural Radiance
Field (NeRF) since the geometry is implicitly modeled by querying coordinates in the
space. The motion network is composed of a series of blending functions and linear
weights, enabling us to analytically derive the kinematic formulas after training.
The proposed framework is trained end to end and only requires knowledge of cameras'
intrinsic and extrinsic parameters. To validate the proposed framework, we design
three experiments to demonstrate its effectiveness and extensibility. The first experiment
is the video of free fall and the framework can be easily combined with the principle
of parsimony, resulting in the correct free fall kinematics. The second experiment
is on the large angle pendulum which does not have analytical kinematics. We use the
differential equation controlling pendulum dynamics as a physical prior in the framework
and demonstrate that the convergence speed becomes much faster. Finally, we study
the explosion animation and demonstrate that our framework can well handle such black-box-generated

DEPA: Self-Supervised Audio Embedding for Depression Detection

  • Pingyue Zhang
  • Mengyue Wu
  • Heinrich Dinkel
  • Kai Yu

Depression detection research has grown over the last few decades, but limited data
availability and representation learning remain a major bottleneck. Recently, self-supervised
learning has seen success in pretraining text embeddings and has been applied broadly
to related tasks with sparse data, while pretrained audio embeddings based on self-supervised
learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained
depression audio embedding method for depression detection. An encoder-decoder network
is used to extract DEPA on in-domain depressed datasets (DAIC and MDD) and out-domain
(Switchboard, Alzheimer's) datasets. With DEPA as the audio embedding extracted at
response-level, a significant performance gain is achieved on downstream tasks, evaluated
on both sparse datasets like DAIC and the large major depression disorder dataset (MDD).
This paper not only presents a novel embedding extraction method that captures
response-level representations for depression detection but, more significantly, is
an exploration of self-supervised learning in a specific task within audio processing.

Retinomorphic Sensing: A Novel Paradigm for Future Multimedia Computing

  • Zhaodong Kang
  • Jianing Li
  • Lin Zhu
  • Yonghong Tian

Conventional frame-based cameras for multimedia computing have encountered important
challenges in high-speed and extreme light scenarios. However, how to design a novel
paradigm for visual perception that overcomes the disadvantages of conventional cameras
still remains an open issue. In this paper, we propose a novel solution, namely retinomorphic
sensing, which integrates fovea-like and peripheral-like sampling mechanisms to generate
asynchronous visual streams using a unified representation as the retina does. Technically,
our encoder incorporates an interaction controller to switch flexibly between dynamic
and static sensing. Then, the decoder effectively extracts dynamic events for machine
vision and reconstructs visual textures for human vision. The results show that our
strategy can sense dynamic events and visual textures while reducing data
redundancy. We further build a prototype hybrid camera system to verify this strategy
on vision tasks such as image reconstruction and object detection. We believe that
this novel paradigm will provide insight into future multimedia computing. The code
is available at

Metaverse for Social Good: A University Campus Prototype

  • Haihan Duan
  • Jiaye Li
  • Sizheng Fan
  • Zhonghao Lin
  • Xiao Wu
  • Wei Cai

In recent years, the metaverse has attracted enormous attention from around the world
with the development of related technologies. The expected metaverse should be a realistic
society with more direct and physical interactions, while the concepts of race, gender,
and even physical disability would be weakened, which would be highly beneficial for
society. However, the development of the metaverse is still in its infancy, with great
potential for improvement. Given the metaverse's huge potential, industry has already
come forward with advance preparation, accompanied by feverish investment, but there
are few discussions about the metaverse in academia to scientifically guide its development.
In this paper, we highlight the representative applications for social good. Then
we propose a three-layer metaverse architecture from a macro perspective, containing
infrastructure, interaction, and ecosystem. Moreover, we journey toward both a historical
and novel metaverse with a detailed timeline and table of specific attributes. Lastly,
we illustrate our implemented blockchain-driven metaverse prototype of a university
campus and discuss the prototype design and insights.

SESSION: Session 4: Deep Learning for Multimedia-III

Enhanced Invertible Encoding for Learned Image Compression

  • Yueqi Xie
  • Ka Leong Cheng
  • Qifeng Chen

Although deep learning based image compression methods have achieved promising progress
these days, the performance of these methods still cannot match the latest compression
standard Versatile Video Coding (VVC). Most of the recent developments focus on designing
a more accurate and flexible entropy model that can better parameterize the distributions
of the latent features. However, few efforts are devoted to structuring a better transformation
between the image space and the latent feature space. In this paper, instead of employing
previous autoencoder style networks to build this transformation, we propose an enhanced
Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate
the information loss problem for better compression. Experimental results on the Kodak,
CLIC, and Tecnick datasets show that our method outperforms the existing learned image
compression methods and compression standards, including VVC (VTM 12.1), especially
for high-resolution images. Our source code is available at

DC-GNet: Deep Mesh Relation Capturing Graph Convolution Network for 3D Human Shape

  • Shihao Zhou
  • Mengxi Jiang
  • Shanshan Cai
  • Yunqi Lei

In this paper, we aim to reconstruct a full 3D human shape from a single image. Previous
vertex-level and parameter regression approaches reconstruct 3D human shape based
on a pre-defined adjacency matrix to encode positive relations between nodes. The
deep topological relations for the surface of the 3D human body are not carefully
exploited. Moreover, the performance of most existing approaches often suffers from
a domain gap when handling more occlusion cases in real-world scenes. In this work,
we propose a Deep Mesh Relation Capturing Graph Convolution Network, DC-GNet, with
a shape completion task for 3D human shape reconstruction. Firstly, we propose to
capture deep relations within mesh vertices, where an adaptive matrix encoding both
positive and negative relations is introduced. Secondly, we propose a shape completion
task to learn a prior over various kinds of occlusion cases. Our approach encodes mesh
structure from more subtle relations between nodes in a more distant region. Furthermore,
our shape completion module alleviates the performance degradation issue in the outdoor
scene. Extensive experiments on several benchmarks show that our approach outperforms
the previous 3D human pose and shape estimation approaches.

Deep Marginal Fisher Analysis based CNN for Image Representation and Classification

  • Xun Cai
  • Jiajing Chai
  • Yanbo Gao
  • Shuai Li
  • Bo Zhu

Deep Convolutional Neural Networks (CNNs) have achieved great success in image classification.
While conventional CNNs optimized with iterative gradient descent algorithms with
large data have been widely used and investigated, there is also research focusing
on learning CNNs with non-iterative optimization methods such as the principal component
analysis network (PCANet). It is very simple and efficient but achieves competitive
performance on some image classification tasks, especially tasks with only a small
amount of data available. This paper further extends this line of research and proposes
a deep Marginal Fisher Analysis (MFA) based CNN, termed DMNet. It addresses the
limitation of PCANet-like CNNs when the samples do not follow a Gaussian distribution,
by using a local MFA for CNN filter optimization. It uses a graph embedding framework
for convolution filter optimization by maximizing the inter-class discriminability
among marginal points while minimizing intra-class distance. Cascaded MFA convolution
layers can be used to construct a deep network. Moreover, a binary stochastic hashing
scheme is developed that randomly selects features for binary hashing with a probability
based on the importance of feature maps. Experimental results demonstrate that the proposed
method achieves state-of-the-art results among non-iteratively optimized CNN methods, and
ablation studies have been conducted to verify the effectiveness of the proposed modules
in our DMNet.

Learning Structure Affinity for Video Depth Estimation

  • Yuanzhouhan Cao
  • Yidong Li
  • Haokui Zhang
  • Chao Ren
  • Yifan Liu

Depth estimation is a structure learning problem. The affinity among neighbouring
pixels plays an important role in inferring depth values. In this paper, we propose
to learn structure affinity in both the spatial and temporal domains for accurate depth
estimation from monocular videos. Specifically, we first propose a convolutional spatial
temporal propagation network (CSTPN) that learns affinity among neighbouring video
frames. Secondly, we employ a structure knowledge distillation scheme that transfers
the spatial temporal affinity learned by a cumbersome network to a compact network. By
calculating pixel-wise similarities between neighbouring frames and neighbouring sequences,
our knowledge distillation scheme efficiently captures both short-term and long-term
spatial temporal affinity. Finally, we apply a warping loss based on optical flow
between video frames to further enforce the temporal affinity. Experimental results
show that our proposed depth estimation approach outperforms the state-of-the-art methods
on both indoor and outdoor benchmark datasets.
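
The abstract does not spell out the distillation loss, so the following is only a minimal sketch of one way pixel-wise affinity between neighbouring frames could be distilled: build a cosine-similarity matrix between the pixel features of two frames for both the teacher and the student, then penalise the squared difference. The function names (`affinity`, `affinity_distillation_loss`) and the toy feature values are illustrative assumptions, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def affinity(frame_a, frame_b):
    """Affinity matrix: entry [i][j] compares pixel i of frame_a with pixel j of frame_b."""
    return [[cosine(p, q) for q in frame_b] for p in frame_a]

def affinity_distillation_loss(teacher_a, teacher_b, student_a, student_b):
    """Mean squared difference between teacher and student affinity matrices (assumed form)."""
    t = affinity(teacher_a, teacher_b)
    s = affinity(student_a, student_b)
    diffs = [(tv - sv) ** 2 for tr, sr in zip(t, s) for tv, sv in zip(tr, sr)]
    return sum(diffs) / len(diffs)

# Toy data: each frame is a list of per-pixel feature vectors.
teacher_t0 = [[1.0, 0.0], [0.0, 1.0]]
teacher_t1 = [[1.0, 1.0], [0.0, 2.0]]

# A student whose features are scaled copies of the teacher's has the same affinities.
student_t0 = [[0.5 * x for x in p] for p in teacher_t0]
student_t1 = [[0.5 * x for x in p] for p in teacher_t1]

# A student that mixes up the two pixels of the first frame does not.
swapped_t0 = list(reversed(student_t0))

loss_matched = affinity_distillation_loss(teacher_t0, teacher_t1, student_t0, student_t1)
loss_swapped = affinity_distillation_loss(teacher_t0, teacher_t1, swapped_t0, student_t1)
```

Because the loss compares affinity matrices rather than raw features, the teacher and student may use different channel widths; only the number of pixels has to match.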

X-GGM: Graph Generative Modeling for Out-of-distribution Generalization in Visual
Question Answering

  • Jingjing Jiang
  • Ziyi Liu
  • Yifan Liu
  • Zhixiong Nan
  • Nanning Zheng

Encouraging progress has been made towards Visual Question Answering (VQA) in recent
years, but it is still challenging to enable VQA models to adaptively generalize to
out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual
concepts (i.e., attributes and objects) can generate compositions unseen in the training
set, which helps VQA models generalize to OOD samples. In this paper, we
formulate OOD generalization in VQA as a compositional generalization problem and
propose a graph generative modeling-based training scheme (X-GGM) to handle the problem
implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation
matrix and node representations for the predefined graph that utilizes attribute-object
pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative
modeling, we propose a gradient distribution consistency loss to constrain the data
distribution with adversarial perturbations and the generated distribution. The baseline
VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance
on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation
studies demonstrate the effectiveness of X-GGM components.

DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

  • Aichun Zhu
  • Zijie Wang
  • Yifeng Li
  • Xili Wan
  • Jing Jin
  • Tian Wang
  • Fangqiang Hu
  • Gang Hua

Many previous methods on text-based person retrieval tasks are devoted to learning
a latent common space mapping, with the purpose of extracting modality-invariant features
from both visual and textual modalities. Nevertheless, due to the complexity of high-dimensional
data, unconstrained mapping paradigms cannot properly capture discriminative
clues about the corresponding person while dropping the misaligned information. Intuitively,
the information contained in visual data can be divided into person information (PI)
and surroundings information (SI), which are mutually exclusive. To
this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model
in this paper to effectively extract and match person information, and hence achieve
a superior retrieval accuracy. A surroundings-person separation and fusion mechanism
plays the key role in realizing an accurate and effective surroundings-person separation
under a mutual exclusion constraint. In order to adequately utilize multi-modal
and multi-granular information for a higher retrieval accuracy, five diverse alignment
paradigms are adopted. Extensive experiments are carried out to evaluate the proposed
DSSL on CUHK-PEDES, which is currently the only accessible dataset for the text-based person
retrieval task. DSSL achieves state-of-the-art performance on CUHK-PEDES. To properly
evaluate our proposed DSSL in real scenarios, a Real Scenarios Text-based Person
Reidentification (RSTPReid) dataset is constructed to benefit future research on text-based
person retrieval, which will be publicly available.

SESSION: Session 5: Emerging Multimedia Applications-I

Diverse Multimedia Layout Generation with Multi Choice Learning

  • David D. Nguyen
  • Surya Nepal
  • Salil S. Kanhere

Designing visually appealing layouts for multimedia documents containing text, graphs
and images requires a form of creative intelligence. Modelling the generation of layouts
has recently gained attention due to its importance in aesthetics and communication
style. In contrast to standard prediction tasks, there is a range of acceptable layouts
which depend on user preferences. For example, a poster designer may prefer logos
on the top-left while another prefers logos on the bottom-right. Both are correct
choices, yet existing machine learning models treat layouts as a single-choice prediction
problem. In such situations, these models would simply average over all possible choices
given the same input, forming a degenerate sample. In the above example, this would
form an unacceptable layout with a logo in the centre.

In this paper, we present an auto-regressive neural network architecture, called LayoutMCL,
that uses multi-choice prediction and winner-takes-all loss to effectively stabilise
layout generation. LayoutMCL avoids the averaging problem by using multiple predictors
to learn a range of possible options for each layout object. This enables LayoutMCL
to generate multiple diverse layouts from a single input, in contrast
with existing approaches, which yield similar layouts with minor variations. Through
quantitative benchmarks on real data (magazine, document and mobile app layouts),
we demonstrate that LayoutMCL reduces Fréchet Inception Distance (FID) by 83-98% and
generates significantly more diversity in comparison to existing approaches.
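
The averaging problem and the winner-takes-all idea can be illustrated with a toy sketch. It is not LayoutMCL itself: the two logo positions, the two hypothetical predictor heads, and the update rule `wta_step` are assumptions chosen only to show why updating the closest predictor alone avoids the degenerate centre layout.

```python
def squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def wta_step(heads, target, lr=0.5):
    """Update only the head closest to the target (winner-takes-all); return its index."""
    winner = min(range(len(heads)), key=lambda i: squared_error(heads[i], target))
    # Gradient-descent step on the squared error (the factor 2 is folded into lr).
    heads[winner] = [h + lr * (t - h) for h, t in zip(heads[winner], target)]
    return winner

# Two equally valid logo positions (x, y) in normalised page coordinates.
targets = [[0.1, 0.1], [0.9, 0.9]]  # top-left, bottom-right

# A single predictor trained with squared error converges to the mean of the targets.
single = [sum(t[d] for t in targets) / len(targets) for d in range(2)]

# Two predictors trained with the winner-takes-all rule specialise instead.
heads = [[0.4, 0.4], [0.6, 0.6]]
for _ in range(20):
    for target in targets:
        wta_step(heads, target)
```

Here `single` lands in the centre of the page, the layout neither designer wanted, while each entry of `heads` moves towards one of the two acceptable positions.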

Viewing from Frequency Domain: A DCT-based Information Enhancement Network for Video Person Re-Identification

  • Liangchen Liu
  • Xi Yang
  • Nannan Wang
  • Xinbo Gao

Video-based person re-identification (Re-ID) aims to match target pedestrians
across non-overlapping cameras using video tracklets. The key issue in video Re-ID
is exploring effective spatio-temporal features. Generally, the spatio-temporal
information of a video sequence can be divided into two aspects: the discriminative
information in each frame and the shared information over the whole sequence. To make
full use of the rich information in video sequences, this paper proposes a Discrete
Cosine Transform based Information Enhancement Network (DCT-IEN) to achieve a more comprehensive
spatio-temporal representation from the frequency domain. Inspired by the observation that
average pooling corresponds to one special frequency component of the DCT (the lowest frequency
component), DCT-IEN first adopts the discrete cosine transform to convert the extracted
feature maps into the frequency domain, thereby retaining more of the information embedded
in different frequency components. With the help of the DCT frequency spectrum, two branches
are adopted to learn the final video representation: Frequency Selection Module (FSM)
and Lowest Frequency Enhancement Module (LFEM). FSM explores the most discriminative
features in each frame by aggregating different frequency components with an attention
mechanism. LFEM enhances the shared feature over the whole video sequence by frame
feature regularization. By fusing these two kinds of features, DCT-IEN finally
achieves a comprehensive video representation. We conduct extensive experiments on two
widely used datasets. The experimental results verify our idea and demonstrate the
effectiveness of DCT-IEN for video-based Re-ID.
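
The relation between average pooling and the lowest DCT frequency mentioned above can be checked numerically. The sketch below uses a plain unnormalised 2-D DCT-II written from its definition; the helper name `dct2d` and the toy feature map are illustrative and not part of DCT-IEN.

```python
import math

def dct2d(x):
    """Unnormalised 2-D DCT-II of an H x W feature map given as nested lists."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(
                x[i][j]
                * math.cos(math.pi * (i + 0.5) * u / h)
                * math.cos(math.pi * (j + 0.5) * v / w)
                for i in range(h)
                for j in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]

feature_map = [
    [1.0, 2.0, 0.0],
    [4.0, 3.0, 2.0],
]
spectrum = dct2d(feature_map)

h, w = len(feature_map), len(feature_map[0])
average_pool = sum(sum(row) for row in feature_map) / (h * w)

# For u = v = 0 both cosines equal 1, so the coefficient is the plain sum of the map.
lowest_frequency = spectrum[0][0] / (h * w)
```

Up to the scaling factor `h * w`, `spectrum[0][0]` equals global average pooling; the remaining coefficients carry the information that pooling discards.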

Unsupervised Portrait Shadow Removal via Generative Priors

  • Yingqing He
  • Yazhou Xing
  • Tianjia Zhang
  • Qifeng Chen

Portrait images often suffer from undesirable shadows cast by casual objects or even
the face itself. While existing methods for portrait shadow removal require training
on a large-scale synthetic dataset, we propose the first unsupervised method for portrait
shadow removal without any training data. Our key idea is to leverage the generative
facial priors embedded in the off-the-shelf pretrained StyleGAN2. To achieve this,
we formulate the shadow removal task as a layer decomposition problem: a shadowed
portrait image is constructed by the blending of a shadow image and a shadow-free
image. We propose an effective progressive optimization algorithm to learn the decomposition
process. Our approach can also be extended to portrait tattoo removal and watermark
removal. Qualitative and quantitative experiments on a real-world portrait shadow
dataset demonstrate that our approach achieves performance comparable to that of supervised
shadow removal methods. Our source code is available at

Multimodal Global Relation Knowledge Distillation for Egocentric Action Anticipation

  • Yi Huang
  • Xiaoshan Yang
  • Changsheng Xu

In this paper, we consider the task of action anticipation on egocentric videos. Previous
methods ignore explicit modeling of the global context relation among past and future
actions, which is not an easy task because the future video is unobserved. To solve
this problem, we propose a Multimodal Global Relation Knowledge Distillation (MGRKD)
framework to distill the knowledge learned from full videos to improve the action
anticipation task on partially observed videos. The proposed MGRKD has a teacher-student
learning strategy, where both the teacher and the student model have three branches of
global relation graph networks (GRGN) to explore the pairwise relations between past
and future actions based on three kinds of features (i.e., RGB, motion and object).
The teacher model has an architecture similar to that of the student model, except that the
teacher model uses the true features of the future video snippet to build the graph in
GRGN while the student model uses a progressive GRU to predict an initialized node
feature of the future snippet in GRGN. Through the teacher-student learning strategy,
the discriminative features and relation knowledge of the past and future actions
learned in the teacher model can be distilled to the student model. The experiments
on two egocentric video datasets EPIC-Kitchens and EGTEA Gaze+ show that the proposed
framework achieves state-of-the-art performance.

Exploring Pathologist Knowledge for Automatic Assessment of Breast Cancer Metastases
in Whole-slide Image

  • Liuan Wang
  • Li Sun
  • Mingjie Zhang
  • Huigang Zhang
  • Wang Ping
  • Rong Zhou
  • Jun Sun

Automatic assessment of breast cancer metastases plays an important role in helping pathologists
reduce the time-consuming work in histopathological whole-slide image diagnosis. From
the viewpoint of knowledge utilization, pathologists carefully check the low-magnification
and high-magnification levels for tumor patterns and cell-level tumor characteristics.
In this paper, we propose a novel automatic patient-level tumor segmentation and classification
method, which makes full use of the diagnosis knowledge clues from pathologists. For
tumor segmentation, a multi-level view DeepLabV3+ (MLV-DeepLabV3+) is designed to
explore the distinguishing features of cell characteristics between tumor and normal
tissue. Furthermore, the expert segmentation models are selected and integrated by
Pareto-front optimization, imitating an expert consultation to reach the best diagnosis.
For whole-slide classification, multi-level magnifications are adaptively checked to
focus on the effective features at different magnifications. The experimental results
demonstrate that our pathologist knowledge-based automatic assessment of whole-slide
images is effective and robust on the public benchmark dataset.

Towards Multiple Black-boxes Attack via Adversarial Example Generation Network

  • Duan Mingxing
  • Kenli Li
  • Lingxi Xie
  • Qi Tian
  • Bin Xiao

Current research on adversarial attacks targets a single model, while attacking
multiple models simultaneously remains challenging. In this paper, we
propose a novel black-box attack method, referred to as MBbA, which can attack multiple
black-boxes at the same time. By encoding an input image and its target category into
an associated space, each decoder seeks the appropriate attack areas from the image
through the designed loss functions, and then generates effective adversarial examples.
This process realizes end-to-end adversarial example generation without involving
substitute models for the black-box scenario. Moreover, when the adversarial
examples generated by MBbA are used for adversarial training, the robustness of the attacked
models is greatly improved. More importantly, those adversarial examples can achieve
satisfactory attack performance even if the black-box models are trained with the
adversarial examples generated by other black-box attack methods, which shows good
transferability. Finally, extensive experiments show that, compared with other state-of-the-art
methods, (1) MBbA takes the least time to obtain the most effective attacks
in the multi-black-box attack scenario and also achieves the highest attack
success rates in the single-black-box attack scenario; (2) the adversarial examples
generated by MBbA can effectively improve the robustness of the attacked models and
exhibit good transferability.

SESSION: Session 6: Emerging Multimedia Applications-II

DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction

  • Hao Feng
  • Yuechen Wang
  • Wengang Zhou
  • Jiajun Deng
  • Houqiang Li

In this work, we propose a new framework, called Document Image Transformer (DocTr),
to address the issue of geometric and illumination distortion in document images.
Specifically, DocTr consists of a geometric unwarping transformer and an illumination
correction transformer. Using a set of learned query embeddings, the geometric
unwarping transformer captures the global context of the document image through a self-attention
mechanism and decodes the pixel-wise displacement solution to correct the geometric
distortion. After geometric unwarping, our illumination correction transformer further
removes the shading artifacts to improve the visual quality and OCR accuracy. Extensive
evaluations are conducted on several datasets, and superior results are reported against
the state-of-the-art methods. Remarkably, our DocTr achieves 20.02% Character Error
Rate (CER), a 15% absolute improvement over the state-of-the-art methods. Moreover,
it also shows high efficiency on running time and parameter count.
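
For readers unfamiliar with the metric, Character Error Rate is commonly defined as the character-level edit distance between the OCR output and the reference text, divided by the reference length. The sketch below follows that common definition; the abstract does not state the exact evaluation protocol used for DocTr, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# One substituted character out of eight.
example_cer = character_error_rate("document", "docunent")
```

Lower is better, and because insertions count as errors the value can exceed 1.0 when the OCR output is much longer than the reference.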

Self-supervised Multi-view Multi-Human Association and Tracking

  • Yiyang Gan
  • Ruize Han
  • Liqiang Yin
  • Wei Feng
  • Song Wang

Multi-view Multi-human association and tracking (MvMHAT) aims to track a group of
people over time in each view, as well as to identify the same person across different
views at the same time. This is a relatively new problem but is very important for
multi-person scene video surveillance. Different from previous multiple object tracking
(MOT) and multi-target multi-camera tracking (MTMCT) tasks, which only consider the
over-time human association, MvMHAT requires jointly achieving both cross-view and
over-time data association. In this paper, we model this problem with a self-supervised
learning framework and leverage an end-to-end network to tackle it. Specifically,
we propose a spatial-temporal association network with two self-supervised
learning losses, a symmetric-similarity loss and a transitive-similarity
loss, applied at each time step to associate multiple humans over time and across views. Besides,
to promote the research on MvMHAT, we build a new large-scale benchmark for the training
and testing of different algorithms. Extensive experiments on the proposed benchmark
verify the effectiveness of our method. We have released the benchmark and code to
the public.
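
The abstract names the two losses without defining them. One plausible reading, sketched below purely as an assumption, is cycle consistency on matching matrices: matching view A to B and back should return each person to themselves (symmetric), and matching A to B to C and back to A should do the same (transitive). The helper names and the hard 0/1 matchings are illustrative; a trained network would produce soft similarity scores instead.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def identity_residual(m):
    """Sum of squared differences between a square matrix and the identity."""
    return sum((m[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(m)) for j in range(len(m)))

def symmetric_residual(s_ab, s_ba):
    """Matching A->B and back B->A should return every person to themselves."""
    return identity_residual(matmul(s_ab, s_ba))

def transitive_residual(s_ab, s_bc, s_ca):
    """Matching A->B->C->A should also close into an identity cycle."""
    return identity_residual(matmul(matmul(s_ab, s_bc), s_ca))

# Hard matchings among three views that each see the same three people.
s_ab = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
s_bc = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
s_ca = transpose(matmul(s_ab, s_bc))  # the C->A matching that closes the cycle

consistent = (symmetric_residual(s_ab, transpose(s_ab)),
              transitive_residual(s_ab, s_bc, s_ca))

# Swapping two people in the C->A matching breaks the cycle.
bad_ca = [s_ca[1], s_ca[0], s_ca[2]]
inconsistent = transitive_residual(s_ab, s_bc, bad_ca)
```

Both residuals need no identity labels, which is what makes constraints of this kind usable as self-supervised training signals.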

Learning Fine-Grained Motion Embedding for Landscape Animation

  • Hongwei Xue
  • Bei Liu
  • Huan Yang
  • Jianlong Fu
  • Houqiang Li
  • Jiebo Luo

In this paper we focus on landscape animation, which aims to generate time-lapse videos
from a single landscape image. Motion is crucial for landscape animation as it determines
how objects move in videos. Existing methods are able to generate appealing videos
by learning motion from real time-lapse videos. However, current methods suffer from
inaccurate motion generation, which leads to unrealistic video results. To tackle
this problem, we propose a model named FGLA to generate high-quality and realistic
videos by learning Fine-Grained motion embedding for Landscape Animation. Our model
consists of two parts: (1) a motion encoder, which embeds time-lapse motion in a fine-grained
way, and (2) a motion generator, which generates realistic motion to animate input images.
To train and evaluate on diverse time-lapse videos, we build the largest high-resolution
Time-lapse video dataset with Diverse scenes, namely Time-lapse-D, which includes
16,874 video clips with over 10 million frames. Quantitative and qualitative experimental
results demonstrate the superiority of our method. In particular, our method achieves
relative improvements of 19% on LPIPS and 5.6% on FVD compared with state-of-the-art
methods on our dataset. A user study carried out with 700 human subjects shows that
our approach visually outperforms existing methods by a large margin.

Multi-label Pattern Image Retrieval via Attention Mechanism Driven Graph Convolutional Network

  • Ying Li
  • Hongwei Zhou
  • Yeyu Yin
  • Jiaquan Gao

Pattern images are artificially designed images that are discriminative in terms
of elements, styles, arrangements and so on. Pattern images are widely used in fields
such as textiles, clothing, art, fashion and graphic design. As the number of images
grows, pattern image retrieval has great potential in commercial applications and
industrial production. However, most existing content-based image retrieval works
focus on describing simple attributes with clear conceptual boundaries, which
are not suitable for pattern image retrieval. It is difficult to accurately represent
and retrieve pattern images which include complex details and multiple elements. Therefore,
in this paper, we collect a new pattern image dataset with multiple labels per image
for the pattern image retrieval task. To extract discriminative semantic features
of multi-label pattern images and construct high-level topology relationships between
features, we further propose an Attention Mechanism Driven Graph Convolutional Network
(AMD-GCN). Different layers of the multi-semantic attention module activate regions
of interest corresponding to multiple labels, respectively. By embedding the learned
labels from the attention module into the graph convolutional network, which can capture
the dependency of labels on the graph manifold, the AMD-GCN builds an end-to-end framework
to extract high-level semantic features with label semantics and inner relationships
for retrieval. Experiments on the pattern image dataset show that the proposed method
highlights the relevant semantic regions of multiple labels, and achieves higher accuracy
than state-of-the-art image retrieval methods.

Collocation and Try-on Network: Whether an Outfit is Compatible

  • Na Zheng
  • Xuemeng Song
  • Qingying Niu
  • Xue Dong
  • Yibing Zhan
  • Liqiang Nie

Is an outfit compatible? Using machine learning methods to assess an outfit's
compatibility, namely, fashion compatibility modeling (FCM), has recently become a
popular yet challenging topic. However, current FCM studies still perform far from
satisfactorily, because they only consider collocation compatibility modeling, while
neglecting the natural human habit of evaluating outfit compatibility
from both the collocation (discrete assessment) and the try-on (unified assessment) perspectives.
In light of the above analysis, we propose a Collocation and Try-On Network (CTO-Net)
for FCM, combining both the collocation and try-on compatibilities. In particular,
for the collocation perspective, we devise a disentangled graph learning scheme, where
the collocation compatibility is disentangled into multiple fine-grained compatibilities
between items; regarding the try-on perspective, we propose an integrated distillation
learning scheme to unify all item information in the whole outfit to evaluate the
compatibility based on the latent try-on representation. To further enhance the collocation
and try-on compatibilities, we exploit the mutual learning strategy to obtain a more
comprehensive judgment. Extensive experiments on the real-world dataset demonstrate
that our CTO-Net significantly outperforms the state-of-the-art methods. In particular,
compared with the competitive counterparts, our proposed CTO-Net significantly improves
AUC accuracy from 83.2% to 87.8% and MRR from 15.4% to 21.8%. We have released our
source code and trained models to benefit other researchers.

MeronymNet: A Hierarchical Model for Unified and Controllable Multi-Category Object Generation

  • Rishabh Baghel
  • Abhishek Trivedi
  • Tejas Ravichandran
  • Ravi Kiran Sarvadevabhatla

We introduce MeronymNet, a novel hierarchical approach for controllable, part-based
generation of multi-category objects using a single unified model. We adopt a guided
coarse-to-fine strategy involving semantically conditioned generation of bounding
box layouts, pixel-level part layouts and ultimately, the object depictions themselves.
We use Graph Convolutional Networks and Deep Recurrent Networks along with custom-designed
Conditional Variational Autoencoders to enable flexible, diverse and category-aware
generation of 2-D objects in a controlled manner. The performance scores for generated
objects reflect MeronymNet's superior performance compared to multiple strong baselines
and ablative variants. We also showcase MeronymNet's suitability for controllable
object generation and interactive object editing at various levels of structural and
semantic granularity.

SESSION: Session 7: Emerging Multimedia Applications-III

Ada-VSR: Adaptive Video Super-Resolution with Meta-Learning

  • Akash Gupta
  • Padmaja Jonnalagedda
  • Bir Bhanu
  • Amit K. Roy-Chowdhury

Most of the existing works in supervised spatio-temporal video super-resolution (STVSR)
heavily rely on a large-scale external dataset consisting of paired low-resolution
low-frame-rate (LR-LFR) and high-resolution high-frame-rate (HR-HFR) videos. Despite
their remarkable performance, these methods make a prior assumption that the low-resolution
video is obtained by down-scaling the high-resolution video using a known degradation
kernel, which does not hold in practical settings. Another problem with these methods
is that they cannot exploit instance-specific internal information of a video at testing
time. Recently, deep internal learning approaches have gained attention due to their
ability to utilize the instance-specific statistics of a video. However, these methods
have a long inference time as they require thousands of gradient updates to learn
the intrinsic structure of the data. In this work, we present Adaptive Video Super-Resolution
(Ada-VSR) which leverages external, as well as internal, information through meta-transfer
learning and internal learning, respectively. Specifically, meta-learning is employed
to obtain adaptive parameters, using a large-scale external dataset, that can adapt
quickly to the novel condition (degradation model) of the given test video during
the internal learning task, thereby exploiting external and internal information of
a video for super-resolution. The model trained using our approach can quickly adapt
to a specific video condition with only a few gradient updates, which reduces the
inference time significantly. Extensive experiments on standard datasets demonstrate
that our method performs favorably against various state-of-the-art approaches.

CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation

  • Minha Kim
  • Shahroz Tariq
  • Simon S. Woo

Over the last few decades, artificial intelligence research has made tremendous strides,
but it still heavily relies on fixed datasets in stationary environments. Continual
learning is a growing field of research that examines how AI systems can learn sequentially
from a continuous stream of linked data in the same way that biological systems do.
Simultaneously, fake media such as deepfakes and synthetic face images have emerged
as a significant issue for current multimedia technologies. Recently, numerous methods have been
proposed that can detect deepfakes with high accuracy. However, they suffer significantly
due to their reliance on fixed datasets in limited evaluation settings. Therefore,
in this work, we apply continual learning to neural networks' learning dynamics,
emphasizing its potential to increase data efficiency significantly. We propose the Continual
Representation using Distillation (CoReD) method, which employs the concepts of Continual
Learning (CL), Representation Learning (RL), and Knowledge Distillation (KD). We design
CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated
synthetic face datasets, while effectively minimizing catastrophic forgetting
in a teacher-student model setting. Our extensive experimental results demonstrate
that our method is efficient at domain adaptation to detect low-quality deepfake
videos and GAN-generated images from several datasets, outperforming state-of-the-art
baseline methods.

SRNet: Spatial Relation Network for Efficient Single-stage Instance Segmentation in Videos

  • Xiaowen Ying
  • Xin Li
  • Mooi Choo Chuah

The task of instance segmentation in videos aims to consistently identify objects
at pixel level throughout the entire video sequence. Existing state-of-the-art methods
either follow the tracking-by-detection paradigm to employ multi-stage pipelines or
directly train a complex deep model to process the entire video clips as 3D volumes.
However, these methods are typically slow and resource-consuming such that they are
often limited to offline processing. In this paper, we propose SRNet, a simple and
efficient framework for joint segmentation and tracking of object instances in videos.
The key to achieving both high efficiency and accuracy in our framework is to formulate
the instance segmentation and tracking problem into a unified spatial-relation learning
task where each pixel in the current frame relates to its object center, and each
object center relates to its location in the previous frame. This unified formulation
allows our framework to perform joint instance segmentation and tracking
through a single stage while maintaining low overheads among different learning tasks.
Our proposed framework can handle two different task settings and demonstrates performance
comparable to state-of-the-art methods on two different benchmarks while running
significantly faster.

Personality Recognition by Modelling Person-specific Cognitive Processes using Graph

  • Zilong Shao
  • Siyang Song
  • Shashank Jaiswal
  • Linlin Shen
  • Michel Valstar
  • Hatice Gunes

Recent research shows that in dyadic and group interactions individuals' nonverbal
behaviours are influenced by the behaviours of their conversational partner(s). Therefore,
in this work we hypothesise that during a dyadic interaction, the target subject's
facial reactions are driven by two main factors: (i) their internal (person-specific)
cognition, and (ii) the externalised nonverbal behaviours of their conversational
partner. Subsequently, our novel proposition is to simulate and represent the target
subject's (i.e., the listener) cognitive process in the form of a person-specific
CNN architecture whose input is the audio-visual non-verbal cues displayed by the
conversational partner (i.e., the speaker), and the output is the target subject's
(i.e., the listener) facial reactions. We then undertake a search for the optimal
CNN architecture whose results are used to create a person-specific graph representation
for recognising the target subject's personality. The graph representation, fortified
with a novel end-to-end edge feature learning strategy, helps with retaining both
the unique parameters of the person-specific CNN and the geometrical relationship
between its layers. Consequently, the proposed approach is the first work that aims
to recognize the true (self-reported) personality of a target subject (i.e., the listener)
from the learned simulation of their cognitive process (i.e., parameters of the person-specific
CNN). The experimental results show that the CNN architectures are well associated
with target subjects' personality traits and the proposed approach clearly outperforms
multiple existing approaches that predict personality directly from non-verbal behaviours.
In light of these findings, this work opens up a new avenue of research for predicting
and recognizing socio-emotional phenomena (personality, affect, engagement etc.) from
simulations of person-specific cognitive processes.

Enhancing Knowledge Tracing via Adversarial Training

  • Xiaopeng Guo
  • Zhijie Huang
  • Jie Gao
  • Mingyu Shang
  • Maojing Shu
  • Jun Sun

We study the problem of knowledge tracing (KT) where the goal is to trace the students'
knowledge mastery over time so as to make predictions on their future performance.
Owing to the good representation capacity of deep neural networks (DNNs), recent advances
on KT have increasingly concentrated on exploring DNNs to improve the performance
of KT. However, we empirically reveal that DNN-based KT models may run the risk
of overfitting, especially on small datasets, leading to limited generalization. In
this paper, by leveraging the current advances in adversarial training (AT), we propose
an efficient AT-based KT method (ATKT) to enhance the KT model's generalization and thus
push the limit of KT. Specifically, we first construct adversarial perturbations and
add them to the original interaction embeddings as adversarial examples. The original
and adversarial examples are further used to jointly train the KT model, forcing it
not only to be robust to the adversarial examples, but also to generalize better
on the original ones. To better implement AT, we then present an efficient attentive-LSTM
model as the KT backbone, where the key is a proposed knowledge hidden state attention
module that adaptively aggregates information from previous knowledge hidden states
while simultaneously highlighting the importance of current knowledge hidden state
to make a more accurate prediction. Extensive experiments on four public benchmark
datasets demonstrate that our ATKT achieves new state-of-the-art performance. Code
is available at:
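
As a rough illustration of the perturbation step described in this abstract, the following minimal Python sketch adds an L2-normalized gradient perturbation to an interaction embedding and combines the clean and adversarial losses. The linear-logistic predictor, the function names, and the `epsilon` and `alpha` parameters are assumptions made for illustration only; the abstract does not state how ATKT constructs its perturbations, and the attentive-LSTM backbone is not reproduced here.

```python
import math


def bce_loss(embedding, weights, label):
    """Binary cross-entropy of a toy linear-logistic predictor (an assumed stand-in model)."""
    z = sum(w * x for w, x in zip(weights, embedding))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def loss_gradient(embedding, weights, label):
    """Analytic gradient of the loss above with respect to the embedding."""
    z = sum(w * x for w, x in zip(weights, embedding))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - label) * w for w in weights]


def adversarial_example(embedding, weights, label, epsilon=0.1):
    """Add a perturbation of L2 norm epsilon along the loss gradient (assumed construction)."""
    grad = loss_gradient(embedding, weights, label)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)  # no gradient signal, so leave the embedding unchanged
    return [x + epsilon * g / norm for x, g in zip(embedding, grad)]


def joint_loss(embedding, weights, label, epsilon=0.1, alpha=1.0):
    """Clean loss plus alpha times the loss on the adversarial example (hypothetical weighting)."""
    adv = adversarial_example(embedding, weights, label, epsilon)
    return bce_loss(embedding, weights, label) + alpha * bce_loss(adv, weights, label)
```

Because the toy loss is convex in the embedding, stepping along its gradient cannot decrease it, so the adversarial example is at least as hard as the original one for this stand-in model.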

Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA

  • Gangyan Zeng
  • Yuan Zhang
  • Yu Zhou
  • Xiaomeng Yang

Text-based visual question answering (TextVQA) requires analyzing both the visual
contents and texts in an image to answer a question, which is more practical than
general visual question answering (VQA). Existing efforts tend to regard optical character
recognition (OCR) as a pre-processing step and then combine it with a VQA framework. This
makes the performance of multimodal reasoning and question answering highly dependent
on the accuracy of OCR. In this work, we address this issue from two perspectives.
First, we take advantage of multimodal cues to complete the semantic information
of texts. A visually enhanced text embedding is proposed to enable understanding of
texts without accurately recognizing them. Second, we further leverage rich contextual
information to modify the answer texts even if the OCR module does not correctly recognize
them. In addition, the visual objects are endowed with semantic representations to
place objects in the same semantic space as OCR tokens. Equipped with these techniques,
the cumulative error propagation caused by poor OCR performance is effectively suppressed.
Extensive experiments on TextVQA and ST-VQA datasets demonstrate that our approach
achieves state-of-the-art performance in terms of accuracy and robustness.

SESSION: Poster Session 1

JPGNet: Joint Predictive Filtering and Generative Network for Image Inpainting

  • Qing Guo
  • Xiaoguang Li
  • Felix Juefei-Xu
  • Hongkai Yu
  • Yang Liu
  • Song Wang

Image inpainting aims to restore the missing regions of corrupted images and make
the recovery result identical to the originally complete image, which is different
from the common generative task emphasizing the naturalness or realism of generated
images. Nevertheless, existing works usually regard it as a pure generation problem
and employ cutting-edge deep generative techniques to address it. The generative networks
can fill the main missing parts with realistic contents but usually distort the local
structures or introduce obvious artifacts. In this paper, for the first time, we formulate
image inpainting as a mix of two problems, i.e., predictive filtering and deep generation.
Predictive filtering is good at preserving local structures and removing artifacts
but falls short in completing large missing regions. The deep generative network
can fill the numerous missing pixels based on the understanding of the whole scene
but can hardly restore details identical to the original ones. To make use of their
respective advantages, we propose the joint predictive filtering and generative network
(JPGNet) that contains three branches: predictive filtering & uncertainty network
(PFUNet), deep generative network, and uncertainty-aware fusion network (UAFNet).
The PFUNet can adaptively predict pixel-wise kernels for filtering-based inpainting
according to the input image and output an uncertainty map. This map indicates whether
each pixel should be processed by the filtering or the generative network, and is further fed
to the UAFNet for a smart combination of the filtering and generative results. Note
that our method, as a novel framework for the image inpainting problem, can benefit
any existing generation-based method. We validate our method on three public datasets,
i.e., Dunhuang, Places2, and CelebA, and demonstrate that our method can enhance three
state-of-the-art generative methods (i.e., StructFlow, EdgeConnect, and RFRNet) significantly
with only slight extra time cost. We have released the code at
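
The role of the uncertainty map in combining the two branches can be pictured with a much simpler stand-in than the learned UAFNet. The sketch below is an assumption-laden illustration, not the paper's fusion network: it treats the uncertainty map as a per-pixel weight in [0, 1] and blends the filtering result and the generative result convexly, so that pixels where filtering is uncertain lean on the generative output. The function name `fuse` and the blending rule are hypothetical.

```python
def fuse(filtered, generated, uncertainty):
    """Per-pixel convex blend of two images given as nested lists of floats.

    uncertainty[i][j] is assumed to lie in [0, 1]: 0 keeps the filtering
    result, 1 keeps the generative result (a hypothetical convention).
    """
    fused = []
    for f_row, g_row, u_row in zip(filtered, generated, uncertainty):
        fused.append([(1.0 - u) * f + u * g for f, g, u in zip(f_row, g_row, u_row)])
    return fused
```

A learned fusion network can model interactions between neighboring pixels that this pixel-wise blend cannot, which is one reason to read the sketch only as intuition for how the map routes pixels between branches.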

AdvFilter: Predictive Perturbation-aware Filtering against Adversarial Attack via
Multi-domain Learning

  • Yihao Huang
  • Qing Guo
  • Felix Juefei-Xu
  • Lei Ma
  • Weikai Miao
  • Yang Liu
  • Geguang Pu

High-level representation-guided pixel denoising and adversarial training are independent
solutions to enhance the robustness of CNNs against adversarial attacks by pre-processing
input data and re-training models, respectively. Most recently, adversarial training
techniques have been widely studied and improved while the pixel denoising-based method
is getting less attractive. However, it is still questionable whether there exists
a more advanced pixel denoising-based method and whether the combination of the two
solutions benefits each other. To this end, we first comprehensively investigate two
kinds of pixel denoising methods for adversarial robustness enhancement (i.e., existing
additive-based and unexplored filtering-based methods) under image-level and
semantic-level loss functions, respectively, showing that pixel-wise filtering can
obtain much higher image quality (e.g., higher PSNR) as well as higher robustness
(e.g., higher accuracy on adversarial examples) than the existing pixel-wise additive-based
method. However, we also observe that the robustness results of the filtering-based
method rely on the perturbation amplitude of adversarial examples used for training.
To address this problem, we propose predictive perturbation-aware & pixel-wise filtering,
where dual-perturbation filtering and an uncertainty-aware fusion module are designed
and employed to automatically perceive the perturbation amplitude during the training
and testing process. The method is termed AdvFilter. Moreover, we combine adversarial
pixel denoising methods with three adversarial training-based methods, suggesting that
considering data and models jointly can achieve more robust CNNs. Experiments
conducted on the NeurIPS-2017DEV, SVHN and CIFAR10 datasets show advantages in enhancing
CNNs' robustness and high generalization to different models and noise levels.

Pixel-level Intra-domain Adaptation for Semantic Segmentation

  • Zizheng Yan
  • Xianggang Yu
  • Yipeng Qin
  • Yushuang Wu
  • Xiaoguang Han
  • Shuguang Cui

Recent advances in unsupervised domain adaptation have achieved remarkable performance
on semantic segmentation tasks. Despite such progress, existing works mainly focus
on bridging the inter-domain gaps between the source and target domains, while only
a few of them have noticed the intra-domain gaps within the target data. In this work, we
propose a pixel-level intra-domain adaptation approach to reduce the intra-domain
gaps within the target data. Compared with image-level methods, ours treats each pixel
as an instance, which adapts the segmentation model at a more fine-grained level.
Specifically, we first conduct the inter-domain adaptation between the source and
target domains; then, we separate the pixels in target images into the easy and hard
subdomains; finally, we propose a pixel-level adversarial training strategy to adapt
a segmentation network from the easy to the hard subdomain. Moreover, we show that
the segmentation accuracy can be further improved by incorporating a continuous indexing
technique in the adversarial training. Experimental results show the effectiveness
of our method against existing state-of-the-art approaches.
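
The abstract does not say how pixels are assigned to the easy and hard subdomains. One plausible criterion, assumed here purely for illustration, is to rank pixels by the entropy of their predicted class distribution and treat the most confident fraction as easy. The function names and the `easy_ratio` parameter below are hypothetical.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_easy_hard(prob_map, easy_ratio=0.5):
    """Split pixel indices into (easy, hard) by ascending prediction entropy.

    prob_map is a flat list of per-pixel class-probability lists. The
    entropy criterion and the fixed ratio are assumptions, not the
    paper's stated procedure.
    """
    order = sorted(range(len(prob_map)), key=lambda i: pixel_entropy(prob_map[i]))
    n_easy = int(len(order) * easy_ratio)
    return sorted(order[:n_easy]), sorted(order[n_easy:])
```

Under this assumption, a pixel predicted as 98% one class lands in the easy subdomain, while a pixel split evenly between two classes lands in the hard one.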

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

  • Xugong Qin
  • Yu Zhou
  • Youhui Guo
  • Dayan Wu
  • Zhihong Tian
  • Ning Jiang
  • Hongbin Wang
  • Weiping Wang

Due to its great success in object detection and instance segmentation, Mask R-CNN
has attracted great attention and is widely adopted as a strong baseline for arbitrary-shaped
scene text detection and spotting. However, two issues remain to be settled. The first
is the dense text case, which is easily neglected but quite practical. There may exist
multiple instances in one proposal, which makes it difficult for the mask head to
distinguish different instances and degrades the performance. In this work, we argue
that the performance degradation results from the learning confusion issue in the
mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in
the mask head, which alleviates the issue and promotes robustness significantly. We
also propose instance-aware mask learning in which the mask head learns to predict the
shape of the whole instance rather than classify each pixel as text or non-text. With
instance-aware mask learning, the mask branch can learn separated and compact masks.
The second is that due to large variations in scale and aspect ratio, RPN needs complicated
anchor settings, making it hard to maintain and transfer across different datasets.
To settle this issue, we propose an adaptive label assignment in which all instances,
especially those with extreme aspect ratios, are guaranteed to be associated with enough
anchors. Equipped with these components, the proposed method named MAYOR achieves
state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015,
CTW1500, and Total-Text.

Windowing Decomposition Convolutional Neural Network for Image Enhancement

  • Chuanjun Zheng
  • Daming Shi
  • Yukun Liu

Image enhancement aims to improve the aesthetic quality of images. Most enhancement
methods are based on image decomposition techniques. For example, an entire image
can be decomposed into a smooth base layer and a residual detail layer. Applying appropriate
algorithms to different layers can solve most enhancement problems. Besides decomposing
the entire image, the local decomposition approach in the local Laplacian filter can also
achieve satisfactory enhancement results. As a standard convolution is also a local operator
whose output values are determined by neighboring pixels, we observe that the standard
convolution can be improved by integrating the local decomposition method to better
solve image enhancement problems. Based on this analysis, we propose Windowing Decomposition
Convolution (WDC) that decomposes the content of each convolution window by a windowing
basic value before applying the convolution operation. Using different windowing basic
values, the WDC can gather global information and locally separate the processing
of different components of images. Moreover, combined with WDC, a new Windowing Decomposition
Convolutional Neural Network (WDCNN) is presented. The experimental results show that
our WDCNN achieves superior enhancement performance on the MIT-Adobe FiveK and sRGB-SID
datasets for noise-free image retouching and low-light noisy image enhancement compared
with state-of-the-art techniques.

Joint Optimization in Edge-Cloud Continuum for Federated Unsupervised Person Re-identification

  • Weiming Zhuang
  • Yonggang Wen
  • Shuai Zhang

Person re-identification (ReID) aims to re-identify a person from non-overlapping
camera views. Since person ReID data contains sensitive personal information, researchers
have adopted federated learning, an emerging distributed training method, to mitigate
the privacy leakage risks. However, existing studies rely on data labels that are
laborious and time-consuming to obtain. We present FedUReID, a federated unsupervised
person ReID system to learn person ReID models without any labels while preserving
privacy. FedUReID enables in-situ model training on edges with unlabeled data. A cloud
server aggregates models from edges instead of centralizing raw data to preserve data
privacy. Moreover, to tackle the problem that edges vary in data volumes and distributions,
we personalize training in edges with joint optimization of cloud and edge. Specifically,
we propose personalized epoch to reassign computation throughout training, personalized
clustering to iteratively predict suitable labels for unlabeled data, and personalized
update to adapt the server aggregated model to each edge. Extensive experiments on
eight person ReID datasets demonstrate that FedUReID not only achieves higher accuracy
but also reduces computation cost by 29%. Our FedUReID system with the joint optimization
will shed light on applying federated learning to more multimedia tasks without
data labels.
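
The server-side step, in which the cloud aggregates edge models rather than raw data, can be sketched as a weighted parameter average. Weighting by each edge's data volume follows the common federated averaging recipe and is an assumption here; the abstract does not specify the aggregation rule used by FedUReID. The function name `aggregate` and the dictionary-of-lists model layout are hypothetical.

```python
def aggregate(edge_models, edge_sizes):
    """Average edge model parameters, weighted by each edge's sample count.

    edge_models: list of dicts mapping parameter names to lists of floats.
    edge_sizes:  list of sample counts, one per edge (assumed weighting).
    """
    total = float(sum(edge_sizes))
    merged = {}
    for name in edge_models[0]:
        length = len(edge_models[0][name])
        merged[name] = [
            sum(model[name][i] * size for model, size in zip(edge_models, edge_sizes)) / total
            for i in range(length)
        ]
    return merged
```

Only parameters travel to the server in this scheme, which is what allows raw images to stay on the edges.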

Multi-view 3D Smooth Human Pose Estimation based on Heatmap Filtering and Spatio-temporal

  • Zehai Niu
  • Ke Lu
  • Jian Xue
  • Haifeng Ma
  • Runchen Wei

The estimation of 3D human poses from time-synchronized, calibrated multi-view video
usually consists of two steps: (1) a 2D detector to locate the 2D coordinate point
position of the joint via heatmaps for each frame and (2) a post-processing method
such as the recursive pictorial structure model or robust triangulation to obtain
3D coordinate points. However, most existing methods are based on a single frame only.
They do not take advantage of the temporal characteristics of the video sequence itself,
and must rely on post-processing algorithms. They are also susceptible to human self-occlusion,
and the generated sequences suffer from jitter. Therefore, we propose a network model
incorporating spatial and temporal features. Using a coarse-to-fine approach, the
proposed heatmap temporal network (HTN) generates temporal heatmap information, with
an occlusion heatmap filter used to filter low-quality heatmaps before they are sent
to the HTN. The heatmap fusion and the triangulation weights are dynamically adjusted,
and intermediate supervision is employed to enable better integration of temporal
and spatial information. Our network is also end-to-end differentiable. This overcomes
the long-standing problem of skeleton jitter being generated and ensures that the
sequence is smooth and stable.

Imitative Learning for Multi-Person Action Forecasting

  • Yu-Ke Li
  • Pin Wang
  • Mang Ye
  • Ching-Yao Chan

Multi-person action forecasting is an emerging task and a pivotal step towards video
understanding. The major challenge lies in estimating a distribution characterizing
the upcoming actions of all individuals in the scene. The state-of-the-art solutions
attempt to solve this problem via a step-by-step prediction procedure. However, they
are not adequate to address some particular limitations, such as the compounding errors,
the innate uncertainty of the future and the spatio-temporal contexts. To handle the
multi-person action forecasting challenges, we put forth a novel imitative learning
framework upon the basis of inverse reinforcement learning. Specifically, we aim to
learn a policy to model the aforementioned distribution up to a coming horizon through
an objective that naturally solves the compounding errors. Such a policy is able to
explore multiple plausible futures via extrapolating a series of latent variables
and taking them into account to generate predictions. The impacts of these latent
variables are further investigated by optimizing the directed information. Moreover,
we reason about the spatial context along with the temporal cue in a single pass using
graph-structured data. The experimental outcomes on two large-scale datasets
reveal that our approach yields considerable improvements in terms of both diversity
and quality with respect to recent leading studies.

Stereo Video Super-Resolution via Exploiting View-Temporal Correlations

  • Ruikang Xu
  • Zeyu Xiao
  • Mingde Yao
  • Yueyi Zhang
  • Zhiwei Xiong

Stereo Video Super-Resolution (StereoVSR) aims to generate high-resolution video streams
from two low-resolution videos under stereo settings. Existing video super-resolution
and stereo image super-resolution techniques can be extended to tackle the StereoVSR
task, yet they cannot make full use of the multi-view and temporal information to
achieve satisfactory performance. In this paper, we propose a novel Stereo Video Super-Resolution
Network (SVSRNet) to fulfill the StereoVSR task via exploiting view-temporal correlations.
First, we devise a view-temporal attention module (VTAM) to integrate cross-time,
cross-view information for constructing high-resolution stereo videos. Second, we
propose a spatial-temporal fusion module (STFM), which aggregates intra-view information
across time to emphasize important features for subsequent restoration.
In addition, we design a view-temporal consistency loss function to enforce a consistency
constraint on super-resolved stereo videos. Comprehensive experimental results demonstrate
that our method generates superior results.

M3TR: Multi-modal Multi-label Recognition with Transformer

  • Jiawei Zhao
  • Yifan Zhao
  • Jia Li

Multi-label image recognition aims to recognize multiple objects simultaneously in
one image. Recent ideas to solve this problem have focused on learning dependencies
of label co-occurrences to enhance the high-level semantic representations. However,
these methods usually neglect the important relations of intrinsic visual structures
and face difficulties in understanding contextual relationships. To build the global
scope of visual context as well as interactions between visual modality and linguistic
modality, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with
ternary relationship learning for inter- and intra-modalities. For the intra-modal
relationship, we combine CNNs and Transformers, which embeds
visual structures into the high-level features by learning the semantic cross-attention.
For constructing the interactions between the visual and linguistic modalities, we
propose a linguistic cross-attention to embed the class-wise linguistic information
into the visual structure learning, and finally present a linguistic guided enhancement
module to enhance the representation of high-level semantics. Experimental evidence
reveals that with the collaborative learning of ternary relationship, our proposed
M3TR achieves new state-of-the-art on two public multi-label recognition benchmarks.

TACR-Net: Editing on Deep Video and Voice Portraits

  • Luchuan Song
  • Bin Liu
  • Guojun Yin
  • Xiaoyi Dong
  • Yufei Zhang
  • Jia-Xuan Bai

Utilizing an arbitrary speech clip to edit the mouth of the portrait in the target
video is a novel yet challenging task. Although impressive results have been achieved,
there are still three limitations in the existing methods: 1) since the acoustic features
are not completely decoupled from person identity, there is no global method for mapping speech
to facial features (i.e., landmarks, expression blendshapes); 2) the audio-driven
talking face sequences generated by a simple cascade structure usually lack temporal
consistency and spatial correlation, which leads to defects in the consistency of
changes in details; 3) the forgery is always performed at the video level, without
considering the forgery of the voice, especially the synchronization of the converted
voice and the mouth. To address these distortion problems, we propose a novel deep
learning framework, named Temporal-Refinement Autoregressive-Cascade Rendering Network
(TACR-Net) for audio-driven dynamic talking face editing. The proposed TACR-Net encodes
facial expression blendshapes based on the given acoustic features without separate
training for each specific video. TACR-Net also involves a novel autoregressive cascade
structure generator for video re-rendering. Finally, we transform the in-the-wild
speech to the target portrait and obtain a photo-realistic and audio-realistic video.

Annotation-Efficient Untrimmed Video Action Recognition

  • Yixiong Zou
  • Shanghang Zhang
  • Guangyao Chen
  • Yonghong Tian
  • Kurt Keutzer
  • José M. F. Moura

Deep learning has achieved great success in recognizing video actions, but the collection
and annotation of training data are still quite laborious, which mainly lies in two
aspects: (1) the amount of required annotated data is large; (2) temporally annotating
the location of each action is time-consuming. Works such as few-shot learning or
untrimmed video recognition have been proposed to handle either one aspect or the
other. However, very few existing works can handle both issues simultaneously. In
this paper, we target a new problem, Annotation-Efficient Video Recognition, to reduce
the annotation requirements for both the large number of samples and the action location.
Such a problem is challenging due to two aspects: (1) the untrimmed videos only have
weak supervision; (2) video segments not relevant to current actions of interest
(background, BG) could contain actions of interest (foreground, FG) in novel classes,
which is a widely existing phenomenon but has rarely been studied in few-shot untrimmed
video recognition. To achieve this goal, by analyzing the property of BG, we categorize
BG into informative BG (IBG) and non-informative BG (NBG), and we propose (1) an open-set
detection based method to find the NBG and FG, (2) a contrastive learning method to
learn IBG and distinguish NBG in a self-supervised way, and (3) a self-weighting mechanism
to better distinguish IBG from FG. Extensive experiments on ActivityNet
v1.2 and ActivityNet v1.3 verify the rationale and effectiveness of the proposed methods.

Face-based Voice Conversion: Learning the Voice behind a Face

  • Hsiao-Han Lu
  • Shao-En Weng
  • Ya-Fan Yen
  • Hong-Han Shuai
  • Wen-Huang Cheng

Zero-shot voice conversion (VC) trained by non-parallel data has gained a lot of attention
in recent years. Previous methods usually extract speaker embeddings from audio and
use them for converting the voices into different voice styles. Since there is a strong
relationship between human faces and voices, a promising approach would be to synthesize
various voice characteristics from face representation. Therefore, we introduce a
novel idea of generating different voice styles from different human face photos,
which can facilitate new applications, e.g., personalized voice assistants. However,
the audio-visual relationship is implicit. Moreover, existing VC models are trained
on laboratory-collected datasets without speaker photos, while the datasets with both
photos and audio are in-the-wild datasets. Directly replacing the target audio with
the target photo and training on the in-the-wild dataset leads to noisy results. To
address these issues, we propose a novel many-to-many voice conversion network, namely
Face-based Voice Conversion (FaceVC), with a 3-stage training strategy. Quantitative
and qualitative experiments on the LRS3-Ted dataset show that the proposed FaceVC
successfully performs voice conversion according to the target face photos. Audio
samples can be found on the demo website at

A Large-Scale Benchmark for Food Image Segmentation

  • Xiongwei Wu
  • Xin Fu
  • Ying Liu
  • Ee-Peng Lim
  • Steven C.H. Hoi
  • Qianru Sun

Food image segmentation is a critical and indispensable task for developing health-related
applications such as estimating food calories and nutrients. Existing food image segmentation
models underperform for two reasons: (1) there is a lack of high-quality
food image datasets with fine-grained ingredient labels and pixel-wise location masks (the
existing datasets either carry coarse ingredient labels or are small in size); and
(2) the complex appearance of food makes it difficult to localize and recognize ingredients
in food images, e.g., the ingredients may overlap one another in the same image, and
the same ingredient may look different in different food images.

In this work, we build a new food image dataset FoodSeg103 (and its extension FoodSeg154)
containing 9,490 images. We annotate these images with 154 ingredient classes and
each image has an average of 6 ingredient labels and pixel-wise masks. In addition,
we propose a multi-modality pre-training approach called ReLeM that explicitly equips
a segmentation model with rich and semantic food knowledge. In experiments, we use
three popular semantic segmentation methods (i.e., Dilated Convolution based [20],
Feature Pyramid based [25], and Vision Transformer based [60]) as baselines, and evaluate
them as well as ReLeM on our new datasets. We believe that the FoodSeg103 (and its
extension FoodSeg154) and the pre-trained models using ReLeM can serve as a benchmark
to facilitate future works on fine-grained food image understanding. We make all these
datasets and methods public at

HAT: Hierarchical Aggregation Transformers for Person Re-identification

  • Guowen Zhang
  • Pingping Zhang
  • Jinqing Qi
  • Huchuan Lu

Recently, with the advance of deep Convolutional Neural Networks (CNNs), person Re-Identification
(Re-ID) has witnessed great success in various applications. However, with the limited
receptive fields of CNNs, it is still challenging to extract discriminative representations
in a global view for persons under non-overlapped cameras. Meanwhile, Transformers
demonstrate strong abilities of modeling long-range dependencies for spatial and sequential
data. In this work, we take advantage of both CNNs and Transformers, and propose a
novel learning framework named Hierarchical Aggregation Transformer (HAT) for image-based
person Re-ID with high performance. To achieve this goal, we first propose a Deeply
Supervised Aggregation (DSA) to recurrently aggregate hierarchical features from CNN
backbones. With multi-granularity supervision, the DSA can enhance multi-scale features
for person retrieval, which is very different from previous methods. Then, we introduce
a Transformer-based Feature Calibration (TFC) to integrate low-level detail information
as the global prior for high-level semantic information. The proposed TFC is inserted
into each level of hierarchical features, resulting in great performance improvements. To
the best of our knowledge, this work is the first to take advantage of both CNNs and Transformers
for image-based person Re-ID. Comprehensive experiments on four large-scale Re-ID benchmarks
demonstrate that our method shows better results than several state-of-the-art methods. The
code is released at

Long-Range Feature Propagating for Natural Image Matting

  • Qinglin Liu
  • Haozhe Xie
  • Shengping Zhang
  • Bineng Zhong
  • Rongrong Ji

Natural image matting estimates the alpha values of unknown regions in the trimap.
Recently, deep learning based methods propagate the alpha values from the known regions
to unknown regions according to the similarity between them. However, we find that
more than 50% of the pixels in the unknown regions cannot be correlated to pixels in known
regions due to the small effective receptive fields of common convolutional
neural networks, which leads to inaccurate estimation when the pixels in the unknown
regions cannot be inferred only with pixels in the receptive fields. To solve this
problem, we propose the Long-Range Feature Propagating Network (LFPNet), which learns
the long-range context features outside the receptive fields for alpha matte estimation.
Specifically, we first design the propagating module which extracts the context features
from the downsampled image. Then, we present Center-Surround Pyramid Pooling (CSPP)
that explicitly propagates the context features from the surrounding context image
patch to the inner center image patch. Finally, we use the matting module which takes
the image, trimap and context features to estimate the alpha matte. Experimental results
demonstrate that the proposed method performs favorably against the state-of-the-art
methods on the AlphaMatting and Adobe Image Matting datasets.

Towards Controllable and Photorealistic Region-wise Image Manipulation

  • Ansheng You
  • Chenglin Zhou
  • Qixuan Zhang
  • Lan Xu

Adaptive and flexible image editing is a desirable function of modern generative models.
In this work, we present a generative model with auto-encoder architecture for per-region
style manipulation. We apply a code consistency loss to enforce an explicit disentanglement
between content and style latent representations, making the content and style of
generated samples consistent with their corresponding content and style references.
The model is also constrained by a content alignment loss to ensure the foreground
editing will not interfere with background contents. As a result, given masks of the regions
of interest provided by users, our model supports foreground region-wise style transfer.
Notably, our model receives no extra annotations such as semantic labels except
for self-supervision. Extensive experiments show the effectiveness of the proposed
method and exhibit the flexibility of the proposed model for various applications,
including region-wise style editing, latent space interpolation, cross-domain style

Information-Growth Attention Network for Image Super-Resolution

  • Zhuangzi Li
  • Ge Li
  • Thomas Li
  • Shan Liu
  • Wei Gao

It is generally known that a high-resolution (HR) image contains more productive information
compared with its low-resolution (LR) versions, so image super-resolution (SR) satisfies
an information-growth process. Considering this property, we attempt to exploit the
growing information via a particular attention mechanism. In this paper, we propose
a concise but effective Information-Growth Attention Network (IGAN) that shows the
incremental information is beneficial for SR. Specifically, a novel information-growth
attention is proposed. It aims to pay attention to features involving large information-growth
capacity by assimilating the difference from current features to the former features
within a network. We also illustrate its effectiveness, in contrast to the widely used self-attention,
using entropy and generalization analysis. Furthermore, existing channel-wise attention
generation modules (CAGMs) suffer from large information attenuation because they directly calculate
the global mean of feature maps. Therefore, we present an innovative CAGM that progressively
decreases feature maps' sizes, leading to more adequate feature exploitation. Extensive
experiments also demonstrate IGAN outperforms state-of-the-art attention-aware SR

Anchor-free 3D Single Stage Detector with Mask-Guided Attention for Point Cloud

  • Jiale Li
  • Hang Dai
  • Ling Shao
  • Yong Ding

Most of the existing single-stage and two-stage 3D object detectors are anchor-based
methods, while the efficient but challenging anchor-free single-stage 3D object detection
is not well investigated. Recent studies on 2D object detection show that the anchor-free
methods also have great potential. However, the unordered and sparse properties
of point clouds prevent us from directly leveraging the advanced 2D methods on 3D
point clouds. We overcome this by converting the voxel-based sparse 3D feature volumes
into sparse 2D feature maps. We propose an attentive module to densify the sparse
feature maps, mostly in the object regions, through the deformable convolution
tower and the supervised mask-guided attention. By directly regressing the 3D bounding
box from the enhanced and dense feature maps, we construct a novel single-stage 3D
detector for point clouds in an anchor-free manner. We propose an IoU-based detection
confidence re-calibration scheme to improve the correlation between the detection
confidence score and the accuracy of the bounding box regression. Our code is publicly
available at
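
The abstract does not give the form of the IoU-based confidence re-calibration. A commonly used form, assumed here only for illustration, combines the classification score with a predicted IoU through a geometric interpolation controlled by an exponent. The function name `recalibrate` and the parameter `beta` are hypothetical.

```python
def recalibrate(cls_score, iou_pred, beta=0.5):
    """Blend a classification score with a predicted IoU (assumed formula).

    beta = 0 returns the classification score unchanged; beta = 1 returns
    the predicted IoU. Both inputs are clamped to [0, 1].
    """
    cls_score = min(max(cls_score, 0.0), 1.0)
    iou_pred = min(max(iou_pred, 0.0), 1.0)
    return (cls_score ** (1.0 - beta)) * (iou_pred ** beta)


# Two hypothetical boxes given as (classification score, predicted IoU).
box_a = (0.9, 0.4)  # confident class, poorly localized
box_b = (0.8, 0.9)  # slightly less confident, well localized
ranked = sorted([box_a, box_b], key=lambda b: recalibrate(*b), reverse=True)
```

With `beta=0.5`, the well-localized `box_b` is ranked ahead of `box_a` even though its raw classification score is lower, which is the kind of correlation between confidence and localization accuracy the scheme aims to improve.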

Shape Controllable Virtual Try-on for Underwear Models

  • Xin Gao
  • Zhenjiang Liu
  • Zunlei Feng
  • Chengji Shen
  • Kairi Ou
  • Haihong Tang
  • Mingli Song

The image virtual try-on task has abundant applications and has become a hot research
topic recently. Existing 2D image-based virtual try-on methods aim to transfer a target
clothing image onto a reference person, which has two main disadvantages: they cannot control
the size and length precisely, and they are unable to accurately estimate the user's figure
when the user wears thick clothing, resulting in inaccurate dressing effects.
In this paper, we put forward a related task that aims to dress underwear
models in clothing. To address the above drawbacks, we propose a Shape Controllable Virtual Try-On
Network (SC-VTON), where a graph attention network integrates the information of model
and clothing to generate the warped clothing image. In addition, the control points
are incorporated into SC-VTON for the desired clothing shape. Furthermore, by adding
a Splitting Network and a Synthesis Network, we can use in-shop clothing/model pair
data to help optimize the deformation module and generalize the task to the typical
virtual try-on task. Extensive experiments show that the proposed method can achieve
accurate shape control. Meanwhile, compared with other methods, our method can generate
high-resolution results with detailed textures, which can be applied in real applications.

E2Net: Excitative-Expansile Learning for Weakly Supervised Object Localization

  • Zhiwei Chen
  • Liujuan Cao
  • Yunhang Shen
  • Feihong Lian
  • Yongjian Wu
  • Rongrong Ji

Weakly supervised object localization (WSOL), which seeks to train localizers with only
image-level labels, has recently gained popularity. However, because they rely heavily
on a classification objective for training, prevailing WSOL methods localize only the discriminative
parts of an object, ignoring other useful information, such as the wings of a bird, and
suffer from severe rotation variations. Moreover, learning object localization under weak
supervision forces CNNs to attend to non-salient regions, which may negatively influence
image classification results. To address these challenges, this paper proposes a novel
end-to-end Excitation-Expansion network, coined E$^2$Net, to localize entire objects
with only image-level labels, which serves as the basis of most multimedia tasks. The
proposed E$^2$Net consists of two key components: Maxout-Attention Excitation (MAE)
and Orientation-Sensitive Expansion (OSE). First, the MAE module aims to activate non-discriminative
localization features while simultaneously recovering discriminative classification
cues. To this end, we efficiently couple an erasing strategy with maxout learning to
facilitate entire-object localization without hurting classification accuracy. Second,
to address rotation variations, the proposed OSE module expands less salient object
parts along all possible orientations. In particular, the OSE module dynamically combines
selective attention banks from variously oriented expansions of the receptive field, which
introduces additional multi-parallel localization heads. Extensive experiments on
ILSVRC 2012 and CUB-200-2011 demonstrate that the proposed E$^2$Net outperforms the
previous state-of-the-art WSOL methods and also significantly improves classification

Few-shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive

  • Jiahao Wang
  • Yunhong Wang
  • Sheng Liu
  • Annan Li

Fine-grained action recognition is attracting increasing attention due to the emerging
demand of specific action understanding in real-world applications, whereas the data
of rare fine-grained categories is very limited. Therefore, we propose the few-shot
fine-grained action recognition problem, aiming to recognize novel fine-grained actions
with only a few samples given for each class. Although progress has been made in coarse-grained
actions, existing few-shot recognition methods encounter two issues when handling fine-grained
actions: the inability to capture subtle action details and the inadequacy in learning
from data with low inter-class variance. To tackle the first issue, a human vision
inspired bidirectional attention module (BAM) is proposed. Combining top-down task-driven
signals with bottom-up salient stimuli, BAM captures subtle action details by accurately
highlighting informative spatio-temporal regions. To address the second issue, we
introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based
method, CML generates more discriminative video representations for low inter-class
variance data, since it makes full use of potential contrastive pairs in each training
episode. Furthermore, to fairly compare different models, we establish specific benchmark
protocols on two large-scale fine-grained action recognition datasets. Extensive experiments
show that our method consistently achieves state-of-the-art performance across evaluated

Selective Dependency Aggregation for Action Classification

  • Yi Tan
  • Yanbin Hao
  • Xiangnan He
  • Yinwei Wei
  • Xun Yang

Video data differ from images in having an extra temporal dimension, which results
in more content dependencies from various perspectives. This increases the difficulty
of learning representations for various video actions. Existing methods mainly focus
on the dependency under a specific perspective, which cannot facilitate the categorization
of complex video actions. This paper proposes a novel selective dependency aggregation
(SDA) module, which adaptively exploits multiple types of video dependencies to refine
the features. Specifically, we empirically investigate various long-range and short-range
dependencies achieved by the multi-direction multi-scale feature squeeze and the dependency
excitation. Query structured attention is then adopted to fuse them selectively, fully
considering the diversity of videos' dependency preferences. Moreover, a channel
reduction mechanism is used in SDA to keep the additional computation
cost low. Finally, we show that the SDA module can be easily plugged
into different backbones to form SDA-Nets and demonstrate its effectiveness, efficiency
and robustness by conducting extensive experiments on several video benchmarks for
action classification. The code and models will be available at

Conditional Directed Graph Convolution for 3D Human Pose Estimation

  • Wenbo Hu
  • Changgong Zhang
  • Fangneng Zhan
  • Lei Zhang
  • Tien-Tsin Wong

Graph convolutional networks have significantly improved 3D human pose estimation
by representing the human skeleton as an undirected graph. However, this representation
fails to reflect the articulated characteristic of human skeletons as the hierarchical
orders among the joints are not explicitly presented. In this paper, we propose to
represent the human skeleton as a directed graph with the joints as nodes and bones
as edges that are directed from parent joints to child joints. By so doing, the directions
of edges can explicitly reflect the hierarchical relationships among the nodes. Based
on this representation, we further propose a spatial-temporal conditional directed
graph convolution to leverage varying non-local dependence for different poses by
conditioning the graph topology on input poses. Altogether, we form a U-shaped network,
named U-shaped Conditional Directed Graph Convolutional Network, for 3D human pose
estimation from monocular videos. To evaluate the effectiveness of our method, we
conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M
and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves
top performance. Also, ablation studies show that directed graphs can better exploit
the hierarchy of articulated human skeletons than undirected graphs, and the conditional
connections can yield adaptive graph topologies for different poses.
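
The difference between the directed and undirected skeleton representations can be sketched as follows. The five-joint toy skeleton, the joint names, and the helper functions are hypothetical and chosen only for illustration; the actual benchmark skeletons have more joints, and the conditional (pose-dependent) connections of the paper are not modeled here.

```python
# Hypothetical toy skeleton: PARENTS[i] is the parent of joint i (-1 marks the root).
JOINT_NAMES = ["pelvis", "spine", "head", "left_hip", "left_knee"]
PARENTS = [-1, 0, 1, 0, 3]


def directed_adjacency(parents):
    """Adjacency matrix with one edge per bone, directed from parent to child."""
    n = len(parents)
    adj = [[0] * n for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[parent][child] = 1
    return adj


def undirected_adjacency(parents):
    """Symmetric adjacency matrix: the same bones with the direction discarded."""
    adj = directed_adjacency(parents)
    n = len(adj)
    return [[1 if adj[i][j] or adj[j][i] else 0 for j in range(n)] for i in range(n)]


def depth(joint, parents):
    """Number of bones between a joint and the root, i.e. its hierarchy level."""
    steps = 0
    while parents[joint] >= 0:
        joint = parents[joint]
        steps += 1
    return steps


directed = directed_adjacency(PARENTS)
undirected = undirected_adjacency(PARENTS)
knee_depth = depth(JOINT_NAMES.index("left_knee"), PARENTS)
```

The directed matrix is asymmetric, so the parent-child order of each bone is readable from the graph itself, whereas the symmetric matrix carries no such hierarchy.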

Cross Chest Graph for Disease Diagnosis with Structural Relational Reasoning

  • Gangming Zhao

Locating lesions is important in the computer-aided diagnosis of X-ray images. However,
box-level annotation is time-consuming and laborious. How to locate lesions accurately
with few, or even no, careful annotations is an urgent problem. Although several
works have approached this problem with weakly-supervised methods, the performance
needs to be improved. One obstacle is that general weakly-supervised methods have
failed to consider the characteristics of X-ray images, such as the highly-structural
attribute. We therefore propose the Cross-chest Graph (CCG), which improves the performance
of automatic lesion detection by imitating doctors' training and decision-making process.
CCG models the intra-image relationship between different anatomical areas by leveraging
the structural information to simulate the doctor's habit of observing different areas.
Meanwhile, the relationship between any pair of images is modeled by a knowledge-reasoning
module to simulate the doctor's habit of comparing multiple images. We integrate intra-image
and inter-image information into a unified end-to-end framework. Experimental results
on the NIH Chest-14 database (112,120 frontal-view X-ray images with 14 diseases)
demonstrate that the proposed method achieves state-of-the-art performance in weakly-supervised
localization of lesions by absorbing professional knowledge in the medical field.

ZiGAN: Fine-grained Chinese Calligraphy Font Generation via a Few-shot Style Transfer

  • Qi Wen
  • Shuang Li
  • Bingfeng Han
  • Yi Yuan

Chinese character style transfer is a very challenging problem because of the complexity
of the glyph shapes or underlying structures and the large number of existing characters,
compared with English letters. Moreover, the handwriting of calligraphy masters
has more irregular strokes and is difficult to obtain in real-world scenarios. Recently,
several GAN-based methods have been proposed for font synthesis, but some of them
require numerous reference data and others have cumbersome preprocessing
steps to divide the character into different parts to be learned and transferred separately.
In this paper, we propose a simple but powerful end-to-end Chinese calligraphy font
generation framework ZiGAN, which does not require any manual operation or redundant
preprocessing to generate fine-grained target style characters with few-shot references.
To be specific, a few paired samples from different character styles are leveraged
to attain fine-grained correlation between structures underlying different glyphs.
To capture valuable style knowledge in target and strengthen the coarse-grained understanding
of character content, we utilize multiple unpaired samples to align the feature distributions
belonging to different character styles. By doing so, only a few target Chinese calligraphy
characters are needed to generate the expected style-transferred characters. Experiments
demonstrate that our method has a state-of-the-art generalization ability in few-shot
Chinese character style transfer.

Cycle-Consistent Inverse GAN for Text-to-Image Synthesis

  • Hao Wang
  • Guosheng Lin
  • Steven C. H. Hoi
  • Chunyan Miao

This paper investigates an open research task of text-to-image synthesis for automatically
generating or manipulating images from text descriptions. Prevailing methods mainly
take the textual descriptions as the conditional input for the GAN generation, and
need to train different models for the text-guided image generation and manipulation
tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse
GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation
tasks. Specifically, we first train a GAN model without text input, aiming to generate
images with high diversity and quality. Then we learn a GAN inversion model to convert
the images back to the GAN latent space and obtain the inverted latent codes for each
image, where we introduce the cycle-consistency training to learn more robust and
consistent inverted latent codes. We further uncover the semantics of the latent space
of the trained GAN model, by learning a similarity model between text representations
and the latent codes. In the text-guided optimization module, we can generate images
with the desired semantic attributes through optimization on the inverted latent codes.
Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our
proposed framework.

Fully Quantized Image Super-Resolution Networks

  • Hu Wang
  • Peng Chen
  • Bohan Zhuang
  • Chunhua Shen

With the rising popularity of intelligent mobile devices, it is of great practical
significance to develop accurate, real-time and energy-efficient image Super-Resolution
(SR) methods. A prevailing method for improving inference efficiency is model quantization,
which allows for replacing the expensive floating-point operations with efficient
bitwise arithmetic. To date, it is still challenging for quantized SR frameworks to
deliver a feasible accuracy-efficiency trade-off. Here, we propose a Fully Quantized
image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
In particular, we target end-to-end quantized models for all layers, including
skip connections, which have rarely been addressed in the literature on SR quantization.
We further identify obstacles faced by low-bit SR networks and propose a novel method
to counteract them accordingly. The difficulties are caused by 1) the high-resolution
feature maps in SR tasks, which occupy a huge amount of memory because of the skip
connections; 2) activation and weight distributions that differ vastly
across layers; 3) the inaccurate approximation of the quantization.
We apply our quantization scheme to multiple mainstream super-resolution architectures,
including SRResNet, SRGAN and EDSR. Experimental results show that our FQSR with low-bit
quantization achieves performance on par with the full-precision
counterparts on five benchmark datasets and surpass the state-of-the-art quantized
SR methods with significantly reduced computational cost and memory consumption. Code
is available at
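
To make the idea of model quantization concrete, here is a minimal sketch of plain uniform quantization with a fixed clipping range. It is a generic textbook scheme, not the FQSR method, and the function names, the 4-bit setting, and the clipping range [-1, 1] are assumptions chosen for illustration.

```python
def quantize(values, num_bits, lo, hi):
    """Map floats to integer codes in [0, 2**num_bits - 1] over the range [lo, hi]."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels  # width of one quantization step
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)  # values outside the range saturate
        codes.append(round((clipped - lo) / scale))
    return codes, scale


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


activations = [-1.0, -0.3, 0.0, 0.45, 1.0, 2.5]
codes, scale = quantize(activations, num_bits=4, lo=-1.0, hi=1.0)
restored = dequantize(codes, scale, lo=-1.0)

# Rounding error for in-range values is bounded by half a quantization step.
max_error = max(abs(r - a) for r, a in zip(restored, activations) if -1.0 <= a <= 1.0)
```

The out-of-range value 2.5 saturates at the top code, which hints at why layers with very different activation and weight distributions, as noted in the abstract, are hard to serve with one fixed range.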

AKECP: Adaptive Knowledge Extraction from Feature Maps for Fast and Efficient Channel

  • Haonan Zhang
  • Longjun Liu
  • Hengyi Zhou
  • Wenxuan Hou
  • Hongbin Sun
  • Nanning Zheng

Pruning can remove redundant parameters and structures of Deep Neural Networks (DNNs)
to reduce inference time and memory overhead. As an important component of neural
networks, the feature map (FM) has started to be adopted for network pruning. However,
the majority of FM-based pruning methods do not fully investigate effective knowledge
in the FM for pruning. In addition, it is challenging to design a robust pruning criterion
with a small number of images and achieve parallel pruning due to the variability
of FMs. In this paper, we propose Adaptive Knowledge Extraction for Channel Pruning
(AKECP), which can compress the network fast and efficiently. In AKECP, we first investigate
the characteristics of FMs and extract effective knowledge with an adaptive scheme.
Secondly, we formulate the effective knowledge of FMs to measure the importance of
corresponding network channels. Thirdly, thanks to the effective knowledge extraction,
AKECP can efficiently and simultaneously prune all the layers with extremely few or
even one image. Experimental results show that our method can compress various networks
on different datasets without introducing additional constraints, and it advances
the state of the art. Notably, for ResNet-110 on CIFAR-10, AKECP reduces parameters by 59.9%
and FLOPs by 59.8% with negligible accuracy loss. For ResNet-50
on ImageNet, AKECP saves 40.5% of the memory footprint and reduces FLOPs by 44.1% with
only 0.32% of Top-1 accuracy drop.
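
The general idea of scoring channels from their feature maps and pruning the weakest ones can be sketched as below. The importance criterion used here (mean absolute activation) is a simple stand-in chosen for illustration and is not the adaptive knowledge extraction of AKECP; the function names and the toy feature maps are likewise hypothetical.

```python
def channel_importance(feature_maps):
    """Score each channel by its mean absolute activation (stand-in criterion).

    feature_maps: list of channels, each a 2D list of activations.
    """
    scores = []
    for channel in feature_maps:
        values = [abs(v) for row in channel for v in row]
        scores.append(sum(values) / len(values))
    return scores


def channels_to_keep(scores, keep_ratio):
    """Return the sorted indices of the highest-scoring channels."""
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Feature maps of one layer computed from a single toy input image.
feature_maps = [
    [[0.0, 0.1], [0.0, 0.1]],    # weakly activated channel
    [[1.0, -2.0], [3.0, 0.5]],   # strongly activated channel
    [[0.2, 0.2], [0.2, 0.2]],
    [[-0.9, 0.8], [0.7, -1.0]],
]
scores = channel_importance(feature_maps)
kept = channels_to_keep(scores, keep_ratio=0.5)
```

Because each layer is scored independently from its own feature maps, a criterion of this kind can in principle be evaluated for all layers in one forward pass.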

Dynamic Momentum Adaptation for Zero-Shot Cross-Domain Crowd Counting

  • Qiangqiang Wu
  • Jia Wan
  • Antoni B. Chan

Zero-shot cross-domain crowd counting is a challenging task where a crowd counting
model is trained on a source domain (i.e., training dataset) and no additional labeled
or unlabeled data is available for fine-tuning the model when testing on an unseen
target domain (i.e., a different testing dataset). The generalisation performance
of existing crowd counting methods is typically limited due to the large gap between
source and target domains. Here, we propose a novel Crowd Counting framework built
upon an external Momentum Template, termed C2MoT, which enables the encoding of domain
specific information via an external template representation. Specifically, the Momentum
Template (MoT) is learned in a momentum updating way during offline training, and
then is dynamically updated for each test image in online cross-dataset evaluation.
Thanks to the dynamically updated MoT, our C2MoT effectively generates dense target
correspondences that explicitly account for head regions, and then effectively predicts
the density map based on the normalized correspondence map. Experiments on large scale
datasets show that our proposed C2MoT achieves leading zero-shot cross-domain crowd
counting performance without model fine-tuning, while also outperforming domain adaptation
methods that use fine-tuning on target domain data. Moreover, C2MoT also obtains state-of-the-art
counting performance on the source domain.
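
A momentum update of a template vector is commonly written as an exponential moving average. The sketch below assumes that form; the momentum value 0.9, the two-dimensional template, and the function name are illustrative assumptions rather than details taken from C2MoT.

```python
def momentum_update(template, feature, momentum=0.9):
    """Exponential moving average: keep most of the template, blend in the new feature."""
    return [momentum * t + (1.0 - momentum) * f for t, f in zip(template, feature)]


# Start from an empty template and observe the same feature three times.
template = [0.0, 0.0]
for feature in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    template = momentum_update(template, feature)
```

After k identical observations the template equals (1 - momentum**k) times the feature, so it drifts smoothly toward new data instead of jumping to it.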

Auto-MSFNet: Search Multi-scale Fusion Network for Salient Object Detection

  • Miao Zhang
  • Tingwei Liu
  • Yongri Piao
  • Shunyu Yao
  • Huchuan Lu

Multi-scale feature fusion plays a critical role in salient object detection. Most
existing methods have achieved remarkable performance by exploiting various multi-scale
feature fusion strategies. However, an elegant fusion framework requires expert knowledge
and experience, heavily relying on laborious trial and error. In this paper, we propose
a multi-scale feature fusion framework based on Neural Architecture Search (NAS),
named Auto-MSFNet. First, we design a novel search cell, named FusionCell, to automatically
decide multi-scale feature aggregation. Rather than searching for one repeatable cell
to be stacked, we allow different FusionCells to flexibly integrate multi-level features.
Simultaneously, considering features generated from CNNs are naturally spatial and
channel-wise, we propose a new search space for efficiently focusing on the most relevant
information. The search space mitigates incomplete object structures or over-predicted
foreground regions caused by progressive fusion. Second, we propose a progressive
polishing loss to further obtain exquisite boundaries by penalizing misalignment of
salient object boundaries. Extensive experiments on five benchmark datasets demonstrate
the effectiveness of the proposed method, which achieves state-of-the-art performance
on four evaluation metrics. The code and results of our method are available at

Few-shot Unsupervised Domain Adaptation with Image-to-Class Sparse Similarity Encoding

  • Shengqi Huang
  • Wanqi Yang
  • Lei Wang
  • Luping Zhou
  • Ming Yang

This paper investigates a valuable setting called few-shot unsupervised domain adaptation
(FS-UDA), which has not been sufficiently studied in the literature. In this setting,
the source domain data are labelled, but with few-shot per category, while the target
domain data are unlabelled. To address the FS-UDA setting, we develop a general UDA
model to solve the following two key issues: the few-shot labeled data per category
and the domain adaptation between support and query sets. Our model is general in
that once trained it will be able to be applied to various FS-UDA tasks from the same
source and target domains. Inspired by the recent local descriptor based few-shot
learning (FSL), our general UDA model is fully built upon local descriptors (LDs)
for image classification and domain adaptation. By proposing a novel concept called
similarity patterns (SPs), our model not only effectively considers the spatial relationship
of LDs that was ignored in previous FSL methods, but also makes the learned image
similarity better serve the required domain alignment. Specifically, we propose a
novel IMage-to-class sparse Similarity Encoding (IMSE) method. It learns SPs to extract
the local discriminative information for classification and meanwhile aligns the covariance
matrix of the SPs for domain adaptation. Also, domain adversarial training and multi-scale
local feature matching are performed upon LDs. Extensive experiments conducted on
the multi-domain benchmark dataset DomainNet demonstrate the state-of-the-art performance
of our IMSE for the novel setting of FS-UDA. In addition, for FSL, our IMSE can also
show better performance than most recent FSL methods on miniImageNet.

Semantic-aware Transfer with Instance-adaptive Parsing for Crowded Scenes Pose Estimation

  • Xuanhan Wang
  • Lianli Gao
  • Yan Dai
  • Yixuan Zhou
  • Jingkuan Song

Human pose estimation in crowded scenes remains challenging, as it requires joint comprehension
of multiple persons and their keypoints in a highly complex scenario. The top-down mechanism,
which is a detect-then-estimate pipeline, has become the mainstream solution for general
pose estimation and has made impressive progress. However, simply applying this mechanism
to crowded-scene pose estimation results in unsatisfactory performance due to several
issues, in particular missing keypoints in crowds and ambiguous labeling
during training. To tackle these two issues, we introduce a novel method named Semantic-aware
Transfer with Instance-adaptive Parsing (STIP). Specifically, our STIP first enhances
the discriminative power of pixel-level representations with a semantic-aware mechanism,
where it smartly decides which pixels to enhance and what semantic embeddings to add.
In this way, the problem of missing keypoint detections can be alleviated. Second, instead of
adopting a standard regressor with fixed parameters, we propose a new instance-adaptive
parsing method, which dynamically generates instance-specific parameters to reduce
the adverse effects caused by ambiguous labeling. Notably, STIP is designed in a plugin
fashion and can be integrated into any top-down model, such as HRNet. Extensive
experiments on two challenging benchmarks, i.e., CrowdPose and MS-COCO, demonstrate
the superiority and generalizability of our approach.

Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding

  • Haoyu Zhang
  • Meng Liu
  • Zan Gao
  • Xiaoqiang Lei
  • Yinglong Wang
  • Liqiang Nie

Multimodal dialog systems have attracted increasing attention from both academia and
industry over recent years. Although existing methods have achieved some progress,
they are still confronted with challenges in question understanding
(i.e., user intention comprehension). In this paper, we present a relational graph-based
context-aware question understanding scheme, which enhances the user intention comprehension
from local to global. Specifically, we first utilize multiple attribute matrices as
the guidance information to fully exploit the product-related keywords from each textual
sentence, strengthening the local representation of user intentions. Afterwards, we
design a sparse graph attention network to adaptively aggregate effective context
information for each utterance, completely understanding the user intentions from
a global perspective. Moreover, extensive experiments over a benchmark dataset show
the superiority of our model compared with several state-of-the-art baselines.

Shadow Detection via Predicting the Confidence Maps of Shadow Detection Methods

  • Jingwei Liao
  • Yanli Liu
  • Guanyu Xing
  • Housheng Wei
  • Jueyu Chen
  • Songhua Xu

Today's mainstream shadow detection methods are manually designed via a case-by-case
approach. Accordingly, these methods may only be able to detect shadows for specific
scenes. Given the complex and diverse shadow scenes in reality, none of the existing
methods can provide a one-size-fits-all solution with satisfactory performance. To
address this problem, this paper introduces a new concept, named shadow detection
confidence, which can be used to evaluate the effect of any shadow detection method
for any given scene. The best detection effect for a scene is achieved by combining
the prediction results of multiple methods. To measure the shadow detection confidence
characteristics of an image, a novel relative confidence map prediction network (RCMPNet)
is proposed. Experimental results show that the proposed method outperforms multiple
state-of-the-art shadow detection methods on four shadow detection benchmark datasets.
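
One simple way to combine several shadow detectors using per-pixel confidence maps is to take, at each pixel, the prediction of the method with the highest confidence. The abstract does not state the fusion rule, so this argmax rule, the function name, and the toy 2x2 masks below are assumptions for illustration only.

```python
def fuse_by_confidence(predictions, confidences):
    """Per pixel, copy the prediction of the method whose confidence is highest.

    predictions, confidences: one 2D grid per method, all with the same shape.
    """
    rows, cols = len(predictions[0]), len(predictions[0][0])
    fused = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = max(range(len(predictions)), key=lambda m: confidences[m][r][c])
            fused[r][c] = predictions[best][r][c]
    return fused


# Binary shadow masks from two hypothetical methods and their confidence maps.
mask_a = [[1, 0], [1, 1]]
mask_b = [[0, 1], [0, 1]]
conf_a = [[0.9, 0.2], [0.4, 0.8]]
conf_b = [[0.1, 0.7], [0.6, 0.3]]

fused = fuse_by_confidence([mask_a, mask_b], [conf_a, conf_b])
```

In this sketch the confidence maps are given by hand; in the setting of the abstract they would be predicted by a network for each method and image.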

Motion Prediction via Joint Dependency Modeling in Phase Space

  • Pengxiang Su
  • Zhenguang Liu
  • Shuang Wu
  • Lei Zhu
  • Yifang Yin
  • Xuanjing Shen

Motion prediction is a classic problem in computer vision, which aims at forecasting
future motion given the observed pose sequence. Various deep learning models have
been proposed, achieving state-of-the-art performance on motion prediction. However,
existing methods typically focus on modeling temporal dynamics in the pose space.
Unfortunately, the complicated and high-dimensional nature of human motion brings
inherent challenges for dynamic context capturing. Therefore, we move away from the
conventional pose based representation and present a novel approach employing a phase
space trajectory representation of individual joints. Moreover, current methods tend
to only consider the dependencies between physically connected joints. In this paper,
we introduce a novel convolutional neural model to effectively leverage explicit prior
knowledge of motion anatomy, and simultaneously capture both spatial and temporal
information of joint trajectory dynamics. We then propose a global optimization module
that learns the implicit relationships between individual joint features. Empirically,
our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M,
CMU MoCap). The results demonstrate that our method sets a new state of the art
on the benchmark datasets. Our code is released at
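
A phase-space view of a joint trajectory pairs each position with its rate of change. The sketch below uses a forward finite difference for the velocity and a single coordinate of a single joint; both choices, along with the function name and the toy data, are assumptions for illustration and not necessarily the representation used in the paper.

```python
def phase_space_trajectory(positions, dt=1.0):
    """Map a sequence of 1D joint coordinates to (position, velocity) pairs.

    Velocity is estimated with a forward difference, so the result has one
    fewer element than the input sequence.
    """
    trajectory = []
    for t in range(len(positions) - 1):
        velocity = (positions[t + 1] - positions[t]) / dt
        trajectory.append((positions[t], velocity))
    return trajectory


# One coordinate of one joint over four frames (toy data).
positions = [0.0, 1.0, 4.0, 9.0]
trajectory = phase_space_trajectory(positions)
```

Each point of the resulting trajectory encodes both where the joint is and how it is moving, which a raw pose sequence expresses only implicitly across frames.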

Q-Art Code: Generating Scanning-robust Art-style QR Codes by Deformable Convolution

  • Hao Su
  • Jianwei Niu
  • Xuefeng Liu
  • Qingfeng Li
  • Ji Wan
  • Mingliang Xu

The Quick Response (QR) code is a popular form of matrix barcode that is widely used
to tag online links on print media (e.g., posters, leaflets, and books). However,
standard QR codes typically appear as noise-like black/white squares (named modules)
which seriously disrupt the attractiveness of their carriers. In this paper, we propose
StyleCode-Net, a method to generate novel art-style QR codes which can better match
the entire style of their carriers to improve the visual quality. For endowing QR
codes with artistic elements, a big challenge is that the scanning-robustness must
be preserved after transforming colors and textures. To address these issues, we propose
a module-based deformable convolutional mechanism (MDCM) and a dynamic target mechanism
(DTM) in StyleCode-Net. MDCM can extract the features of black and white modules of
QR codes respectively. Then, the extracted features are fed to DTM to balance the
scanning-robustness and the style representation. Extensive subjective and objective
experiments show that our art-style QR codes have reached the state-of-the-art level
in both visual quality and scanning-robustness, and these codes have the potential
to replace standard QR codes in real-world applications.

Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection

  • Wenbo Zhang
  • Ge-Peng Ji
  • Zhuo Wang
  • Keren Fu
  • Qijun Zhao

RGB-D salient object detection (SOD) has recently attracted increasing research interest
by benefiting conventional RGB SOD with extra depth information. However, existing
RGB-D SOD models often fail to perform well in terms of both efficiency and accuracy,
which hinders their potential applications on mobile devices and real-world problems.
An underlying challenge is that the model accuracy usually degrades when the model
is simplified to have few parameters. To tackle this dilemma and also inspired by
the fact that depth quality is a key factor influencing the accuracy, we propose a
novel depth quality-inspired feature manipulation (DQFM) process, which is efficient
itself and can serve as a gating mechanism for filtering depth features to greatly
boost the accuracy. DQFM resorts to the alignment of low-level RGB and depth features,
as well as holistic attention of the depth stream to explicitly control and enhance
cross-modal fusion. We embed DQFM to obtain an efficient light-weight model called
DFM-Net, where we also design a tailored depth backbone and a two-stage decoder for
further efficiency consideration. Extensive experimental results demonstrate that
our DFM-Net achieves state-of-the-art accuracy compared to existing non-efficient
models, and meanwhile runs at 140ms on CPU (2.2x faster than the prior fastest efficient
model) with only ~8.5Mb model size (14.9% of the prior lightest). Our code will be
available at

Revisiting Mid-Level Patterns for Cross-Domain Few-Shot Recognition

  • Yixiong Zou
  • Shanghang Zhang
  • Jianpeng Yu
  • Yonghong Tian
  • José M. F. Moura

Existing few-shot learning (FSL) methods usually assume base classes and novel classes
are from the same domain (in-domain setting). However, in practice, it may be infeasible
to collect sufficient training samples for some special domains to construct base
classes. To solve this problem, cross-domain FSL (CDFSL) was proposed very recently
to transfer knowledge from general-domain base classes to special-domain novel classes.
Existing CDFSL works mostly focus on transferring between near domains and rarely
consider transferring between distant domains, which is needed in practice because any novel
classes could appear in real-world applications, and which is even more challenging. In
this paper, we study a challenging subset of CDFSL where the novel classes are in
distant domains from base classes, by revisiting the mid-level features, which are
more transferable yet under-explored in mainstream FSL work. To boost the discriminability
of mid-level features, we propose a residual-prediction task to encourage mid-level
features to learn discriminative information of each sample. Notably, such a mechanism
also benefits the in-domain FSL and CDFSL in near domains. Therefore, we provide two
types of features for both cross- and in-domain FSL respectively, under the same training
framework. Experiments under both settings on six public datasets, including two challenging
medical datasets, validate our rationale and demonstrate state-of-the-art performance.
Code will be released.

Space-Angle Super-Resolution for Multi-View Images

  • Yuqi Sun
  • Ri Cheng
  • Bo Yan
  • Shili Zhou

The limited spatial and angular resolutions in multi-view multimedia applications
restrict their visual experience in practical use. In this paper, we first pose the
space-angle super-resolution (SASR) problem for irregularly arranged multi-view images.
It aims to increase the spatial resolution of source views and synthesize arbitrary
virtual high resolution (HR) views between them jointly. One feasible solution is
to perform super-resolution (SR) and view synthesis (VS) methods separately. However,
it cannot fully exploit the intra-relationship between SR and VS tasks. Intuitively,
multi-view images can provide more angular references, and higher resolution can provide
more high-frequency details. Therefore, we propose a one-stage space-angle super-resolution
network called SASRnet, which simultaneously synthesizes real and virtual HR views.
Extensive experiments on several benchmarks demonstrate that our proposed method outperforms
two-stage methods and show that SR and VS can promote each other. To our knowledge,
this work is the first to address the SASR problem for unstructured multi-view images
in an end-to-end learning-based manner.

Weakly-Supervised Video Object Grounding via Stable Context Learning

  • Wei Wang
  • Junyu Gao
  • Changsheng Xu

We investigate the problem of weakly-supervised video object grounding (WSVOG), where
only the video-sentence annotations are provided for training. It aims at localizing
the queried objects described in the sentence to visual regions in the video. Despite
the recent progress, existing approaches have not fully exploited the potential of
the description sentences for cross-modal alignment in two aspects: (1) Most of them
extract objects from the description sentences and represent them with fixed textual
representations. While achieving promising results, they do not make full use of the
contextual information in the sentence. (2) A few works have attempted to utilize
contextual information to learn object representations, but found a significant decrease
in performance due to the unstable training in cross-modal alignment. To address the
above issues, in this paper, we propose a Stable Context Learning (SCL) framework
for WSVOG which jointly enjoys the merits of stable learning and rich contextual information.
Specifically, we design two modules named Context-Aware Object Stabilizer module and
Cross-Modal Alignment Knowledge Transfer module, which cooperate to
inject contextual information to stable object concepts in text modality and transfer
contextualized knowledge in cross-modal alignment. Our approach is finally optimized
under a frame-level MIL paradigm. Extensive experiments on three popular benchmarks
demonstrate its significant effectiveness.

Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning

  • Yukun Su
  • Guosheng Lin
  • Ruizhou Sun
  • Yun Hao
  • Qingyao Wu

Self-supervised learning (SSL) has proven very effective in learning representations
from unlabeled data in language and vision domains. Yet, very few instrumental self-supervised
approaches exist for 3D skeleton action understanding, and directly applying the existing
SSL methods from other domains for skeleton action learning may suffer from misalignment
of representations and some limitations. In this paper, we consider that a good representation
learning encoder can distinguish the underlying features of different actions, which
can pull similar motions closer while pushing dissimilar motions apart. There
are, however, uncertainties in skeleton actions due to the inherent ambiguity
of 3D skeleton poses across viewpoints or the sampling algorithm in contrastive
learning; thus, it is ill-posed to differentiate the action features in a deterministic
embedding space. To address these issues, we rethink the distance between action features
and propose to model each action representation into the probabilistic embedding space
to alleviate the uncertainties upon encountering the ambiguous 3D skeleton inputs.
To validate the effectiveness of the proposed method, extensive experiments are conducted
on Kinetics, NTU60, NTU120, and PKUMMD datasets with several alternative network architectures.
Experimental evaluations demonstrate the superiority of our approach, which yields
significant performance improvements without using extra labeled data.

D³Net: Dual-Branch Disturbance Disentangling Network for Facial Expression Recognition

  • Rongyun Mo
  • Yan Yan
  • Jing-Hao Xue
  • Si Chen
  • Hanzi Wang

One of the main challenges in facial expression recognition (FER) is to address the
disturbance caused by various disturbing factors, including common ones (such as identity,
pose, and illumination) and potential ones (such as hairstyle, accessory, and occlusion).
Recently, a number of FER methods have been developed to explicitly or implicitly
alleviate the disturbance involved in facial images. However, these methods either
consider only a few common disturbing factors or neglect the prior information of
these disturbing factors, thus resulting in inferior recognition performance. In this
paper, we propose a novel Dual-branch Disturbance Disentangling Network (D3Net), mainly
consisting of an expression branch and a disturbance branch, to perform effective
FER. In the disturbance branch, a label-aware sub-branch (LAS) and a label-free sub-branch
(LFS) are elaborately designed to cope with different types of disturbing factors.
On the one hand, LAS explicitly captures the disturbance due to some common disturbing
factors by transfer learning on a pretrained model. On the other hand, LFS implicitly
encodes the information of potential disturbing factors in an unsupervised manner.
In particular, we introduce an Indian buffet process (IBP) prior to model the distribution
of potential disturbing factors in LFS. Moreover, we leverage adversarial training
to increase the differences between disturbance features and expression features,
thereby enhancing the disentanglement of disturbing factors. By disentangling the
disturbance from facial images, we are able to extract discriminative expression features.
Extensive experiments demonstrate that our proposed method performs favorably against
several state-of-the-art FER methods on both in-the-lab and in-the-wild databases.

Towards a Unified Middle Modality Learning for Visible-Infrared Person Re-Identification

  • Yukang Zhang
  • Yan Yan
  • Yang Lu
  • Hanzi Wang

Visible-infrared person re-identification (VI-ReID) aims to search identities of pedestrians
across different spectra. In this task, one of the major challenges is the modality
discrepancy between the visible (VIS) and infrared (IR) images. Some state-of-the-art
methods try to design complex networks or generative methods to mitigate the modality
discrepancy while ignoring the highly non-linear relationship between the two modalities
of VIS and IR. In this paper, we propose a non-linear middle modality generator (MMG),
which helps to reduce the modality discrepancy. Our MMG can effectively project VIS
and IR images into a unified middle modality image (UMMI) space to generate middle-modality
(M-modality) images. The generated M-modality images and the original images are fed
into the backbone network to reduce the modality discrepancy. Furthermore, in order
to pull together the two types of M-modality images generated from the VIS and IR
images in the UMMI space, we propose a distribution consistency loss (DCL) to make
the modality distribution of the generated M-modality images as consistent as possible.
Finally, we propose a middle modality network (MMN) to further enhance the discrimination
and richness of features in an explicit manner. Extensive experiments have been conducted
to validate the superiority of MMN for VI-ReID over some state-of-the-art methods
on two challenging datasets. The gain of MMN is more than 11.1% and 8.4% in terms
of Rank-1 and mAP, respectively, even compared with the latest state-of-the-art methods
on the SYSU-MM01 dataset.

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal
Knowledge Integration

  • Yuhao Cui
  • Zhou Yu
  • Chunqi Wang
  • Zhongzhou Zhao
  • Ji Zhang
  • Meng Wang
  • Jun Yu

Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful attempts have been proposed,
learning fine-grained semantic alignments between image-text pairs plays a key role
in their approaches. Nevertheless, most existing VLP approaches have not fully utilized
the intrinsic knowledge within the image-text pairs, which limits the effectiveness
of the learned alignments and further restricts the performance of their models. To
this end, we introduce a new VLP method called ROSITA, which integrates the cross-
and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
Specifically, we introduce a novel structural knowledge masking (SKM) strategy to
use the scene graph structure as a prior to perform masked language (region) modeling,
which enhances the semantic alignments by eliminating the interference information
within and across modalities. Extensive ablation studies and comprehensive analysis
verify the effectiveness of ROSITA in semantic alignments. Pretrained with both
in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art
VLP methods on three typical vision-and-language tasks over six benchmark datasets.

Object Point Cloud Classification via Poly-Convolutional Architecture Search

  • Xuanxiang Lin
  • Ke Chen
  • Kui Jia

Existing point cloud classifiers focus on handling irregular data structures to
discover a global and discriminative configuration of local geometries. These classification
methods design a number of effective permutation-invariant feature encoding kernels,
but still suffer from the intrinsic challenge of large geometric feature variations
caused by inconsistent point distributions along the object surface. In this paper, we
address point cloud classification via deep graph representation learning that aggregates
multiple convolutional feature kernels (namely, a poly-convolutional operation) anchored
on each point with its local neighbours. Inspired by the recent success of neural architecture
search, we introduce a novel concept of poly-convolutional architecture search (PolyConv
search for short) to model local geometric patterns in a more flexible manner.

To this end, the Monte Carlo Tree Search (MCTS) method is adopted, which can be formulated
as a Markov Decision Process problem to make decisions for dependently selecting
layer-wise aggregation kernels. Experiments on the popular ModelNet40 benchmark have
verified that superior performance can be achieved by constructing networks via the
MCTS method, with aggregation kernels in our PolyConv search space.

Semantic-Guided Relation Propagation Network for Few-shot Action Recognition

  • Xiao Wang
  • Weirong Ye
  • Zhongang Qi
  • Xun Zhao
  • Guangge Wang
  • Ying Shan
  • Hanzi Wang

Few-shot action recognition has drawn growing attention as it can recognize novel
action classes by using only a few labeled samples. In this paper, we propose a novel
semantic-guided relation propagation network (SRPN), which leverages semantic information
together with visual information for few-shot action recognition. Different from most
previous works that neglect semantic information in the labeled data, our SRPN directly
utilizes the semantic label as an additional supervisory signal to improve the generalization
ability of the network. Besides, we treat the relation of each visual-semantic pair
as a relational node, and we use a graph convolutional network to model and propagate
such sample relations across visual-semantic pairs, including both intra-class commonality
and inter-class uniqueness, to guide the relation propagation in the graph. Moreover,
since videos contain crucial sequence and ordering information, we propose a novel
spatial-temporal difference module, which helps the network enhance its
visual feature learning ability at both feature level and granular level for videos.
Extensive experiments conducted on several challenging benchmarks demonstrate that
our SRPN outperforms several state-of-the-art methods by a significant margin.

Anti-Distillation Backdoor Attacks: Backdoors Can Really Survive in Knowledge Distillation

  • Yunjie Ge
  • Qian Wang
  • Baolin Zheng
  • Xinlu Zhuang
  • Qi Li
  • Chao Shen
  • Cong Wang

Motivated by resource-limited scenarios, knowledge distillation (KD) has received
growing attention, effectively and quickly producing lightweight yet high-performance
student models by transferring the dark knowledge from large teacher models. However,
many pre-trained teacher models are downloaded from public platforms that lack necessary
vetting, posing a possible threat to knowledge distillation tasks. Unfortunately,
thus far, there has been little research to consider the backdoor attack from the
teacher model into student models in KD, which may pose a severe threat to its wide
use. In this paper, we, for the first time, propose a novel Anti-Distillation Backdoor
Attack (ADBA), in which the backdoor embedded in the public teacher model can survive
the knowledge distillation process and thus be transferred to secret distilled student
models. We first introduce a shadow to imitate the distillation process and adopt
an optimizable trigger to transfer information to help craft the desired teacher model.
Our attack is powerful and effective, achieving 95.92%, 94.79%, and 90.19% average
success rates of attacks (SRoAs) against student models with several different structures
on MNIST, CIFAR-10, and GTSRB, respectively. Our ADBA also performs robustly under
different user distillation environments with 91.72% and 92.37% average SRoAs on MNIST
and CIFAR-10, respectively. Finally, we show that the ADBA has a low overhead in the
injecting process, which converges in 50 and 70 epochs on CIFAR-10 and GTSRB, respectively,
while normal training on these datasets takes almost 200 epochs.
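
As background for the distillation setting described above, the following is a minimal sketch of the generic soft-target distillation loss (temperature-scaled softmax plus KL divergence). It is not the ADBA attack itself; the function names and the default temperature of 4.0 are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2 as is customary.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge, which is the channel through which teacher behavior reaches the student.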

One-stage Context and Identity Hallucination Network

  • Yinglu Liu
  • Mingcan Xiang
  • Hailin Shi
  • Tao Mei

Face swapping aims to synthesize a face image, in which the facial identity is well
transplanted from the source image and the context (e.g., hairstyle, head posture,
facial expression, lighting, and background) remains consistent with the reference image.
The prior work mainly accomplishes the task in two stages, i.e., generating the inner
face with the source identity, and then stitching the generation with the complementary
part of the reference image by image blending techniques. Using a blending mask, which
is usually obtained from an additional face segmentation model, is a common practice
towards photo-realistic face swapping. However, artifacts usually appear at the blending
boundary, especially in areas occluded by the hair, eyeglasses, accessories, etc.
To address this problem, rather than struggling with the blending mask in the two-stage
routine, we develop a novel one-stage context and identity hallucination network,
which learns a series of hallucination maps to softly divide the context areas and
identity areas. For context areas, the features are fully utilized by a multi-level
context encoder. For identity areas, we design a novel two-cascading AdaIN to transfer
the identity while retaining the context. Besides, with the help of hallucination
maps, we introduce an effectively improved reconstruction loss to utilize unlimited
unpaired face images for training. Our network performs well on both context areas
and identity areas without any dependency on post-processing. Extensive qualitative
and quantitative experiments demonstrate the superiority of our network.
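
The identity transfer above builds on adaptive instance normalization (AdaIN). The sketch below shows only the standard single AdaIN operation on one feature channel, not the two-cascading design of the paper; the function names, the flat-list representation of a channel, and the epsilon value are assumptions.

```python
import math

def mean_std(values, eps=1e-5):
    """Mean and (epsilon-stabilized) standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance + eps)

def adain(content_channel, style_mean, style_std, eps=1e-5):
    """Normalize the content channel, then re-scale and shift it with style statistics."""
    mean, std = mean_std(content_channel, eps)
    return [style_std * (v - mean) / std + style_mean for v in content_channel]
```

After the call, the channel keeps its spatial pattern (the context) while its mean and standard deviation match the supplied style statistics, which in a face-swapping setting would be derived from the identity features.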

Mitigating Generation Shifts for Generalized Zero-Shot Learning

  • Zhi Chen
  • Yadan Luo
  • Sen Wang
  • Ruihong Qiu
  • Jingjing Li
  • Zi Huang

Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information
to recognize seen and unseen samples, where unseen classes are not observable during
training. It is natural to derive generative models and hallucinate training samples
for unseen classes based on the knowledge learned from the seen samples. However,
most of these models suffer from the generation shifts, where the synthesized samples
may drift from the real distribution of unseen data. In this paper, we propose a novel
generative flow framework that consists of multiple conditional affine coupling layers
for learning unseen data generation. In particular, we identify three potential problems
that trigger the generation shifts, i.e., semantic inconsistency, variance collapse,
and structure disorder, and address them respectively. First, to reinforce the correlations
between the generated samples and their corresponding attributes, we explicitly embed
the semantic information into the transformations in each coupling layer. Second,
to recover the intrinsic variance of the real unseen features, we introduce a visual
perturbation strategy to diversify the generated data and thereby help adjust the decision
boundary of the classifiers. Third, a relative positioning strategy is proposed to
revise the attribute embeddings, guiding them to fully preserve the inter-class geometric
structure and further avoid structure disorder in the semantic space. Experimental
results demonstrate that GSMFlow achieves the state-of-the-art performance on GZSL.
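
To make the notion of a conditional affine coupling layer concrete, here is a minimal sketch of one such layer with an exact inverse. The function `toy_scale_shift` is a hypothetical stand-in for the learned scale and shift networks, and the way the condition (for example, a class attribute vector) enters it is an assumption, not the paper's design.

```python
import math

def toy_scale_shift(x1, condition, out_dim):
    """Hypothetical stand-in for learned networks s(x1, c) and t(x1, c)."""
    h = sum(x1) + sum(condition)
    log_scale = [math.tanh(0.1 * h + 0.05 * i) for i in range(out_dim)]
    shift = [0.5 * h - i for i in range(out_dim)]
    return log_scale, shift

def coupling_forward(x, condition):
    """y1 = x1; y2 = x2 * exp(s) + t. Returns y and the log-determinant of the Jacobian."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = toy_scale_shift(x1, condition, len(x2))
    y2 = [v * math.exp(s) + b for v, s, b in zip(x2, log_s, t)]
    return x1 + y2, sum(log_s)

def coupling_inverse(y, condition):
    """Exact inverse: x2 = (y2 - t) * exp(-s), with s and t recomputed from y1 = x1."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = toy_scale_shift(y1, condition, len(y2))
    x2 = [(v - b) * math.exp(-s) for v, s, b in zip(y2, log_s, t)]
    return y1 + x2
```

Because the first half passes through unchanged, the scale and shift can be recomputed during inversion, so the layer is invertible no matter how complex the conditioning networks are.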

Weakly-Supervised Temporal Action Localization via Cross-Stream Collaborative Learning

  • Yuan Ji
  • Xu Jia
  • Huchuan Lu
  • Xiang Ruan

Weakly supervised temporal action localization (WTAL) is a challenging task as only
video-level category labels are available during the training stage. Without precise temporal
annotations, most approaches rely on complementary RGB and optical flow features to
predict the start and end frame of each action category in a video. However, existing
approaches simply resort to either concatenation or weighted sum to learn how to take
advantage of these two modalities for accurate action localization, which ignores
the substantial variance between the two modalities. In this paper, we present Cross-Stream
Collaborative Learning (CSCL) to address these issues. The proposed CSCL introduces
a cross-stream weighting module to identify which modality is more robust during training
and takes advantage of the robust modality to guide the weaker one. Furthermore, we
suppress the snippets which have high action-ness scores in both modalities to further
exploit the complementary property between the two modalities. In addition, we bring
the concept of co-training to WTAL and take both modalities into account for pseudo
label generation to help train a stronger model. Extensive experiments conducted
on the THUMOS14 and ActivityNet datasets demonstrate that CSCL achieves favorable performance
against state-of-the-art methods.

Deep Interactive Video Inpainting: An Invisibility Cloak for Harry Potter

  • Cheng Chen
  • Jiayin Cai
  • Yao Hu
  • Xu Tang
  • Xinggang Wang
  • Chun Yuan
  • Xiang Bai
  • Song Bai

In this paper, we propose a new task of deep interactive video inpainting and an application
for users to interact with machines. To the best of our knowledge, this is the first deep
learning-based interactive video inpainting framework that only uses a free form of
user input as guidance (i.e., scribbles) instead of mask annotations, which has academic,
entertainment, and commercial value.

With users' scribbles on a certain frame, it simultaneously performs interactive video
object segmentation and video inpainting throughout the whole video. To achieve this,
we utilize a shared spatial-temporal memory module, which combines both segmentation
and inpainting into an end-to-end pipeline. In our framework, the past frames with
object masks (either the users' scribbles or the predicted masks) constitute an external
memory, and the current frame as the query is segmented and inpainted by reading the
visual cues stored in that memory. Furthermore, our method allows users to iteratively
refine the segmentation results, which effectively improves the inpainting performance
with frames where inferior segmentation results are witnessed. Hence, one could obtain
high-quality video inpainting results even with challenging video sequences. Qualitative
and quantitative experimental results demonstrate the superiority of our approach.

Searching Motion Graphs for Human Motion Synthesis

  • Chenchen Liu
  • Yadong Mu

This work proposes a graph search based method for human motion sequence synthesis,
complementing the modern generative model (e.g., variational auto-encoder or Gaussian
process) based solutions that currently dominate this task and showing strong advantages
in several aspects. The cornerstone of our method is a novel representation which
we dub the motion graph. Each motion graph is scaffolded by a set of realistic human
motion sequences (e.g., all training data in the Human3.6M benchmark). We devise a
scheme that adds transition edges across different motion sequences, enabling
longer and more diverse routes in the motion graph. Crucially, the proposed motion graph
bridges the problem of human motion synthesis with graph-oriented combinatorial optimization,
by naturally treating pre-specified starting or ending pose in human pose synthesis
as end-points of the retrieved graph path. Based on a jump-sensitive graph path search
algorithm proposed in this paper, our model can efficiently solve human motion completion
over the motion graphs. In contrast, existing methods are mainly effective for human
motion prediction and inadequate to impute missing sequences while jointly satisfying
the two constraints of pre-specified starting / ending poses. For the case of only
specifying the starting pose (i.e., human motion prediction), a forward graph walking
from the starting node is first performed to sample a diverse set of ending nodes
on the motion graph, each of which defines a motion completion problem. We conduct
comprehensive experiments on two large-scale benchmarks (Human3.6M and HumanEva-I).
The proposed method clearly proves to be superior in terms of several metrics, including
the diversity of generated human motion sequences, affinity to real poses, and cross-scenario
generalization.
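
The idea of treating motion completion as a path search between a starting and an ending pose can be sketched as follows. This uses plain Dijkstra search with an extra cost on transition ("jump") edges as a stand-in for the jump-sensitive search mentioned above; the edge representation, unit step cost, and `jump_penalty` value are all assumptions.

```python
import heapq

def find_motion_path(edges, start, goal, jump_penalty=5.0):
    """Cheapest path from start pose to goal pose.

    edges maps a node to a list of (neighbor, is_jump) pairs, where is_jump marks
    a transition edge between two different motion sequences (assumed encoding).
    """
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            path = []
            while node is not None:  # walk back through the parents
                path.append(node)
                node = parent[node]
            return path[::-1], cost
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, is_jump in edges.get(node, []):
            new_cost = cost + 1.0 + (jump_penalty if is_jump else 0.0)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None, float("inf")
```

Penalizing jumps makes the search prefer staying inside one recorded sequence and cross to another sequence only when the requested end pose cannot be reached otherwise.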

When Video Classification Meets Incremental Classes

  • Hanbin Zhao
  • Xin Qin
  • Shihao Su
  • Yongjian Fu
  • Zibo Lin
  • Xi Li

With the rapid development of social media, a tremendous number of videos with new classes are
generated daily, raising an urgent demand for video classification methods that
can continuously update new classes while maintaining the knowledge of old videos
with limited storage and computing resources. In this paper, we summarize this task
as Class-Incremental Video Classification (CIVC) and propose a novel framework to
address it. As a subarea of incremental learning tasks, the challenge of catastrophic
forgetting is unavoidable in CIVC. To better alleviate it, we utilize some characteristics
of videos. First, we decompose the spatio-temporal knowledge before distillation rather
than treating it as a whole in the knowledge transfer process; trajectory is also
used to refine the decomposition. Second, we propose a dual granularity exemplar selection
method to select and store representative video instances of old classes and key-frames
inside videos under a tight storage budget. We benchmark our method and previous SOTA
class-incremental learning methods on Something-Something V2 and Kinetics datasets,
and our method outperforms previous methods significantly.

Fast and Accurate Lane Detection via Frequency Domain Learning

  • Yulin He
  • Wei Chen
  • Zhengfa Liang
  • Dan Chen
  • Yusong Tan
  • Xin Luo
  • Chen Li
  • Yulan Guo

It is desirable to maintain both high accuracy and runtime efficiency in lane detection.
State-of-the-art methods mainly address the efficiency problem by direct compression
of high-dimensional features. These methods usually suffer from information loss and
cannot achieve satisfactory accuracy. To ensure the diversity of features
and subsequently retain as much information as possible, we introduce multi-frequency
analysis into lane detection. Specifically, we propose a multi-spectral feature compressor
(MSFC) based on two-dimensional (2D) discrete cosine transform (DCT) to compress features
while preserving diversity information. We group features and associate each group
with an individual frequency component, which incurs only 1/7 of the overhead of a one-dimensional
convolution operation but preserves more information. Moreover, to further enhance
the discriminability of features, we design a multi-spectral lane feature aggregator
(MSFA) based on one-dimensional (1D) DCT to aggregate features from each lane according
to their corresponding frequency components. The proposed method outperforms the state-of-the-art
methods (including LaneATT and UFLD) on TuSimple, CULane, and LLAMAS benchmarks. For
example, our method achieves 76.32% F1 at 237 FPS and 76.98% F1 at 164 FPS on CULane,
which is 1.23% and 0.30% higher than LaneATT. Our code and models are available at
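
The grouping of features by frequency component described above can be illustrated with a small sketch: each channel is reduced to a single scalar by taking one 2D DCT-II coefficient, and channels in the same group share the same frequency. The function names, the unnormalized DCT form, and the group-to-frequency assignment are assumptions; the paper's MSFC may differ in all of these.

```python
import math

def dct2_component(feature_map, u, v):
    """Unnormalized 2D DCT-II coefficient (u, v) of an H x W feature map."""
    height, width = len(feature_map), len(feature_map[0])
    total = 0.0
    for h in range(height):
        cos_h = math.cos(math.pi * (h + 0.5) * u / height)
        for w in range(width):
            cos_w = math.cos(math.pi * (w + 0.5) * v / width)
            total += feature_map[h][w] * cos_h * cos_w
    return total

def compress_channels(channels, frequencies):
    """Reduce each channel to one scalar using the frequency assigned to its group.

    Assumes there are at least as many channels as frequency components.
    """
    group_size = len(channels) // len(frequencies)
    compressed = []
    for idx, fmap in enumerate(channels):
        group = min(idx // group_size, len(frequencies) - 1)
        u, v = frequencies[group]
        compressed.append(dct2_component(fmap, u, v))
    return compressed
```

With the frequency (0, 0) the coefficient is just the sum of the map, so global average pooling is a special case; other frequencies respond to spatial variation that averaging would discard.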

Learning Multi-context Aware Location Representations from Large-scale Geotagged Images

  • Yifang Yin
  • Ying Zhang
  • Zhenguang Liu
  • Yuxuan Liang
  • Sheng Wang
  • Rajiv Ratn Shah
  • Roger Zimmermann

With the ubiquity of sensor-equipped smartphones, it is common to have multimedia
documents uploaded to the Internet that have GPS coordinates associated with them.
Utilizing such geotags as an additional feature is intuitively appealing for improving
the performance of location-aware applications. However, raw GPS coordinates are fine-grained
location indicators without any semantic information. Existing methods on geotag semantic
encoding mostly extract hand-crafted, application-specific location representations
that heavily depend on large-scale supplementary data and thus cannot perform efficiently
on mobile devices. In this paper, we present a machine learning based approach, termed
GPS2Vec+, which learns rich location representations by capitalizing on the world-wide
geotagged images. Once trained, the model no longer depends on the auxiliary data,
so it encodes geotags highly efficiently by inference. We extract visual and
semantic knowledge from image content and user-generated tags, and transfer the information
into locations by using geotagged images as a bridge. To adapt to different application
domains, we further present an attention-based fusion framework that estimates the
importance of the learnt location representations under different contexts for effective
feature fusion. Our location representations yield significant performance improvements
over the state-of-the-art geotag encoding methods on image classification and venue

MV-TON: Memory-based Video Virtual Try-on network

  • Xiaojing Zhong
  • Zhonghua Wu
  • Taizhe Tan
  • Guosheng Lin
  • Qingyao Wu

With the development of generative adversarial networks, image-based virtual try-on
methods have made great progress. However, limited work has explored the task of video-based
virtual try-on, although it is important in real-world applications. Most existing video-based
virtual try-on methods require clothing templates and can only generate
blurred and low-resolution results. To address these challenges, we propose a Memory-based
Video virtual Try-On Network (MV-TON), which seamlessly transfers desired clothes
to a target person without using any clothing templates and generates high-resolution
realistic videos. Specifically, MV-TON consists of two modules: 1) a try-on module
that transfers the desired clothes from model images to frame images by pose alignment
and region-wise replacing of pixels; 2) a memory refinement module that learns to
embed the existing generated frames into the latent space as external memory for the
following frame generation. Experimental results show the effectiveness of our method
in the video virtual try-on task and its superiority over other existing methods.

Token Shift Transformer for Video Classification

  • Hao Zhang
  • Yanbin Hao
  • Chong-Wah Ngo

The Transformer has achieved remarkable success in understanding 1- and 2-dimensional signals
(e.g., NLP and Image Content Understanding). As a potential alternative to convolutional
neural networks, it shares merits of strong interpretability, high discriminative
power on hyper-scale data, and flexibility in processing varying length inputs. However,
its encoders naturally contain computationally intensive operations such as pair-wise
self-attention, incurring a heavy computational burden when applied to complex
3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift),
a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within
each transformer encoder. Specifically, the TokShift merely temporally shifts partial
[Class] token features back-and-forth across adjacent frames. Then, we densely plug
the module into each encoder of a plain 2D vision transformer for learning 3D video
representation. It is worth noting that our TokShift transformer is a purely convolution-free
video transformer pilot with computational efficiency for video understanding. Experiments
on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly,
with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision:
79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets,
comparable to or better than existing SOTA convolutional counterparts. Our code is open-sourced
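
The shifting of partial [Class] token features across adjacent frames can be sketched in a few lines. The split into "shift forward", "shift backward", and "keep" channel groups, the `fold_div` parameter, and the zero padding at the clip boundaries are assumptions borrowed from common temporal-shift practice rather than details confirmed by the abstract.

```python
def token_shift(cls_tokens, fold_div=4):
    """Shift part of each frame's [Class] token along the time axis.

    cls_tokens is a list of T frames, each a list of C channel values.
    The first C // fold_div channels come from the previous frame, the next
    C // fold_div channels come from the following frame, and the rest stay put.
    """
    num_frames = len(cls_tokens)
    num_channels = len(cls_tokens[0])
    fold = num_channels // fold_div
    shifted = [list(frame) for frame in cls_tokens]
    for t in range(num_frames):
        for c in range(fold):  # take these channels from frame t - 1
            shifted[t][c] = cls_tokens[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):  # take these channels from frame t + 1
            shifted[t][c] = cls_tokens[t + 1][c] if t < num_frames - 1 else 0.0
    return shifted
```

The operation only moves existing values, so it adds no parameters and no multiply-add operations, which matches the zero-parameter, zero-FLOPs description above.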

Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation

  • Rui Wang
  • Jian Chen
  • Gang Yu
  • Li Sun
  • Changqian Yu
  • Changxin Gao
  • Nong Sang

Image manipulation with StyleGAN has been an increasing concern in recent years. Recent
works have achieved tremendous success in analyzing several semantic latent spaces
to edit the attributes of the generated images. However, due to the limited semantic
and spatial manipulation precision in these latent spaces, existing approaches
fall short in fine-grained StyleGAN image manipulation, i.e., local attribute translation.
To address this issue, we discover attribute-specific control units, which consist
of multiple channels of feature maps and modulation styles. Specifically, we collaboratively
manipulate the modulation style channels and feature maps in control units rather
than individual ones to obtain the semantic and spatial disentangled controls. Furthermore,
we propose a simple yet effective method to detect the attribute-specific control
units. We move the modulation style along a specific sparse direction vector and replace
the filter-wise styles used to compute the feature maps to manipulate these control
units. We evaluate our proposed method in various face attribute manipulation tasks.
Extensive qualitative and quantitative results demonstrate that our proposed method
performs favorably against the state-of-the-art methods. The manipulation results
of real images further show the effectiveness of our method.

Attention-driven Graph Clustering Network

  • Zhihao Peng
  • Hui Liu
  • Yuheng Jia
  • Junhui Hou

The combination of the traditional convolutional network (i.e., an auto-encoder) and
the graph convolutional network has attracted much attention in clustering, in which
the auto-encoder extracts the node attribute feature and the graph convolutional network
captures the topological graph feature. However, the existing works (i) lack a flexible
combination mechanism to adaptively fuse those two kinds of features for learning
the discriminative representation and (ii) overlook the multi-scale information embedded
at different layers for subsequent cluster assignment, leading to inferior clustering
results. To this end, we propose a novel deep clustering method named Attention-driven
Graph Clustering Network (AGCN). Specifically, AGCN exploits a heterogeneity-wise
fusion module to dynamically fuse the node attribute feature and the topological graph
feature. Moreover, AGCN develops a scale-wise fusion module to adaptively aggregate
the multi-scale features embedded at different layers. Based on a unified optimization
framework, AGCN can jointly perform feature learning and cluster assignment in an
unsupervised fashion. Compared with the existing deep clustering methods, our method
is more flexible and effective since it comprehensively considers the rich and
discriminative information embedded in the network and directly produces the clustering
results. Extensive quantitative and qualitative results on commonly used benchmark
datasets validate that our AGCN consistently outperforms state-of-the-art methods.

Lifting the Veil of Frequency in Joint Segmentation and Depth Estimation

  • Tianhao Fu
  • Yingying Li
  • Xiaoqing Ye
  • Xiao Tan
  • Hao Sun
  • Fumin Shen
  • Errui Ding

Joint learning of scene parsing and depth estimation remains a challenging task due
to the rivalry between the two tasks. In this paper, we revisit the mutual enhancement
for joint semantic segmentation and depth estimation. Inspired by the observation
that the competition and cooperation could be reflected in the feature frequency components
of different tasks, we propose a Frequency Aware Feature Enhancement (FAFE) network
that can effectively enhance the reciprocal relationship while avoiding the competition.
In FAFE, a frequency disentanglement module is proposed to fetch the favorable frequency
component sets for each task and resolve the discordance between the two tasks. For
task cooperation, we introduce a re-calibration unit to aggregate features of the
two tasks, so as to complement task information with each other. Accordingly, the
learning of each task can be appropriately boosted by the complementary task. In addition,
a novel local-aware consistency loss function is imposed on the predicted
segmentation and depth so as to strengthen the cooperation. With the FAFE network
and new local-aware consistency loss encapsulated into the multi-task learning network,
the proposed approach achieves superior performance over previous state-of-the-art
methods. Extensive experiments and ablation studies on multi-task datasets demonstrate
the effectiveness of our proposed approach.

SESSION: Panel 1

The Next Generation Multimodal Conversational Search and Recommendation

  • Joao Magalhaes
  • Tat-Seng Chua
  • Tao Mei
  • Alan Smeaton

The world has become multimodal. In addition to text, we have been sharing a huge
amount of multimedia information in the form of images and videos on the Internet.
The widespread use of smart mobile devices has also changed the way we interact with
the Internet. It is now natural for us to capture images and videos freely and use
them as part of a query, in addition to traditional text and voice. These, along with
the rapid advancements in multimedia, natural language processing, information retrieval,
and conversation technologies, mean that it is time for us to explore multimodal conversation
and its roles in search and recommendation. Multimodal conversation has the potential
to help us to uncover and digest the huge amount of multimedia information and knowledge
hidden within many systems. It also enables natural two-way interaction between humans
and machines, with mutual benefits in enriching their respective knowledge. Finally,
it opens up the possibilities of disrupting many existing applications and launching
new innovative applications. This panel is timely and aims to explore this emerging
trend, and discuss its potential benefits and pitfalls to society. The panel will
also explore the limitations of current technologies and highlight future research
directions towards developing a multimedia conversational system.

SESSION: Session 8: Emerging Multimedia Applications-IV

VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial
Point Clouds

  • Guanze Liu
  • Yu Rong
  • Lu Sheng

3D human mesh recovery from point clouds is essential for various tasks, including
AR/VR and human behavior understanding. Previous works in this field either require
high-quality 3D human scans or sequential point clouds, which cannot be easily applied
to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we
make the first attempt to reconstruct reliable 3D human shapes from single-frame partial
point clouds. To achieve this, we propose an end-to-end learnable method, named VoteHMR.
The core of VoteHMR is a novel occlusion-aware voting network that can first reliably
produce visible joint-level features from the input partial point clouds, and then
complete the joint-level features through the kinematic tree of the human skeleton.
Compared with holistic features used by previous works, the joint-level features can
not only effectively encode the human geometry information but also be robust to noisy
inputs with self-occlusions and missing areas. By exploiting the rich complementary
clues from the joint-level features and global features from the input point clouds,
the proposed method encourages reliable and disentangled parameter predictions for
statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art
performances on two large-scale datasets, namely SURREAL and DFAUST. Furthermore,
VoteHMR also demonstrates superior generalization ability on real-world datasets,
such as Berkeley MHAD.

MageAdd: Real-Time Interaction Simulation for Scene Synthesis

  • Shao-Kui Zhang
  • Yi-Xiao Li
  • Yu He
  • Yong-Liang Yang
  • Song-Hai Zhang

While recent research on computational 3D scene synthesis has achieved impressive
results, automatically synthesized scenes do not guarantee the satisfaction of end users.
On the other hand, manual scene modelling can always ensure high quality, but requires
a cumbersome trial-and-error process. In this paper, we bridge the above gap by presenting
a data-driven 3D scene synthesis framework that can intelligently infer objects to
add to the scene by incorporating and simulating user preferences with minimum input. While
the cursor is moved and clicked in the scene, our framework automatically selects
and transforms suitable objects into scenes in real time. This is based on priors
learnt from the dataset for placing different types of objects, and updated according
to the current scene context. Through extensive experiments we demonstrate that our
framework outperforms the state-of-the-art on result aesthetics, and enables effective
and efficient user interactions.

Cross-View Exocentric to Egocentric Video Synthesis

  • Gaowen Liu
  • Hao Tang
  • Hugo M. Latapie
  • Jason J. Corso
  • Yan Yan

The cross-view video synthesis task seeks to generate video sequences of one view from
another dramatically different view. In this paper, we investigate the exocentric
(third-person) view to egocentric (first-person) view video generation task. This
is challenging because the egocentric view is sometimes remarkably different from the
exocentric view. Thus, transforming the appearances across the two different views
is a non-trivial task. In particular, we propose a novel Bi-directional Spatial Temporal
Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and
temporal information to generate egocentric video sequences from the exocentric view.
The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and
attention fusion. First, the temporal and spatial branches generate a sequence of
fake frames and their corresponding features. The fake frames are generated in both
downstream and upstream directions for both temporal and spatial branches. Next, the
generated four different fake frames and their corresponding features (spatial and
temporal branches in two directions) are fed into a novel multi-generation attention
fusion module to produce the final video sequence. Meanwhile, we also propose a novel
temporal and spatial dual-discriminator for more robust network optimization. Extensive
experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly
outperforms the existing methods.

EVRNet: Efficient Video Restoration on Edge Devices

  • Sachin Mehta
  • Amit Kumar
  • Fitsum Reda
  • Varun Nasery
  • Vikram Mulukutla
  • Rakesh Ranjan
  • Vikas Chandra

In video transmission applications, video signals are transmitted over lossy channels,
resulting in low-quality received signals. To restore videos on recipient edge devices
in real-time, we introduce an efficient video restoration network, EVRNet. EVRNet
efficiently allocates parameters inside the network using alignment, differential,
and fusion modules. With extensive experiments on different video restoration tasks
(deblocking, denoising, and super-resolution), we demonstrate that EVRNet delivers
competitive performance to existing methods with significantly fewer parameters and
MACs. For example, EVRNet has 260× fewer parameters and 958× fewer MACs than the enhanced
deformable convolution-based video restoration network (EDVR) for 4× video super-resolution,
while its SSIM score is 0.018 lower than that of EDVR. We also evaluate the performance of
EVRNet under multiple distortions on an unseen dataset to demonstrate its ability in
modeling variable-length sequences under both camera and object motion.

Multimodal Entity Linking: A New Dataset and A Baseline

  • Jingru Gan
  • Jinchang Luo
  • Haiwei Wang
  • Shuhui Wang
  • Wei He
  • Qingming Huang

In this paper, we introduce a new Multimodal Entity Linking (MEL) task on the multimodal
data. The MEL task discovers entities in multiple modalities and various forms within
large-scale multimodal data and maps multimodal mentions in a document to entities
in a structured knowledge base such as Wikipedia. Different from the conventional
Neural Entity Linking (NEL) task that focuses on textual information solely, MEL aims
at achieving human-level disambiguation among entities in images, texts, and knowledge
bases. Due to the lack of sufficient labeled data for the MEL task, we release a large-scale
multimodal entity linking dataset M3EL (short for MultiModal Movie Entity Linking).
Specifically, we collect reviews and images of 1,100 movies, extract textual and visual
mentions, and label them with entities registered in Wikipedia. In addition, we construct
a new baseline method to solve the MEL problem, which models the alignment of textual
and visual mentions as a bipartite graph matching problem and solves it with an optimal-transportation-based
linking method. Extensive experiments on the M3EL dataset verify the quality of the
dataset and the effectiveness of the proposed method. We envision this work to be
helpful for soliciting more research effort and applications regarding multimodal
computing and inference in the future. We make the dataset and the baseline algorithm
publicly available at

AI-Lyricist: Generating Music and Vocabulary Constrained Lyrics

  • Xichu Ma
  • Ye Wang
  • Min-Yen Kan
  • Wee Sun Lee

We propose AI-Lyricist: a system to generate novel yet meaningful lyrics given a required
vocabulary and a MIDI file as inputs. This task involves multiple challenges, including
automatically identifying the melody and extracting a syllable template from multi-channel
music, generating creative lyrics that match the input music's style and syllable
alignment, and satisfying vocabulary constraints. To address these challenges, we
propose an automatic lyrics generation system consisting of four modules: (1) a music
structure analyzer to derive the musical structure and syllable template from a given
MIDI file, utilizing the concept of expected syllable number to better identify the
melody, (2) a SeqGAN-based lyrics generator optimized by multi-adversarial training
through policy gradients with twin discriminators for text quality and syllable alignment,
(3) a deep coupled music-lyrics embedding model to project music and lyrics into a
joint space to allow fair comparison of both melody and lyric constraints, and (4) a module
called Polisher, to satisfy vocabulary constraints by applying a mask to the generator
and substituting the words to be learned. We trained our model on a dataset of over
7,000 music-lyrics pairs, enhanced with manually annotated labels in terms of theme,
sentiment and genre. Both objective and subjective evaluations show AI-Lyricist's
superior performance against the state-of-the-art for the proposed tasks.

SESSION: Session 9: Emotional and Social Signals in Multimedia

CaFGraph: Context-aware Facial Multi-graph Representation for Facial Action Unit Recognition

  • Yingjie Chen
  • Diqi Chen
  • Yizhou Wang
  • Tao Wang
  • Yun Liang

Facial action unit (AU) recognition has attracted increasing attention due to its
indispensable role in affective computing, especially in the field of affective human-computer
interaction. Due to the subtle and transient nature of AU, it is challenging to capture
the delicate and ambiguous motions in local facial regions among consecutive frames.
Considering that context is essential to resolve ambiguity in the human visual system,
modeling context within or among facial images emerges as a promising approach for
the AU recognition task. To this end, we propose CaFGraph, a novel context-aware facial
multi-graph that can model both morphological and muscular-based region-level local
context and region-level temporal context. CaFGraph is the first work to construct
a universal facial multi-graph structure that is independent of both task settings
and dataset statistics for almost all fine-grained facial behavior analysis tasks,
including but not limited to AU recognition. To make full use of the context, we then
present CaFNet that learns context-aware facial graph representations via CaFGraph
from facial images for multi-label AU recognition. Experiments on two widely used
benchmark datasets, BP4D and DISFA, demonstrate the superiority of our CaFNet over
the state-of-the-art methods.

Self-Supervised Regional and Temporal Auxiliary Tasks for Facial Action Unit Recognition

  • Jingwei Yan
  • Jingjing Wang
  • Qiang Li
  • Chunmao Wang
  • Shiliang Pu

Automatic facial action unit (AU) recognition is a challenging task due to the scarcity
of manual annotations. To alleviate this problem, considerable effort has been
dedicated to exploiting various methods which leverage large amounts of unlabeled data. However,
many aspects with regard to some unique properties of AUs, such as the regional and
relational characteristics, are not sufficiently explored in previous works. Motivated
by this, we take the AU properties into consideration and propose two auxiliary AU
related tasks to bridge the gap between limited annotations and the model performance
in a self-supervised manner via the unlabeled data. Specifically, to enhance the discrimination
of regional features with AU relation embedding, we design a task of RoI inpainting
to recover the randomly cropped AU patches. Meanwhile, a single image based optical
flow estimation task is proposed to leverage the dynamic change of facial muscles
and encode the motion information into the global feature representation. Based on
these two self-supervised auxiliary tasks, local features, mutual relation and motion
cues of AUs are better captured in the backbone network with the proposed regional
and temporal based auxiliary task learning (RTATL) framework. Extensive experiments
on BP4D and DISFA demonstrate the superiority of our method and new state-of-the-art
performances are achieved.

HetEmotionNet: Two-Stream Heterogeneous Graph Recurrent Neural Network for Multi-modal Emotion Recognition

  • Ziyu Jia
  • Youfang Lin
  • Jing Wang
  • Zhiyang Feng
  • Xiangheng Xie
  • Caijie Chen

The research on human emotion under multimedia stimulation based on physiological
signals is an emerging field and important progress has been achieved for emotion
recognition based on multi-modal signals. However, it is challenging to make full
use of the complementarity among spatial-spectral-temporal domain features for emotion
recognition, as well as model the heterogeneity and correlation among multi-modal
signals. In this paper, we propose a novel two-stream heterogeneous graph recurrent
neural network, named HetEmotionNet, fusing multi-modal physiological signals for
emotion recognition. Specifically, HetEmotionNet consists of the spatial-temporal
stream and the spatial-spectral stream, which can fuse spatial-spectral-temporal domain
features in a unified framework. Each stream is composed of the graph transformer
network for modeling the heterogeneity, the graph convolutional network for modeling
the correlation, and the gated recurrent unit for capturing the temporal domain or
spectral domain dependency. Extensive experiments on two real-world datasets demonstrate
that our proposed model achieves better performance than state-of-the-art baselines.

Simplifying Multimodal Emotion Recognition with Single Eye Movement Modality

  • Xu Yan
  • Li-Ming Zhao
  • Bao-Liang Lu

Multimodal emotion recognition has long been a popular topic in affective computing
since it significantly enhances the performance compared with that of a single modality.
Among all, the combination of electroencephalography (EEG) and eye movement signals
is one of the most attractive practices due to their complementarity and objectivity.
However, the high cost and inconvenience of EEG signal acquisition severely hamper
the popularization of multimodal emotion recognition in practical scenarios, while
eye movement signals are much easier to acquire. To increase the feasibility and the
generalization ability of emotion decoding without compromising the performance, we
propose a generative adversarial network-based framework. In our model, a single modality
of eye movements is used as input and it is capable of mapping the information onto
multimodal features. Experimental results on SEED series datasets with different emotion
categories demonstrate that our model with multimodal features generated by the single
eye movement modality maintains competitive accuracies compared to those with multimodality
input and drastically outperforms those single-modal emotion classifiers. This illustrates
that the model has the potential to reduce the dependence on multimodalities without
sacrificing performance, which makes emotion recognition more applicable and practicable.

Learning What and When to Drop: Adaptive Multimodal and Contextual Dynamics for Emotion Recognition in Conversation

  • Feiyu Chen
  • Zhengxiao Sun
  • Deqiang Ouyang
  • Xueliang Liu
  • Jie Shao

Multi-sensory data has exhibited a clear advantage in expressing richer and more complex
feelings in the Emotion Recognition in Conversation (ERC) task. Yet, current methods
for multimodal dynamics that aggregate modalities or employ additional modality-specific
and modality-shared networks are still inadequate in balancing between the sufficiency
of multimodal processing and the scalability to incremental multi-sensory data type
additions. This creates a bottleneck for improving ERC performance. To this end,
we present MetaDrop, a differentiable and end-to-end approach for the ERC task that
learns module-wise decisions across modalities and conversation flows simultaneously,
which supports adaptive information-sharing patterns and dynamic fusion paths. Our
framework mitigates the problem of modelling complex multimodal relations while ensuring
it enjoys good scalability to the number of modalities. Experiments on two popular
multimodal ERC datasets show that MetaDrop achieves new state-of-the-art results.

Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network

  • Fan Qi
  • Xiaoshan Yang
  • Changsheng Xu

Recognizing human emotions from videos has attracted significant attention in numerous
computer vision and multimedia applications, such as human-computer interaction and
health care. It aims to understand the emotional response of humans, where candidate
emotion categories are generally defined by specific psychological theories. However,
with the development of psychological theories, emotion categories become increasingly
diverse and fine-grained, and samples are also increasingly difficult to collect. In this
paper, we investigate a new task of zero-shot video emotion recognition, which aims
to recognize rare unseen emotions. Specifically, we propose a novel multimodal protagonist-aware
transformer network, which is composed of two branches: one is equipped with a novel
dynamic emotional attention mechanism and a visual transformer to learn better visual
representations; the other is an acoustic transformer for learning discriminative
acoustic representations. We manage to align the visual and acoustic representations
with semantic embeddings of fine-grained emotion labels through jointly mapping them
into a common space under a noise contrastive estimation objective. Extensive experimental
results on three datasets demonstrate the effectiveness of the proposed method.
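
As a rough illustration of aligning a representation with label embeddings under a contrastive objective, the sketch below computes an InfoNCE-style loss, one common member of the noise contrastive estimation family. It is not the authors' implementation: the function name, the dot-product similarity, and the temperature value are all assumptions made for this example.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product, used here as the similarity measure (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    positive_index: int,
    temperature: float = 0.1,  # hypothetical value chosen for illustration
) -> float:
    """InfoNCE-style loss: negative log-softmax score of the positive candidate.

    `query` stands in for a video (visual or acoustic) embedding and
    `candidates` for label embeddings living in the same common space.
    """
    logits = [dot(query, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


if __name__ == "__main__":
    video = [1.0, 0.0]
    labels = [[1.0, 0.0], [0.0, 1.0]]
    # The loss is smaller when the matching label is treated as the positive.
    print(info_nce_loss(video, labels, positive_index=0))
    print(info_nce_loss(video, labels, positive_index=1))
```

Minimizing such a loss pulls an embedding toward its matching label embedding and pushes it away from the others, which is the general idea behind mapping several modalities into one common space.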

SESSION: Session 10: Industrial Track

Show, Read and Reason: Table Structure Recognition with Flexible Context Aggregator

  • Hao Liu
  • Xin Li
  • Bing Liu
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren
  • Rongrong Ji

We investigate the challenging problem of table structure recognition in this work.
Many recent methods adopt a graph-based context aggregator with strong inductive bias
to reason about sparse contextual relationships of table elements. However, the strong constraints
may be too restrictive to represent the complicated table relationships. In order
to learn more appropriate inductive bias from data, we introduce a Transformer
as the context aggregator in this work. Nevertheless, a Transformer taking dense context
as input requires larger-scale data and may suffer from an unstable training procedure
due to the weakening of inductive bias. To overcome the above limitations, we
design FLAG (FLexible context AGgregator), which marries the Transformer with
a graph-based context aggregator in an adaptive way. Based on FLAG, an end-to-end framework
requiring no extra meta-data or OCR information, termed FLAG-Net, is proposed to flexibly
modulate the aggregation of dense and sparse context for the relational reasoning
of table elements. We investigate the modulation pattern in FLAG and show which contextual
information it focuses on, which is vital for recognizing table structure. Extensive
experimental results on benchmarks demonstrate that our proposed FLAG-Net
surpasses other compared methods by a large margin.

TransFusion: Multi-Modal Fusion for Video Tag Inference via Translation-based Knowledge

  • Di Jin
  • Zhongang Qi
  • Yingmin Luo
  • Ying Shan

Tag inference is an important task in the business of video platforms with wide applications
such as recommendation, interpretation, and more. Existing works are mainly based
on extracting video information from multiple modalities such as frames or music,
and then inferring tags through classification or object detection. This, however, does
not apply to inferring generic tags or taxonomy that are less relevant to video content,
such as video originality or its broader category, which are important in practice.
In this paper, we claim that these generic tags can be modeled through the semantic
relations between videos and tags, and can be utilized simultaneously with the multi-modal
features to achieve better video tagging. We propose TransFusion, an end-to-end supervised
learning framework that fuses multi-modal embeddings (e.g., vision, audio, texts,
etc.) with the knowledge embedding to derive the video representation. To infer the
diverse tags following heterogeneous relations, TransFusion adopts a dual attentive
approach to learn both the modality importance in fusion and relation importance in
inference. Besides, it is general enough and can be used with the existing translation-based
knowledge embedding approaches. Extensive experiments show that TransFusion outperforms
the baseline methods with lowered mean rank and at least 9.59% improvement in HITS@10
on the real-world video knowledge graph.
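
For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how mean rank and HITS@k are conventionally computed from the rank of the correct tag in each test query. The function names and the sample ranks are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-based rank of the correct answer; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks of the true tag for five test queries.
    ranks = [1, 3, 12, 7, 25]
    print(mean_rank(ranks))      # 9.6
    print(hits_at_k(ranks, 10))  # 0.6
```

A method can improve one metric without the other: mean rank is sensitive to a few badly ranked answers, while HITS@10 only counts whether each answer reaches the top ten.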

RecycleNet: An Overlapped Text Instance Recovery Approach

  • Yiqing Hu
  • Yan Zheng
  • Xinghua Jiang
  • Hao Liu
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren
  • Rongrong Ji

Text recognition is the key pillar for many real-world multimedia applications. Existing
text recognition approaches focus on recognizing isolated instances, whose text fields
are visually separated and have no interference with each other. Moreover, these approaches
cannot handle overlapped instances that often appear in sheets like invoices, receipts
and math exercises, where printed templates are generated beforehand and extra contents
are added afterward on existing texts. In this paper, we aim to tackle this problem
by proposing RecycleNet, which automatically extracts and reconstructs overlapped
instances by fully recycling the intersecting pixels that used to be obstacles for
recognition. RecycleNet runs in parallel to existing recognition systems and serves as a
plug-and-play module to boost recognition performance with zero effort. We also release
the OverlapText-500 dataset, which helps to boost the design of better overlapped text
recovery and recognition solutions.

ARShoe: Real-Time Augmented Reality Shoe Try-on System on Smartphones

  • Shan An
  • Guangfu Che
  • Jinghao Guo
  • Haogang Zhu
  • Junjie Ye
  • Fangru Zhou
  • Zhaoqi Zhu
  • Dong Wei
  • Aishan Liu
  • Wei Zhang

Virtual try-on technology enables users to try various fashion items using augmented
reality and provides a convenient online shopping experience. However, most previous
works focus on the virtual try-on for clothes while neglecting that for shoes, which
is also a promising task. To address this, this work proposes a real-time augmented
reality virtual shoe try-on system for smartphones, namely ARShoe. Specifically, ARShoe
adopts a novel multi-branch network to realize pose estimation and segmentation simultaneously.
A solution to generate realistic 3D shoe model occlusion during the try-on process
is presented. To achieve a smooth and stable try-on effect, this work further develops
a novel stabilization method. Moreover, for training and evaluation, we construct
the very first large-scale foot benchmark with multiple virtual shoe try-on task-related
labels annotated. Extensive experiments on our newly constructed benchmark demonstrate
the satisfactory performance of ARShoe. Practical tests on common smartphones validate
the real-time performance and stabilization of the proposed approach.

Inferring the Importance of Product Appearance with Semi-supervised Multi-modal Enhancement: A Step Towards the Screenless Retailing

  • Yongshun Gong
  • Jinfeng Yi
  • Dong-Dong Chen
  • Jian Zhang
  • Jiayu Zhou
  • Zhihua Zhou

Nowadays, almost all online orders are placed through screened devices such as
mobile phones, tablets, and computers. With the rapid development of the Internet
of Things (IoT) and smart appliances, more and more screenless smart devices, e.g.,
smart speakers and smart refrigerators, appear in our daily lives. They open up new
means of interaction and may provide an excellent opportunity to reach new customers
and increase sales. However, not all the items are suitable for screenless shopping,
since some items' appearance plays an important role in consumer decision making. Typical
examples include clothes, dolls, bags, and shoes. In this paper, we aim to infer the
significance of every item's appearance in consumer decision making and identify the
group of items that are suitable for screenless shopping. Specifically, we formulate
the problem as a classification task that predicts if an item's appearance has a significant
impact on people's purchase behavior. To solve this problem, we extract multi-modal
features from three different views, and collect a set of necessary labels via crowdsourcing.
We then propose an iterative semi-supervised learning framework with a carefully designed
multi-modal enhancement module. Experimental results verify the effectiveness of the
proposed method.

AsyNCE: Disentangling False-Positives for Weakly-Supervised Video Grounding

  • Cheng Da
  • Yanhao Zhang
  • Yun Zheng
  • Pan Pan
  • Yinghui Xu
  • Chunhong Pan

Weakly-supervised video grounding has been investigated to ground textual phrases in
video content with only video-sentence pairs provided during training, since
bounding box annotations are prohibitively costly. Existing methods cast this task
into a frame-level multiple instance learning (MIL) problem with the ranking loss.
However, an object might appear sparsely across multiple frames, causing uncertain false-positive
frames. Thus, directly computing the average loss of all frames is inadequate in the video
domain. Moreover, the positive and negative pairs are equally coupled in the ranking
loss, so that it is impossible to handle false-positive frames individually. Additionally,
the naive inner product is suboptimal as a cross-domain similarity measure.
To solve these issues, we propose a novel AsyNCE loss to flexibly disentangle the
positive pairs from negative ones in frame-level MIL, which allows for mitigating
the uncertainty of false-positive frames effectively. Besides, a cross-modal transformer
block is introduced to purify the text feature by image frame context, generating
a visual-guided text feature for better similarity measure. Extensive experiments
on YouCook2, RoboWatch and WAB datasets demonstrate the superiority and robustness
of our method over state-of-the-art methods.

Unifying Multimodal Transformer for Bi-directional Image and Text Generation

  • Yupan Huang
  • Hongwei Xue
  • Bei Liu
  • Yutong Lu

We study the joint learning of image-to-text and text-to-image generations, which
are naturally bi-directional tasks. Existing works typically design a separate task-specific
model for each task, which imposes expensive design effort. In this work, we propose
a unified image-and-text generative framework based on a single multimodal model to
jointly study the bi-directional tasks. We adopt Transformer as our unified architecture
for its strong performance and task-agnostic design. Specifically, we formulate both
tasks as sequence generation tasks, where we represent images and text as unified
sequences of tokens, and the Transformer learns multimodal interactions to generate
sequences. We further propose two-level granularity feature representations and sequence-level
training to improve the Transformer-based unified framework. Experiments show that
our approach significantly improves the FID of the previous Transformer-based model X-LXMERT
from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves the CIDEr-D
score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO
dataset. Our code is available online.

Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba

  • Lianghua Huang
  • Yu Liu
  • Xiangzeng Zhou
  • Ansheng You
  • Ming Li
  • Bin Wang
  • Yingya Zhang
  • Pan Pan
  • Xu Yinghui

Video has grown to be one of the largest media on the Internet. E-commerce platforms
like Alibaba need to process millions of videos across multiple modalities (e.g., visual,
audio, image, and text) and on a variety of tasks (e.g., retrieval, tagging, and summary)
every day. In this work, we aim to develop a once-and-for-all pretraining technique
for diverse modalities and downstream tasks. To achieve this, we make the following
contributions: (1) We propose a self-supervised multi-modal co-training framework.
It takes cross-modal pseudo-label consistency as the supervision and can jointly learn
representations of multiple modalities. (2) We introduce several novel techniques
(e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal
convolution and parallel data transmission and processing) to optimize the training
process, making billion-scale stable training feasible. (3) We construct a large-scale
multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework
on it. The training takes only 4.6 days on an in-house 256-GPU cluster, and it simultaneously
produces pretrained video, audio, image, motion, and text networks. (4) Finetuning
from our pretrained models, we obtain significant performance gains and faster convergence
on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned
representation on public datasets. Despite the domain gap between our commodity-centric
pretraining and the action-centric evaluation data, we show superior results against

L2RS: A Learning-to-Rescore Mechanism for Hybrid Speech Recognition

  • Yuanfeng Song
  • Di Jiang
  • Xuefang Zhao
  • Qian Xu
  • Raymond Chi-Wing Wong
  • Lixin Fan
  • Qiang Yang

This paper aims to advance the performance of industrial ASR systems by exploring
a more effective method for N-best rescoring, a critical step that greatly affects
the final recognition accuracy. Existing rescoring approaches suffer from the following
issues: (i) limited performance since they optimize an unnecessarily harder problem,
namely predicting accurate grammatical legitimacy scores of the N-best hypotheses
rather than directly predicting their partial orders regarding a specific acoustic
input; (ii) hard to incorporate various information by advanced natural language processing
(NLP) models such as BERT to achieve a comprehensive evaluation of each N-best candidate.
To relieve the above drawbacks, we propose a simple yet effective mechanism, Learning-to-Rescore
(L2RS), to empower ASR systems with state-of-the-art information retrieval (IR) techniques.
Specifically, L2RS utilizes a wide range of textual information from the state-of-the-art
NLP models and automatically deciding their weights to directly learn the ranking
order of each N-best hypothesis with respect to a specific acoustic input. We incorporate
various features including BERT sentence embeddings, the topic vectors, and perplexity
scores produced by an n-gram language model (LM), topic modeling LM, BERT, and RNNLM
to train the rescoring model. Experimental results on a public dataset show that L2RS
outperforms not only traditional rescoring methods but also its deep neural network
counterparts by a substantial margin of 20.85% in terms of NDCG@10. The L2RS toolkit
has been successfully deployed for many online commercial services in WeBank Co.,
Ltd, China's leading digital bank. The efficacy and applicability of L2RS are validated
by real-life online customer datasets.
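
The NDCG@10 metric quoted above can be illustrated with a minimal sketch. It assumes a linear gain (the relevance grade itself) and a log2 rank discount; the abstract does not state which gain formulation the authors use, and the relevance grades below are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of a 5-best list, in the order a rescorer ranked them
ranked_relevances = [3, 2, 3, 0, 1]
score = ndcg_at_k(ranked_relevances, k=10)
```

A ranking that already places hypotheses in decreasing order of relevance scores 1.0; any misordering lowers the score, which is why the metric suits a rescorer that learns orderings rather than absolute scores.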

Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports
Videos Understanding

  • Avijit Shah
  • Topojoy Biswas
  • Sathish Ramadoss
  • Deven Santosh Shah

Comprehensive understanding of key players and actions in multiplayer sports broadcast
videos is a challenging problem. Unlike news or finance videos, sports videos have
limited text. While both action recognition for multiplayer sports and detection of
players have seen robust research, understanding contextual text in video frames
remains one of the most impactful avenues of sports video understanding. In this work
we study extremely accurate semantic text detection and recognition in sports clocks,
and the challenges therein. We observe unique properties of sports clocks that make
it hard to use general-purpose pre-trained detectors and recognizers to understand
the text accurately enough to align it with external
knowledge. We propose a novel distant supervision technique to automatically build
sports clock datasets. Combining suitable data augmentations with any state-of-the-art
text detection and recognition model architecture, we extract extremely accurate
semantic text. Finally, we share our computational architecture pipeline for scaling
this system in an industrial setting and propose a robust dataset to validate
our results.

Focusing on Persons: Colorizing Old Images Learning from Modern Historical Movies

  • Xin Jin
  • Zhonglan Li
  • Ke Liu
  • Dongqing Zou
  • Xiaodong Li
  • Xingfan Zhu
  • Ziyin Zhou
  • Qilong Sun
  • Qingyu Liu

In industry, there are many scenarios in which old grayscale photos need to be automatically
colorized, such as video sites and archives. In this paper, we present HistoryNet,
which focuses on diverse, high-fidelity clothing colorization for historical persons based
on fine-grained semantic understanding and priors. Colorization of historical persons
is realistic and practical; however, existing methods do not perform well in this regard.
HistoryNet comprises three parts: classification, fine-grained
semantic parsing, and colorization. The classification sub-module
classifies images according to era, nationality, and garment type; the parsing
sub-network supplies the semantics of person contours, clothing, and background in
the image to achieve more accurate colorization of clothes and persons and to prevent
color overflow. In the training process, we integrate classification and semantic
parsing features into the colorization generation network to improve colorization. Through
the design of the classification and parsing sub-networks, the accuracy of image colorization
is improved and the boundaries between image parts become clearer. Moreover,
we propose a novel Modern Historical Movies Dataset (MHMD) containing 1,353,166
images and 42 labels of eras, nationalities, and garment types for automatic colorization,
drawn from 147 historical movies or TV series made in modern times. Various quantitative
and qualitative comparisons demonstrate that our method outperforms state-of-the-art
colorization methods, especially on military uniforms, which are given correct colors according
to the historical literature.

Personalized Multi-modal Video Retrieval on Mobile Devices

  • Haotian Zhang
  • Allan D. Jepson
  • Iqbal Mohomed
  • Konstantinos G. Derpanis
  • Ran Zhang
  • Afsaneh Fazly

Current video retrieval systems on mobile devices cannot process complex natural language
queries, especially if they contain personalized concepts, such as proper names. To
address these shortcomings, we propose an efficient and privacy-preserving video retrieval
system that works well with personalized queries containing proper names, without
re-training using personalized labelled data from users. Our system first computes
an initial ranking of a video collection by using a generic attention-based video-text
matching model (i.e., a model designed for non-personalized queries), and then uses
a face detector to conduct personalized adjustments to these initial rankings. These
adjustments are done by reasoning over the face information from the detector and
the attention information provided by the generic model. We show that our system significantly
outperforms existing keyword-based retrieval systems, and achieves comparable performance
to the generic matching model fine-tuned on plenty of labelled data. Our results suggest
that the proposed system can effectively capture both semantic context and personalized
information in queries.

Boosting End-to-end Multi-Object Tracking and Person Search via Knowledge Distillation

  • Wei Zhang
  • Lingxiao He
  • Peng Chen
  • Xingyu Liao
  • Wu Liu
  • Qi Li
  • Zhenan Sun

Multi-Object Tracking (MOT) and Person Search both require localizing and identifying
specific targets in raw image frames. Existing methods fall into two
categories: two-step strategies and end-to-end strategies. Two-step approaches
achieve high accuracy but suffer from costly computation, while end-to-end methods are
more efficient but offer limited performance. In this paper, we dissect the gap between
two-step and end-to-end strategies and propose a simple yet effective end-to-end framework
with knowledge distillation. Our proposed framework is simple in concept and can easily
benefit from external datasets. Experimental results demonstrate that our model
performs competitively with other sophisticated two-step and end-to-end methods in
multi-object tracking and person search.

A Virtual Character Generation and Animation System for E-Commerce Live Streaming

  • Li Hu
  • Bang Zhang
  • Peng Zhang
  • Jinwei Qi
  • Jian Cao
  • Daiheng Gao
  • Haiming Zhao
  • Xiaoduan Feng
  • Qi Wang
  • Lian Zhuo
  • Pan Pan
  • Yinghui Xu

Virtual characters have been widely adopted in many areas, such as virtual assistants,
virtual customer service, and robotics. In this paper, we focus on their application
in e-commerce live streaming. In particular, we propose a virtual character generation
and animation system that supports e-commerce live streaming with virtual characters
as anchors. The system offers a virtual character face generation tool based on a
weakly supervised 3D face reconstruction method. The method takes a single photo as
input and generates a 3D face model with both similarity and aesthetics considered.
It does not require 3D face annotation data thanks to a differentiable neural
rendering technique that seamlessly integrates rendering into a deep-learning-based
3D face reconstruction framework. Moreover, the system provides two animation approaches
that support two different modes of live streaming. The first approach is
based on real-time motion capture. An actor's performance is captured in real-time
via a monocular camera, and then utilized for animating a virtual anchor. The second
approach is text-driven animation, in which human-like animation is automatically
generated from a text script. The relationship between text script and animation
is learned from training data that can be accumulated via the motion-capture-based
animation. To the best of our knowledge, the presented work is the first sophisticated
virtual character generation and animation system that is designed for e-commerce
live streaming and actually deployed on an online shopping platform with millions
of daily audiences.

Improving Fake News Detection by Using an Entity-enhanced Framework to Fuse Diverse
Multimodal Clues

  • Peng Qi
  • Juan Cao
  • Xirong Li
  • Huan Liu
  • Qiang Sheng
  • Xiaoyue Mi
  • Qin He
  • Yongbiao Lv
  • Chenyang Guo
  • Yingchao Yu

Recently, fake news with text and images has achieved more effective diffusion than
text-only fake news, making multimodal fake news detection a pressing issue. Current
studies on this issue have made significant contributions to developing multimodal
models, but they fall short of modeling the multimodal content sufficiently. Most
of them only preliminarily model the basic semantics of the images as a supplement
to the text, which limits their performance on detection. In this paper, we find three
valuable text-image correlations in multimodal fake news: entity inconsistency, mutual
enhancement, and text complementation. To effectively capture these multimodal clues,
we innovatively extract visual entities (such as celebrities and landmarks) to understand
the news-related high-level semantics of images, and then model the multimodal entity
inconsistency and mutual enhancement with the help of visual entities. Moreover, we
extract the embedded text in images as the complementation of the original text. All
things considered, we propose a novel entity-enhanced multimodal fusion framework,
which simultaneously models three cross-modal correlations to detect diverse multimodal
fake news. Extensive experiments demonstrate the superiority of our model compared
to the state of the art.

SESSION: Session 11: Multimedia HCI and Quality of Experience

Fast Video Visual Quality and Resolution Improvement using SR-UNet

  • Federico Vaccaro
  • Marco Bertini
  • Tiberio Uricchio
  • Alberto Del Bimbo

In this paper, we address the problem of real-time video quality enhancement, considering
both frame super-resolution and compression artifact-removal. The first operation
increases the sampling resolution of video frames, the second removes visual artifacts
such as blurriness, noise, aliasing, or blockiness introduced by lossy compression
techniques, such as JPEG encoding for single-images, or H.264/H.265 for video data.

We propose to use SR-UNet, a novel network architecture based on UNet, that has been
specialized for fast visual quality improvement (i.e. capable of operating in less
than 40ms, to be able to operate on videos at 25FPS). We show how this network can
be used in a streaming context where the content is generated live, e.g. in video
calls, and how it can be optimized when videos to be streamed are prepared in advance.
The network can be used as a final post-processing step, to optimize the visual appearance
of a frame before showing it to the end-user in a video player. Thus, it can be applied
without any change to existing video coding and transmission pipelines.

Experiments carried out on standard video datasets, also considering H.265 compression,
show that the proposed approach is able to either improve visual quality metrics given
a fixed bandwidth budget, or video distortion given a fixed quality goal.
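
The 40 ms figure quoted above follows directly from the target frame rate. The short sketch below works through that arithmetic; the function names are hypothetical and are not part of SR-UNet.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def meets_real_time(latency_ms, fps):
    """True if the per-frame latency stays below the frame budget."""
    return latency_ms < frame_budget_ms(fps)

budget = frame_budget_ms(25)  # per-frame budget at 25 FPS
```

At 25 FPS the budget is 1000 / 25 = 40 ms, so any enhancement step that runs as post-processing in the player must finish within that window to avoid dropping frames.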

MS-GraphSIM: Inferring Point Cloud Quality via Multiscale Graph Similarity

  • Yujie Zhang
  • Qi Yang
  • Yiling Xu

To address the point cloud quality assessment (PCQA) problem, GraphSIM was proposed;
it jointly considers geometric and color features and shows compelling performance
in multiple distortion detection. However, GraphSIM does not take into account the
multiscale characteristics of human perception. In this paper, we propose a multiscale
PCQA model, called Multiscale Graph Similarity (MS-GraphSIM), that can better predict
human subjective perception. First, exploring the multiscale processing method used
in image processing, we introduce a multiscale representation of point clouds based
on graph signal processing. Second, we extend GraphSIM into a multiscale version based
on the proposed multiscale representation. Specifically, MS-GraphSIM constructs a
multiscale representation for each local patch extracted from the reference point
cloud or the distorted point cloud, and then fuses GraphSIM at different scales to
obtain an overall quality score. Experimental results demonstrate that the proposed
MS-GraphSIM outperforms the state-of-the-art PCQA metrics over two fairly large and
independent databases. Ablation studies further prove the proposed MS-GraphSIM is
robust to different model hyperparameter settings. The code is available at

I Know Your Keyboard Input: A Robust Keystroke Eavesdropper Based-on Acoustic Signals

  • Jia-Xuan Bai
  • Bin Liu
  • Luchuan Song

Recently, smart devices equipped with microphones have become increasingly popular
in people's lives. However, when users type on a keyboard near devices with microphones,
the acoustic signals generated by different keystrokes may leak the user's privacy.
This paper proposes a robust side-channel attack scheme to infer keystrokes on a
nearby keyboard, leveraging the smart devices' microphones. To address the challenge
of non-cooperative attacking environments, we propose an efficient scheme to estimate
the relative position between the microphones and the keyboard, and extract two robust
features from the acoustic signals to alleviate the impact of various victims and
keyboards. As a result, we can realize the side-channel attack through acoustic signals,
regardless of the exact location of microphones, the victims, and the type of keyboards.
We implement the proposed scheme on a commercial smartphone and conduct extensive
experiments to evaluate its performance. Experimental results show that the proposed
scheme could achieve good performance in predicting keyboard input under various conditions.
Overall, we can correctly identify 91.2% of keystrokes with 10-fold cross-validation.
When predicting keystrokes from unknown victims, the attack can obtain a Top-5 accuracy
of 91.52%. Furthermore, the Top-5 accuracy of predicting keystrokes can reach 72.25%
when the victims and keyboards are both unknown. When predicting meaningful contents,
we can obtain a Top-5 accuracy of 96.67% for the words entered by the victim.
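
The Top-5 accuracy reported above counts a prediction as correct when the true key appears among the five highest-scoring candidates. A minimal sketch of that metric follows; the scores and labels are hypothetical and do not come from the paper.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Hypothetical classifier scores over three candidate keys for three keystrokes
example_scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
example_labels = [2, 1, 2]
```

Top-k accuracy can only grow as k increases, which is why a Top-5 figure can be higher than the exact-match rate for the same classifier.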

Perceptual Quality Assessment of Internet Videos

  • Jiahua Xu
  • Jing Li
  • Xingguang Zhou
  • Wei Zhou
  • Baichao Wang
  • Zhibo Chen

With the fast proliferation of online video sites and social media platforms, user-,
professionally-, and occupationally-generated content (UGC, PGC, OGC) videos are streamed
and explosively shared over the Internet. Consequently, it is urgent to monitor the
content quality of these Internet videos to guarantee the user experience. However,
most existing modern video quality assessment (VQA) databases only include UGC videos
and cannot meet the demands for other kinds of Internet videos with real-world distortions.
To this end, we collect 1,072 videos from Youku, a leading Chinese video hosting service
platform, to establish the Internet video quality assessment database (Youku-V1K).
A special sampling method based on several quality indicators is adopted to maximize
the content and distortion diversities within a limited database, and a probabilistic
graphical model is applied to recover reliable labels from noisy crowdsourcing annotations.
Based on the properties of Internet videos originating from Youku, we propose a spatio-temporal
distortion-aware model (STDAM). First, the model works blindly, meaning that the pristine
video is not required. Second, the model becomes familiar with diverse content by pre-training
on the large-scale image quality assessment databases. Third, to measure spatial and
temporal distortions, we introduce the graph convolution and attention module to extract
and enhance the features of the input video. Besides, we leverage the motion information
and integrate the frame-level features into video-level features via a bi-directional
long short-term memory network. Experimental results on the self-built database and
the public VQA databases demonstrate that our model outperforms the state-of-the-art
methods and exhibits promising generalization ability.

Using Interaction Data to Predict Engagement with Interactive Media

  • Jonathan Carlton
  • Andy Brown
  • Caroline Jay
  • John Keane

Media is evolving from traditional linear narratives to personalised experiences,
where control over information (or how it is presented) is given to individual audience
members. Measuring and understanding audience engagement with this media is important
in at least two ways: (1) a post-hoc understanding of how engaged audiences are with
the content will help production teams learn from experience and improve future productions;
(2) this type of media has potential for real-time measures of engagement to be used
to enhance the user experience by adapting content on-the-fly. Engagement is typically
measured by asking samples of users to self-report, which is time consuming and expensive.
In some domains, however, interaction data have been used to infer engagement. Fortuitously,
the nature of interactive media facilitates a much richer set of interaction data
than traditional media; our research aims to understand if these data can be used
to infer audience engagement. In this paper, we report a study using data captured
from audience interactions with an interactive TV show to model and predict engagement.
We find that temporal metrics, including overall time spent on the experience and
the interval between events, are predictive of engagement. The results demonstrate
that interaction data can be used to infer users' engagement during and after an experience,
and the proposed techniques are relevant to better understand audience preference
and responses.
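
The temporal metrics named above (overall time spent and the interval between events) can be derived from raw interaction timestamps. The sketch below assumes each session is a list of event times in seconds; the feature names are hypothetical and the authors' exact feature set is not given in the abstract.

```python
def temporal_metrics(event_times):
    """Summarize interaction timestamps (in seconds) for one viewing session."""
    if len(event_times) < 2:
        raise ValueError("need at least two events to measure intervals")
    ordered = sorted(event_times)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return {
        "total_time": ordered[-1] - ordered[0],
        "mean_interval": sum(intervals) / len(intervals),
        "max_interval": max(intervals),
    }

# Hypothetical session: timestamps of four interaction events
features = temporal_metrics([0.0, 2.0, 5.0, 11.0])
```

Features like these could then be fed to any standard classifier or regressor to predict a self-reported engagement score.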

Air-Text: Air-Writing and Recognition System

  • Sun-Kyung Lee
  • Jong-Hwan Kim

Text entry plays an important role in effectively delivering the intention of users
to computers, where physical and soft keyboards have been widely used. However, with
the recent trends of developing technologies like augmented reality and increasing
contactless services due to COVID-19, a more advanced type of text entry is required.
To tackle this issue, we propose Air-Text which is an intuitive system to write in
the air using fingertips as a pen. Unlike previously suggested air-writing systems,
Air-Text provides various functionalities by the seamless integration of air-writing
and text-recognition modules. Specifically, the air-writing module takes a sequence
of RGB images as input and tracks both the location of fingertips (5.33-pixel error
in a 640x480 image) and the current hand gesture class (98.29% classification accuracy)
frame by frame. Users can easily perform writing operations such as writing or deleting
a text by changing hand gestures, and tracked fingertip locations can be stored as
a binary image. Then the text-recognition module, which is compatible with any pre-trained
recognition model, predicts the written text in the binary image. In this paper, examples
of single-digit recognition with an MNIST classifier (96.0% accuracy) and word-level
recognition with a text recognition model (79.36% character recognition rate) are provided.
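
The step of storing tracked fingertip locations as a binary image can be sketched as follows. This is an illustrative rasterizer under assumed integer pixel coordinates, not the authors' implementation; it linearly interpolates between consecutive positions so that strokes stay connected.

```python
def trajectory_to_binary_image(points, width, height):
    """Rasterize tracked (x, y) fingertip positions into a 0/1 image (list of rows)."""
    image = [[0] * width for _ in range(height)]
    if len(points) == 1:
        x, y = points[0]
        if 0 <= x < width and 0 <= y < height:
            image[y][x] = 1
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Interpolate between consecutive positions so the stroke has no gaps
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            x = round(x0 + (x1 - x0) * s / steps)
            y = round(y0 + (y1 - y0) * s / steps)
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = 1
    return image

# Hypothetical stroke: right along a row, then down a column
stroke = trajectory_to_binary_image([(1, 1), (4, 1), (4, 3)], width=6, height=5)
```

The resulting image can be resized to whatever input shape a pre-trained recognizer expects, for example 28x28 for an MNIST classifier.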

SESSION: Session 12: Multimodal Analysis and Description-I

How to Learn a Domain-Adaptive Event Simulator?

  • Daxin Gu
  • Jia Li
  • Yu Zhang
  • Yonghong Tian

The low-latency streams captured by event cameras have shown impressive potential
in addressing vision tasks such as video reconstruction and optical flow estimation.
However, these tasks often require massive training event streams, which are expensive
to collect, a cost largely bypassed by recently proposed event camera simulators. To align
the statistics of synthetic events with those of target event cameras, existing simulators
often need to be heuristically tuned with elaborate manual effort and thus are
unable to adapt automatically to various domains. To address this issue, this
work proposes one of the first learning-based, domain-adaptive event simulators. Given
a specific domain, the proposed simulator learns pixel-wise distributions of event
contrast thresholds that, after stochastic sampling and parallel rendering, can
generate event representations well aligned with those of data from real
event cameras. To achieve such domain-specific alignment, we design a novel divide-and-conquer
discrimination scheme that adaptively evaluates the synthetic-to-real consistency
of event representations according to the local statistics of images and events. Trained
with the data synthesized by the proposed simulator, the performance of state-of-the-art
event-based video reconstruction and optical flow estimation approaches is boosted
by up to 22.9% and 2.8%, respectively. In addition, we show significantly improved domain
adaptation capability over existing event simulators and tuning strategies, consistently
on three real event datasets.
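
To make the role of contrast thresholds concrete, the sketch below uses the common event-camera model in which a pixel fires whenever its log intensity has changed by at least its threshold since the last event. The Gaussian threshold distribution, its parameters, and the function name are assumptions for illustration; in the paper the pixel-wise distributions are learned, and rendering is done in parallel rather than with Python loops.

```python
import math
import random

def simulate_events(frames, threshold_mean, threshold_std, seed=0, eps=1e-3):
    """Generate (t, pixel, polarity) events from per-pixel intensity sequences.

    frames: list of frames, each a flat list of non-negative pixel intensities.
    threshold_mean, threshold_std: per-pixel Gaussian parameters (hypothetical
    stand-ins for a learned pixel-wise threshold distribution).
    """
    rng = random.Random(seed)
    n_pixels = len(frames[0])
    # Stochastic sampling: one contrast threshold per pixel, kept strictly positive
    thresholds = [max(0.01, rng.gauss(threshold_mean[i], threshold_std[i]))
                  for i in range(n_pixels)]
    # Log intensity at which each pixel last fired
    reference = [math.log(v + eps) for v in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for i, value in enumerate(frame):
            delta = math.log(value + eps) - reference[i]
            # Emit one event per threshold crossing since the last reference level
            while abs(delta) >= thresholds[i]:
                polarity = 1 if delta > 0 else -1
                events.append((t, i, polarity))
                reference[i] += polarity * thresholds[i]
                delta = math.log(value + eps) - reference[i]
    return events

# Two pixels over three frames: pixel 0 is static, pixel 1 brightens then darkens
frames = [[1.0, 1.0], [1.0, 2.0], [1.0, 0.5]]
events = simulate_events(frames, threshold_mean=[0.3, 0.3], threshold_std=[0.0, 0.0])
```

Lower thresholds yield denser event streams for the same video, which is why matching the threshold distribution to a target camera matters for synthetic-to-real alignment.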

A Stepwise Matching Method for Multi-modal Image based on Cascaded Network

  • Jinming Mu
  • Shuiping Gou
  • Shasha Mao
  • Shankui Zheng

Template matching of multi-modal images remains a challenge in image matching, and
it is difficult to balance speed and accuracy, especially for images with
large sizes. To address this, we propose a stepwise image matching method that achieves
precise localization through coarse-to-fine image matching with cascaded networks.
In the proposed method, a coarse-grained matching network is first constructed to
locate a rough matching position based on cross-correlating features of optical and
SAR images. Specifically, to make the matching position more credible, a suppression network
is designed to evaluate the obtained cross-correlation feature and is added to
the coarse-grained network as feedback. Second, a fine-grained matching network
is constructed based on the obtained rough matching result to gain a more precise
match. In this part, ternary groups are used to construct the training samples.
Interestingly, we use regions offset by a few pixels as the negative class,
which effectively distinguishes similar neighbourhoods of the rough matching position.
Moreover, a modified Siamese network is used to extract features of SAR and optical
images, respectively. Finally, experimental results illustrate that the proposed method
obtains more precise matching compared with the state-of-the-art methods.

SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis

  • Naili Xing
  • Sai Ho Yeung
  • Cheng-Hao Cai
  • Teck Khim Ng
  • Wei Wang
  • Kaiyuan Yang
  • Nan Yang
  • Meihui Zhang
  • Gang Chen
  • Beng Chin Ooi

Deep learning has achieved great success in a wide spectrum of multimedia applications
such as image classification, natural language processing and multimodal data analysis.
Recent years have seen the development of many deep learning frameworks that provide
a high-level programming interface for users to design models, conduct training and
deploy inference. However, it remains challenging to build an efficient end-to-end
multimedia application with most existing frameworks. Specifically, in terms of usability,
it is demanding for non-experts to implement deep learning models, obtain the right
settings for the entire machine learning pipeline, manage models and datasets, and
exploit external data sources all together. Further, in terms of adaptability, elastic
computation solutions are much needed as the actual serving workload fluctuates constantly,
and scaling the hardware resources to handle the fluctuating workload is typically
infeasible. To address these challenges, we introduce SINGA-Easy, a new deep learning
framework that provides distributed hyper-parameter tuning at the training stage,
dynamic computational cost control at the inference stage, and intuitive user interactions
with multimedia contents facilitated by model explanation. Our experiments on the
training and deployment of multi-modality data analysis applications show that the
framework is both usable and adaptable to dynamic inference loads. We implement SINGA-Easy
on top of Apache SINGA and demonstrate our system with the entire machine learning
life cycle.

Informative Class-Conditioned Feature Alignment for Unsupervised Domain Adaptation

  • Wanxia Deng
  • Yawen Cui
  • Zhen Liu
  • Gangyao Kuang
  • Dewen Hu
  • Matti Pietikäinen
  • Li Liu

The goal of unsupervised domain adaptation is to learn a task classifier that performs
well for the unlabeled target domain by borrowing rich knowledge from a well-labeled
source domain. Although remarkable breakthroughs have been achieved in learning transferable
representation across domains, two bottlenecks remain to be further explored. First,
many existing approaches focus primarily on the adaptation of the entire image, ignoring
the limitation that not all features are transferable and informative for the object
classification task. Second, the features of the two domains are typically aligned
without considering the class labels; this can lead the resulting representations
to be domain-invariant but non-discriminative to the category. To overcome the two
issues, we present a novel Informative Class-Conditioned Feature Alignment (IC2FA)
approach for UDA, which utilizes a twofold method: informative feature disentanglement
and class-conditioned feature alignment, designed to address the above two challenges,
respectively. More specifically, to surmount the first drawback, we cooperatively
disentangle the two domains to obtain informative transferable features; here, Variational
Information Bottleneck (VIB) is employed to encourage the learning of task-related
semantic representations and suppress task-unrelated information. With regard to the
second bottleneck, we optimize a new metric, termed Conditional Sliced Wasserstein
Distance (CSWD), which explicitly estimates the intra-class discrepancy and the inter-class
margin. The intra-class and inter-class CSWDs are minimized and maximized, respectively,
to yield the domain-invariant discriminative features. IC2FA equips class-conditioned
feature alignment with informative feature disentanglement and causes the two procedures
to work cooperatively, which facilitates the adaptation of informative, discriminative features.
Extensive experimental results on three domain adaptation datasets confirm the superiority
of IC2FA.
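
As background for the CSWD metric, the sketch below estimates the standard sliced Wasserstein distance between two equally sized samples and then applies it per class. It is not the authors' CSWD, whose exact definition is not given in the abstract; the feature values and class grouping are hypothetical.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    xs, ys: equally sized lists of d-dimensional points (lists of floats).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        # Project both samples onto the direction and sort the 1-D values
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in ys)
        # In 1-D, optimal transport matches sorted values pairwise
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections

# Hypothetical 2-D features grouped by class label for source and target domains
source = {0: [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]], 1: [[3.0, 3.0], [3.2, 2.9], [2.9, 3.1]]}
target = {0: [[0.1, 0.1], [0.3, 0.0], [0.0, 0.2]], 1: [[3.1, 3.0], [3.0, 3.2], [2.8, 2.9]]}
intra = sliced_wasserstein(source[0], target[0])  # same class across domains
inter = sliced_wasserstein(source[0], target[1])  # different classes across domains
```

In a class-conditioned scheme of the kind described, the same-class distance would be driven down and the different-class distance pushed up; target-domain class labels would have to come from pseudo-labels, since the target domain is unlabeled.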

Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer

  • Zhaoquan Yuan
  • Xiao Peng
  • Xiao Wu
  • Changsheng Xu

Diagram question answering (DQA) is an effective way to evaluate the reasoning ability
for diagram semantic understanding, which is a very challenging task and largely understudied
compared with natural images. Existing separate two-stage methods for DQA are limited
by ineffective feedback mechanisms. To address this problem, in this paper, we propose
a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model
for diagram question answering based on a multi-modal transformer framework. In the
proposed multi-task learning paradigm, the two tasks of diagram structural parsing
and question answering sit at different semantic levels and are equipped with different
transformer blocks, which constitutes a hierarchical architecture. The structural
parsing module encodes the information of constituents and their relationships in
diagrams, while the diagram question answering module decodes the structural signals
and combines question-answers to infer correct answers. Visual diagrams and textual
question-answers interact in the multi-modal transformer, which achieves cross-modal
semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D
and FOODWEBS datasets demonstrate the effectiveness of our proposed HMTL over other
state-of-the-art methods.

Differentiated Learning for Multi-Modal Domain Adaptation

  • Jianming Lv
  • Kaijie Liu
  • Shengfeng He

Directly deploying a trained multi-modal classifier to a new environment usually leads
to poor performance due to the well-known domain shift problem. Existing multi-modal
domain adaptation methods treat each modality equally and optimize the sub-models
of different modalities synchronously. However, as observed in this paper, the degrees
of domain shift in different modalities are usually diverse. We propose a novel Differentiated
Learning framework to make use of the diversity between multiple modalities for more
effective domain adaptation. Specifically, we model the classifiers of different modalities
as a group of teacher/student sub-models, and a novel Prototype based Reliability
Measurement is presented to estimate the reliability of the recognition results made
by each sub-model on the target domain. The more reliable results are then selected as
teaching materials for all sub-models in the group. Considering the diversity of different
modalities, each sub-model performs Asynchronous Curriculum Learning by choosing
the teaching materials from easy to hard, as measured by itself. Furthermore, a reliability-aware
fusion scheme is proposed to combine all optimized sub-models to support the final decision.
Comprehensive experiments on three multi-modal datasets with different learning
tasks show the superior performance of our model
compared with state-of-the-art multi-modal domain adaptation models.

SESSION: Session 13: Multimodal Analysis and Description-II

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

  • Yang Jiao
  • Zequn Jie
  • Weixin Luo
  • Jingjing Chen
  • Yu-Gang Jiang
  • Xiaolin Wei
  • Lin Ma

Referring Image Segmentation (RIS) aims to segment the target object in an image
referred to by a given natural language expression. The diverse and flexible expressions
and complex visual content in images place higher demands on RIS models
to investigate fine-grained matching behaviors between words in expressions and
objects presented in images. However, such matching behaviors are hard to learn
and capture when the visual cues of referents (i.e., referred objects) are insufficient,
as referents with weak visual cues tend to be easily confused with cluttered background
at boundaries or even overwhelmed by salient objects in the image. Moreover, the insufficient
visual cues issue cannot be handled by the cross-modal fusion mechanisms used
in previous work. In this paper, we tackle this problem from a novel perspective of
enhancing the visual information for the referents by devising a Two-stage Visual
cues enhancement Network (TV-Net), where a novel Retrieval and Enrichment Scheme (RES)
and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically,
RES retrieves the most relevant image from an external data pool with regard to both
the visual and textual similarities, and then enriches the visual information of the
referent with the retrieved image for better multimodal feature learning. AMF further
enhances the visual detailed information by incorporating the high-resolution feature
maps from lower convolution layers of the image. Through the two-stage enhancement,
our proposed TV-Net learns fine-grained matching behaviors between the natural language
expression and image more effectively, especially when the visual information
of the referent is inadequate, and thus produces better segmentation results. Extensive
experiments are conducted to validate the effectiveness of the proposed method on
the RIS task, with our proposed TV-Net surpassing the state-of-the-art approaches
on four benchmark datasets.

Partial Tubal Nuclear Norm Regularized Multi-view Learning

  • Yongyong Chen
  • Shuqin Wang
  • Chong Peng
  • Guangming Lu
  • Yicong Zhou

Multi-view clustering and multi-view dimension reduction explore ubiquitous and complementary
information between multiple features to enhance clustering and recognition performance.
However, multi-view clustering and multi-view dimension reduction are treated independently,
ignoring the underlying correlations between them. In addition, previous methods mainly
focus on using the tensor nuclear norm for low-rank representation to explore the
high correlation of multi-view features, which often biases the estimate of
the tensor rank. To overcome these limitations, we propose the partial tubal nuclear
norm regularized multi-view learning (PTN2ML) method, in which the partial tubal nuclear
norm, a non-convex surrogate of the tensor tubal multi-rank, minimizes only the
partial sum of the smaller tubal singular values to preserve the low-rank property
of the self-representation tensor. PTN2ML pursues the latent representation from the
projection space rather than from the input space to reveal the structural consensus
and suppress the disturbance of noisy data. The proposed method can be efficiently
optimized by the alternating direction method of multipliers. Extensive experiments,
including multi-view clustering and multi-view dimension reduction, substantiate the
superiority of the proposed method over state-of-the-art methods.

Deep Unsupervised 3D SfM Face Reconstruction Based on Massive Landmark Bundle Adjustment

  • Yuxing Wang
  • Yawen Lu
  • Zhihua Xie
  • Guoyu Lu

We address the problem of reconstructing a 3D human face from multi-view facial images
using Structure-from-Motion (SfM) based on deep neural networks. While recent learning-based
monocular view methods have shown impressive results for 3D facial reconstruction,
the single-view setting is easily affected by depth ambiguities and poor face pose
issues. In this paper, we propose a novel unsupervised 3D face reconstruction architecture
by leveraging the multi-view geometry constraints to train accurate face pose and
depth maps. Facial images from multiple perspectives of each 3D face model are input
to train the network. Multi-view geometry constraints are fused into the unsupervised
network by establishing loss constraints from spatial and spectral perspectives. To
give the trained 3D face more detail, a facial landmark detector is used to
acquire massive facial information to constrain face pose and depth estimation. By
minimizing the displacement distance of massive landmarks through bundle adjustment, an accurate
3D face model can be reconstructed. Extensive experiments demonstrate the superiority
of our proposed approach over other methods.

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

  • Zhijie Lin
  • Zhou Zhao
  • Haoyuan Li
  • Jinglin Liu
  • Meng Zhang
  • Xingshan Zeng
  • Xiaofei He

Lip reading, aiming to recognize spoken sentences according to the given video of
lip movements without relying on the audio stream, has attracted great interest due
to its application in many scenarios. Although prior works that explore lip reading
have obtained salient achievements, they are all trained in a non-simultaneous manner
where generating predictions requires access to the full video. To break through
this constraint, we study the task of simultaneous lip reading and devise SimulLR,
a simultaneous lip reading transducer with attention-guided adaptive memory from three
aspects: (1) To address the challenge of monotonic alignments while considering the
syntactic structure of the generated sentences under simultaneous setting, we build
a transducer-based model and design several effective training strategies including
CTC pre-training, model warm-up and curriculum learning to promote the training of
the lip reading transducer. (2) To learn better spatio-temporal representations for
the simultaneous encoder, we construct a truncated 3D convolution and a time-restricted
self-attention layer to perform frame-to-frame interaction within a video segment
containing a fixed number of frames. (3) The history information is always limited due
to the storage in real-time scenarios, especially for massive video data. Therefore,
we devise a novel attention-guided adaptive memory to organize semantic information
of history segments and enhance the visual representations with acceptable computation-aware
latency. The experiments show that SimulLR achieves a translation speedup of 9.10x
compared with the state-of-the-art non-simultaneous methods, and also obtains competitive
results, which indicates the effectiveness of our proposed methods.
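
As a rough illustration of the time-restricted self-attention described in (2), the sketch below builds a boolean mask that lets each frame attend only to frames in its own fixed-size segment. The function name, the non-overlapping segment layout, and the sizes are assumptions made for illustration; the paper's actual layer may restrict attention differently.

```python
def segment_attention_mask(num_frames, segment_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (not from the paper): segments are non-overlapping and
    every segment holds `segment_size` frames, except possibly the last.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        # Two frames interact only when they fall in the same segment.
        [i // segment_size == j // segment_size for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = segment_attention_mask(num_frames=5, segment_size=2)
for row in mask:
    # Print one row per query frame: 1 = may attend, 0 = blocked.
    print("".join("1" if allowed else "0" for allowed in row))
```

A mask of this shape would typically be applied to the attention scores before the softmax, so that blocked positions receive zero weight.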

Dense Semantic Contrast for Self-Supervised Visual Representation Learning

  • Xiaoni Li
  • Yu Zhou
  • Yifei Zhang
  • Aoting Zhang
  • Wei Wang
  • Ning Jiang
  • Haiying Wu
  • Weiping Wang

Self-supervised representation learning for visual pre-training has achieved remarkable
success with sample (instance or pixel) discrimination and semantics discovery of
instance, whereas there still exists a non-negligible gap between pre-trained model
and downstream dense prediction tasks. Concretely, these downstream tasks require
more accurate representation; in other words, the pixels from the same object must
belong to a shared semantic category, which previous methods do not ensure. In
this work, we present Dense Semantic Contrast (DSC) for modeling semantic category
decision boundaries at a dense level to meet the requirement of these tasks. Furthermore,
we propose a dense cross-image semantic contrastive learning framework for multi-granularity
representation learning. Specifically, we explicitly explore the semantic structure of
the dataset by mining relations among pixels from different perspectives. For intra-image
relation modeling, we discover pixel neighbors from multiple views. For inter-image
relations, we enforce pixel representations from the same semantic class to be more
similar than representations from different classes in one mini-batch. Experimental
results show that our DSC model outperforms state-of-the-art methods when transferring
to downstream dense prediction tasks, including object detection, semantic segmentation,
and instance segmentation. Code will be made available.
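
The inter-image constraint above (same-class pixel representations should be more similar than different-class ones within a mini-batch) can be sketched with a generic class-aware contrastive loss. This is not the DSC objective itself: the function names, the cosine similarity, the temperature value, and the use of given labels in place of discovered semantic classes are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_contrastive_loss(features, labels, temperature=0.1):
    """Average InfoNCE-style loss where positives share the anchor's label.

    `features` is a list of pixel embeddings and `labels` their (assumed)
    semantic classes. Anchors without any positive are skipped.
    """
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled similarities between the anchor and every other sample.
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Mean negative log-probability of picking a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(class_contrastive_loss(pixels, labels=[0, 0, 1, 1]))
```

The loss is small when embeddings with the same label are already close, and grows when same-label embeddings are far apart relative to the other samples in the batch.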

Multiple Object Tracking by Trajectory Map Regression with Temporal Priors Embedding

  • Xingyu Wan
  • Sanping Zhou
  • Jinjun Wang
  • Rongye Meng

Prevailing Multiple Object Tracking (MOT) works following the Tracking-by-Detection
(TBD) paradigm pay most attention to either object detection in a first step or data
association in a second step. In this paper, we approach the MOT problem from a different
perspective by directly obtaining the embedded spatial-temporal information of trajectories
from raw video data. For this purpose, we propose a joint trajectory locating and attribute
encoding framework for real-time, online MOT. We first introduce a trajectory attribute
representation scheme designed for each tracked target (instead of object) where the
extracted Trajectory Map (TM) encodes the spatial-temporal attributes of a trajectory
across a window of consecutive video frames. Next we present a Temporal Priors Embedding
(TPE) methodology to infer these attributes with a logical reasoning strategy based
on long-term feature dynamics. The proposed MOT framework projects multiple attributes
of tracked targets, e.g., presence, enter/exit, location, scale, motion, etc. into
a continuous TM to perform one-shot regression for real-time MOT. Experimental results
show that our proposed video-based method runs at 33 FPS and is more accurate and
robust than detection-based tracking methods and a few other state-of-the-art
(SOTA) approaches on the MOT16/17/20 benchmarks.

SESSION: Session 14: Multimedia Cloud, Edge and Device Computing

DeepGame: Efficient Video Encoding for Cloud Gaming

  • Omar Mossad
  • Khaled Diab
  • Ihab Amer
  • Mohamed Hefeeda

Cloud gaming enables users to play games on virtually any device. This is achieved
by offloading the game rendering and encoding to cloud datacenters. As game resolutions
and frame rates increase, cloud gaming platforms face a major challenge to stream
high quality games due to the high bandwidth and low latency requirements. In this
paper, we propose a new video encoding pipeline, called DeepGame, for cloud gaming
platforms to reduce the bandwidth requirements with limited to no impact on the player
quality of experience. DeepGame learns the player's contextual interest in the game
and the temporal correlation of that interest using a spatio-temporal deep neural
network. Then, it encodes various areas in the video frames with different quality
levels proportional to their contextual importance. DeepGame does not change the source
code of the video encoder or the video game, and it does not require any additional
hardware or software at the client side. We implemented DeepGame in an open-source
cloud gaming platform and evaluated its performance using multiple popular games.
We also conducted a subjective study with real players to demonstrate the potential
gains achieved by DeepGame and its practicality. Our results show that DeepGame can
reduce the bandwidth requirements by up to 36% compared to the baseline encoder, while
maintaining the same level of perceived quality for players and running in real time.
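
To make the idea of encoding regions at quality levels tied to their importance more concrete, the sketch below maps a grid of per-block importance weights to quantization parameter (QP) offsets, where a negative offset means finer quantization and higher quality. The linear mapping, the function name, and the offset range are hypothetical and are not taken from DeepGame.

```python
def qp_offsets(importance, max_offset=6):
    """Map a grid of per-block importance weights to integer QP offsets.

    Hypothetical linear rule: the most important block gets -max_offset
    (finer quantization, higher quality) and the least important block
    gets +max_offset (coarser quantization, fewer bits).
    """
    flat = [w for row in importance for w in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Uniform importance: leave the encoder's base QP untouched.
        return [[0 for _ in row] for row in importance]
    return [
        [round(max_offset * (1 - 2 * (w - lo) / (hi - lo))) for w in row]
        for row in importance
    ]


# A 2x2 grid of blocks; the top-right block matters most to the player.
print(qp_offsets([[0.0, 1.0], [0.5, 0.5]]))  # [[6, -6], [0, 0]]
```

Offsets of this kind would be handed to an encoder that supports region-of-interest quantization, which is one way to change per-region quality without modifying the encoder's source code.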

ChartPointFlow for Topology-Aware 3D Point Cloud Generation

  • Takumi Kimura
  • Takashi Matsubara
  • Kuniaki Uehara

A point cloud serves as a representation of the surface of a three-dimensional (3D)
shape. Deep generative models have been adapted to model their variations typically
using a map from a ball-like set of latent variables. However, previous approaches
did not pay much attention to the topological structure of a point cloud, even though
a continuous map cannot express the varying numbers of holes and intersections.
Moreover, a point cloud is often composed of multiple subparts, which are also difficult
to express. In this study, we propose ChartPointFlow, a flow-based generative model
with multiple latent labels for 3D point clouds. Each label is assigned to points
in an unsupervised manner. Then, a map conditioned on a label is assigned to a continuous
subset of a point cloud, similar to a chart of a manifold. This enables our proposed
model to preserve the topological structure with clear boundaries, whereas previous
approaches tend to generate blurry point clouds and fail to generate holes. The experimental
results demonstrate that ChartPointFlow achieves state-of-the-art performance in terms
of generation and reconstruction compared with other point cloud generators. Moreover,
ChartPointFlow divides an object into semantic subparts using charts, and it demonstrates
superior performance in unsupervised segmentation.

Co-learning: Learning from Noisy Labels with Self-supervision

  • Cheng Tan
  • Jun Xia
  • Lirong Wu
  • Stan Z. Li

Noisy labels, resulting from mistakes in manual labeling or web data collection
for supervised learning, can cause neural networks to overfit the misleading information
and degrade the generalization performance. Self-supervised learning works in the
absence of labels and thus eliminates the negative impact of noisy labels. Motivated
by co-training with both supervised learning view and self-supervised learning view,
we propose a simple yet effective method called Co-learning for learning with noisy
labels. Co-learning performs supervised learning and self-supervised learning in a
cooperative way. The constraints of intrinsic similarity with the self-supervised
module and the structural similarity with the noisily-supervised module are imposed
on a shared common feature encoder to regularize the network to maximize the agreement
between the two constraints. Co-learning is fairly compared with peer methods on corrupted
data from benchmark datasets, and extensive results are provided which demonstrate
that Co-learning is superior to many state-of-the-art approaches.

Graph Convolutional Multi-modal Hashing for Flexible Multimedia Retrieval

  • Xu Lu
  • Lei Zhu
  • Li Liu
  • Liqiang Nie
  • Huaxiang Zhang

Multi-modal hashing makes an important contribution to multimedia retrieval, where
a key challenge is to encode heterogeneous modalities into compact hash codes. To
address this challenge, graph-based multi-modal hashing methods generally define an individual
affinity matrix for each modality and apply linear algorithms for heterogeneous
modality fusion and compact hash learning. Several other methods construct a graph
Laplacian matrix based on semantic information to help learn discriminative hash codes.
However, these conventional methods largely ignore the structural similarity of the training
set and the complex relations among multi-modal samples, which leads to unsatisfactory
complementarity of fused hash codes. More notably, they face two other important
problems: huge computing and storage costs caused by graph construction, and the loss of
partial modality features when an incomplete query sample arrives. In this paper, we
propose a Flexible Graph Convolutional Multi-modal Hashing (FGCMH) method that adopts
GCNs with linear complexity to preserve both the modality-individual and modality-fused
structural similarity for discriminative hash learning. Consequently, accurate multimedia
retrieval can be performed on complete and incomplete datasets with our method. Specifically,
multiple modality-individual GCNs under semantic guidance are proposed to act on each
individual modality independently for intra-modality similarity preserving, then the
output representations are fused into a fusion graph with adaptive weighting scheme.
Hash GCN and semantic GCN, which share parameters in the first two layers, propagate
fusion information and generate hash codes under high-level label space supervision.
In the query stage, our method adaptively captures various multi-modal contents in
a flexible and robust way, even if partial modality features are lost. Experimental
results on three public datasets show the flexibility and effectiveness of our proposed

Hybrid Network Compression via Meta-Learning

  • Jianming Ye
  • Shiliang Zhang
  • Jingdong Wang

Neural network pruning and quantization are two major lines of network compression.
This raises the natural question of whether we can find the optimal compression by
considering multiple network compression criteria in a unified framework. This paper
incorporates two criteria and seeks layer-wise compression by leveraging the meta-learning
framework. A regularization loss is applied to unify the constraint of input and output
channel numbers, bit-width of network activations and weights, so that the compressed
network can satisfy a given Bit-OPerations counts (BOPs) constraint. We further propose
an iterative compression constraint for optimizing the compression procedure, which
effectively achieves a high compression rate and maintains the original network performance.
Extensive experiments on various networks and vision tasks show that the proposed
method yields better performance and compression rates than recent methods. For instance,
our method achieves better image classification accuracy and compactness than the
recent DJPQ. It achieves performance similar to the recent DHP in image super-resolution,
while saving about 50% of the computation.
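
To show how channel counts and bit-widths jointly enter a BOPs budget, the sketch below uses a common approximation in which a convolution layer's BOPs equal its multiply-accumulate count times the weight and activation bit-widths. The function names and the example layer are assumptions for illustration, and the paper's exact BOPs definition may differ.

```python
def conv_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w):
    """Multiply-accumulate operations of a standard convolution layer."""
    return c_in * c_out * kernel_h * kernel_w * out_h * out_w


def conv_bops(c_in, c_out, kernel_h, kernel_w, out_h, out_w,
              weight_bits, act_bits):
    """Approximate bit operations: MACs x weight bits x activation bits."""
    macs = conv_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w)
    return macs * weight_bits * act_bits


# Hypothetical 3x3 layer on a 32x32 output map.
full = conv_bops(3, 16, 3, 3, 32, 32, weight_bits=8, act_bits=8)
# Prune half of the output channels and quantize to 4 bits.
small = conv_bops(3, 8, 3, 3, 32, 32, weight_bits=4, act_bits=4)
print(full, small, full // small)  # 28311552 3538944 8
```

Under this approximation, halving the output channels and halving both bit-widths reduces the layer's BOPs by a factor of eight, which illustrates why pruning and quantization can be traded against each other under a single constraint.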

Two-pronged Strategy: Lightweight Augmented Graph Network Hashing for Scalable Image Retrieval

  • Hui Cui
  • Lei Zhu
  • Jingjing Li
  • Zhiyong Cheng
  • Zheng Zhang

Hashing learns compact binary codes to store and retrieve massive data efficiently.
Particularly, unsupervised deep hashing is supported by powerful deep neural networks
and has the desirable advantage of label independence. It is a promising technique
for scalable image retrieval. However, deep models introduce a large number of parameters,
which are hard to optimize due to the lack of explicit semantic labels and bring considerable
training cost. As a result, the retrieval accuracy and training efficiency of existing
unsupervised deep hashing are still limited. To tackle these problems, in this paper,
we propose a simple and efficient Lightweight Augmented Graph Network Hashing (LAGNH)
method with a two-pronged strategy. For one thing, we extract the inner structure
of the image as the auxiliary semantics to enhance the semantic supervision of the
unsupervised hash learning process. For another, we design a lightweight network structure
with the assistance of the auxiliary semantics, which greatly reduces the number of
network parameters that need to be optimized and thus greatly accelerates the training
process. Specifically, we design a cross-modal attention module based on the auxiliary
semantic information to adaptively mitigate the adverse effects in the deep image
features. Besides, the hash codes are learned by multi-layer message passing within
an adversarial regularized graph convolutional network. Simultaneously, the semantic
representation capability of hash codes is further enhanced by reconstructing the
similarity graph. Experimental results show that our method achieves significant performance
improvement compared with the state-of-the-art unsupervised deep hashing methods in
terms of both retrieval accuracy and efficiency. Notably, on the MS-COCO dataset, our
method achieves more than 10% improvement in retrieval precision and a 2.7x speedup
in training time compared with the second-best result.
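
For readers unfamiliar with hashing-based retrieval, the sketch below shows the generic final step that compact binary codes enable: real-valued outputs are binarized by sign and database items are ranked by Hamming distance to the query. It does not reproduce LAGNH; the function names and toy vectors are assumptions for illustration.

```python
def binarize(vector):
    """Turn a real-valued vector into a binary code by thresholding at zero."""
    return tuple(1 if x >= 0 else 0 for x in vector)


def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )


# Toy real-valued network outputs for three database images.
database = [binarize(v) for v in (
    [0.5, -1.0, 2.0, -0.1],
    [-0.5, 1.0, -2.0, 0.1],
    [0.5, -1.0, -2.0, -0.1],
)]
query = binarize([1.0, -1.0, 1.0, -1.0])
print(rank_by_hamming(query, database))  # [0, 2, 1]
```

Because the comparison reduces to counting differing bits, storage and lookup stay cheap even for large image collections.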

SESSION: Interactive Arts

Reconstruction: A Motion Driven Interactive Artwork Inspired by Chinese Shadow Puppet

  • Wenli Jiang
  • Chong Cao

Shadow puppet play is a representative Chinese intangible cultural heritage, which
has a history of more than two thousand years. However, with the popularity of digital
media, this traditional art form has fallen into decline. "Reconstruction" is an interactive
digital artwork inspired by the production and performance of Chinese shadow puppetry.
The scenes and characters are designed based on the art style of shadow puppetry. The
participant's motion is captured with a Kinect and used to control the motion of the

Syntropic Counterpoints: Metaphysics of The Machines

  • Predrag K. Nikolic
  • Ruiyang Liu
  • Shengcheng Luo

In the artwork Syntropic Counterpoints: Metaphysics of The Machines, we explore
phenomena of AI aesthetics and challenge machine abstraction. Our approach toward the
liberation of machine creativity is through the use of words and grammar as a creative
tool humans developed to express worlds "beyond" the world, existing and non-existing
realities. We are led by Nietzsche's claim that grammar is the "Metaphysics of the
People"; as such, the grammar, content, and vision generated during the philosophical discussion
between our AI clones is the "Metaphysics of Machines", through which we can experience their
realities and start to question our own.

Kandinsky Mobile: Abstract Art-Inspired Interactive Visualization of Social Discussions on Mobile Devices

  • Castillo Clarence Fitzgerald Gumtang
  • Sourav S. Bhowmick

Kandinsky Mobile is a mobile device-based interactive artwork that generates and displays
the social discussion landscape associated with a social media anchor post using a
collection of colorful circles and concentric circles. It draws inspiration from the
famous abstract geometric art forms of Russian painter Wassily Kandinsky (1866-1944).
Intuitively, a circle and a concentric circle represent a social comment and a collection
of comments in a discussion thread, respectively. The artwork aims to facilitate user-friendly
and effective understanding and visualization of large volumes of comments associated
with an anchor post.

Sand Scope: An Interactive Installation for Revealing the Connection Between Mental Space and
Life Space in a Microcosm of the World

  • Lyn Chao-ling Chen

This artwork addresses the topic of life space. Instead of physical space,
it considers the mental space of human beings. People usually focus on themselves to
solve various life tasks, and the scale of their mental space influences how they
understand the world. The artwork aims to make people aware of the connection between
mental space and life space. The Sand Scope introduces a microcosm of the world, for
comparing the scale of mental space with the scale of the microcosm, based on the relative
scales of the microcosm and the whole world. From this new perspective, the Sand
Scope reminds people to escape from the routine of their daily lives and rethink
the meaning of life. The multimedia input comprises gray image analysis to form the stamps
with portraits of current audiences and past participants, color image subtraction
to compose the texture of the mountain drawing with clothing information on the painting,
and a timed buffer to capture and replay ambient sounds continuously with a
delay, along with color images with a blending effect over the same period. In the interactive
installation, an improvisational painting in the form of a Chinese brush painting
with stamps from connoisseurs is exhibited. The generation of stamps from the audiences
on the painting also indicates that they are parts of the microcosm. In the physical
aspect, the microcosm is constructed from elements of the inhabitants who live in the
real world; in the mental aspect, the awareness of the meaning of life implies harmony
between nature and humanity in the Zen painting.

Heraclitus's Forest: An Interactive Artwork for Oral History

  • Lin Wang
  • Zhonghao Lin
  • Wei Cai

Heraclitus's Forest is an interactive artwork that utilizes birch trees as a metaphor
for the life stories recorded in an oral history database. We design a day/night cycle
system to present the forest experience as time elapses, multiple interaction
modes to engage audiences in history exploration, and an evolving forest
to prompt reflection on the nature of history, which is constantly being
constructed but can never be returned to.

Affective Color Fields: Reimagining Rothkoesque Artwork as an Interactive Companion for Artistic Self-Expression

  • Aiden Kang
  • Liang Wang
  • Ziyu Zhou
  • Zhe Huang
  • Robert J.K. Jacob

In this art project, we create Affective Color Fields: an interactive artifact that
takes in a user's narrative of their emotional experiences and dynamically transforms
it into Rothkoesque color fields through emotion classification. Inspired by Mark
Rothko's abstract depiction of human emotions and Merleau-Ponty's phenomenological
inquiry, we wish to establish an intimate relationship between interactive art and
the subject by employing the user's own interpretation and framing of life events. Through
the performative and improvisational art-making process, users can playfully appropriate
our artifact for a rich and personal aesthetic experience.

Apercevoir: Bio Internet of Things Interactive System

  • You-Yang Hu
  • Chiao-Chi Chou
  • Chia-Wei Li

Apercevoir is an artwork that can perceive its environmental perturbations and convert
them into a spatial sound field with location information. It consists of multiple
plant cyborgs, each comprising a Mimosa pudica (sensitive plant) connected to a bioamplifier,
and can sense human movements by analyzing the biosignals with a machine learning
model. Through sharing multiple cyborgs' biosignals, this network portrays the concept
of multiple beings transcending an individual's physical confines to form a Bio Internet
of Things (IOT) system capable of perception, feedback, and group decision-making
within a wider scope. A particular feature of this system is its interactive bone
induction headphones, where the audience can listen to a sound field including 'vibrations'
of nearby human activities detected by plant cyborgs, and even warnings among the
cyborg network responding to foreign disturbance and damage. This sound field invites
audiences to close their eyes and listen attentively to plants while the biosignals
and changes in sound reveal the presence of other entities in the space.

SESSION: Poster Session 2

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

  • Zheng Wang
  • Jingjing Chen
  • Yu-Gang Jiang

Video moment retrieval aims to localize the most relevant video moment given the text
query. Weakly supervised approaches leverage video-text pairs only for training, without
temporal annotations. Most current methods align the proposed video moment and the
text in a joint embedding space. However, lacking temporal annotations, the semantic
gap between these two modalities leads most methods to focus predominantly on learning
joint feature representations, with less emphasis on learning visual feature representations. This
paper aims to improve the visual feature representation with supervision in the visual
domain, obtaining discriminative visual features for cross-modal learning. Our approach is based on
the observation that relevant video moments (i.e., those sharing similar activities) from
different videos are commonly described by similar sentences; hence, the visual features
of these relevant video moments should also be similar even though they come from
different videos. Therefore, to obtain more discriminative and robust visual features
for video moment retrieval, we propose to align the visual features of relevant video
moments from different videos that co-occur in the same training batch. Besides,
a contrastive learning approach is introduced for learning the moment-level alignment
of these videos. Through extensive experiments, we demonstrate that the proposed visual
co-occurrence alignment learning method outperforms the cross-modal alignment learning
counterpart and achieves promising results for video moment retrieval.

Adaptive Normalized Representation Learning for Generalizable Face Anti-Spoofing

  • ShuBao Liu
  • Ke-Yue Zhang
  • Taiping Yao
  • Mingwei Bi
  • Shouhong Ding
  • Jilin Li
  • Feiyue Huang
  • Lizhuang Ma

With various face presentation attacks arising under unseen scenarios, face anti-spoofing
(FAS) based on domain generalization (DG) has drawn growing attention due to its robustness.
Most existing methods utilize DG frameworks to align the features to seek a compact
and generalized feature space. However, little attention has been paid to the feature
extraction process for the FAS task, especially the influence of normalization, which
also has a great impact on the generalization of the learned representation. To address
this issue, we propose a novel perspective of face anti-spoofing that focuses on the
normalization selection in the feature extraction process. Concretely, an Adaptive
Normalized Representation Learning (ANRL) framework is devised, which adaptively selects
feature normalization methods according to the inputs, aiming to learn domain-agnostic
and discriminative representation. Moreover, to facilitate the representation learning,
Dual Calibration Constraints are designed, including Inter-Domain Compatible loss
and Inter-Class Separable loss, which provide a better optimization direction for
generalizable representation. Extensive experiments and visualizations are presented
to demonstrate the effectiveness of our method against the SOTA competitors.

Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

  • Haozhe Wu
  • Jia Jia
  • Haoyu Wang
  • Yishun Dou
  • Chao Duan
  • Qingshan Deng

People talk with diversified styles. For one piece of speech, different talking styles
exhibit significant differences in facial and head pose movements. For example,
the "excited" style usually talks with the mouth wide open, while the "solemn" style
is more standardized and seldom exhibits exaggerated motions. Due to such large differences
between styles, it is necessary to incorporate the talking style into the audio-driven
talking face synthesis framework. In this paper, we propose to inject style into the
talking face synthesis framework by imitating the arbitrary talking style of a
particular reference video. Specifically, we systematically investigate talking styles
with our collected Ted-HD dataset and construct style codes as several statistics
of 3D morphable model (3DMM) parameters. Afterwards, we devise a latent-style-fusion
(LSF) model to synthesize stylized talking faces by imitating talking styles from
the style codes. We emphasize the following novel characteristics of our framework:
(1) It does not require any annotation of the style; the talking style is learned in
an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary
styles from arbitrary videos, and the style codes can also be interpolated to generate
new styles. Extensive experiments demonstrate that the proposed framework has the
ability to synthesize more natural and expressive talking styles compared with baseline

Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification

  • Zhongxing Ma
  • Yifan Zhao
  • Jia Li

Person Re-Identification (Re-Id) in occlusion scenarios is a challenging problem because
a pedestrian can be partially occluded. The use of local information for feature extraction
and matching is still necessary. Therefore, we propose a Pose-guided inter- and intra-part
relational transformer (Pirt) for occluded person Re-Id, which builds part-aware long-term
correlations by introducing a transformer. In our framework, we first develop a pose-guided
feature extraction module with regional grouping and mask construction for robust
feature representations. The positions of a pedestrian in the image under surveillance
scenarios are relatively fixed; hence, we propose intra-part and inter-part relational
transformers. The intra-part module creates local relations with mask-guided features,
while the inter-part relationship builds correlations with transformers, to develop
cross relationships between part nodes. With the collaborative learning of inter- and
intra-part relationships, experiments reveal that our proposed Pirt model achieves
a new state of the art on the public occluded dataset, and further extensions on standard
non-occluded person Re-Id datasets also show comparable performance.

VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation
and Adaptation

  • Jiong Wang
  • Zhou Zhao
  • Weike Jin
  • Xinyu Duan
  • Zhen Lei
  • Baoxing Huai
  • Yiling Wu
  • Xiaofei He

For face presentation attack detection (PAD), most of the spoofing cues are subtle,
local image patterns (e.g., local image distortion, 3D mask edge and cut photo edges).
The representations of existing PAD works, which use simple global pooling, however,
lose local feature discriminability. In this paper, the VLAD aggregation method
is adopted to quantize local features with a visual vocabulary that locally partitions
the feature space, and hence preserves the local discriminability. We further propose
the vocabulary separation and adaptation method to modify VLAD for the cross-domain PAD
task. The proposed vocabulary separation method divides the vocabulary into domain-shared
and domain-specific visual words to cope with the diversity of live and attack faces
under the cross-domain scenario. The proposed vocabulary adaptation method imitates
the maximization step of the k-means algorithm in the end-to-end training, which ensures
that the visual words are close to the centers of their assigned local features and thus brings
robust similarity measurement. We give illustrations and extensive experiments to
demonstrate the effectiveness of VLAD with the proposed vocabulary separation and
adaptation method on standard cross-domain PAD benchmarks. The codes are available
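
As a rough illustration of the two ingredients named in this abstract, the sketch below shows plain VLAD aggregation with hard assignment, followed by an ordinary k-means maximization step that moves each visual word to the mean of its assigned features. It is not the authors' end-to-end formulation: the function names, the hard nearest-word assignment, and the global L2 normalisation are assumptions made for the example.

```python
import math

def _sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def _nearest_word(x, vocabulary):
    """Index of the visual word closest to local feature x (hard assignment)."""
    return min(range(len(vocabulary)), key=lambda i: _sq_dist(x, vocabulary[i]))

def vlad(features, vocabulary):
    """Sum residuals (feature - word) per visual word, flatten, L2-normalise."""
    k, d = len(vocabulary), len(vocabulary[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in features:
        i = _nearest_word(x, vocabulary)
        for j in range(d):
            residuals[i][j] += x[j] - vocabulary[i][j]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

def kmeans_m_step(features, vocabulary):
    """Plain k-means M-step: each word becomes the mean of its assigned features.
    Words with no assigned features are left where they are."""
    k, d = len(vocabulary), len(vocabulary[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x in features:
        i = _nearest_word(x, vocabulary)
        counts[i] += 1
        for j in range(d):
            sums[i][j] += x[j]
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(vocabulary[i])
        for i in range(k)
    ]
```

Because the descriptor keeps one residual block per visual word, local patterns assigned to different words stay separated instead of being averaged together as in global pooling.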

End-to-End Video Object Detection with Spatial-Temporal Transformers

  • Lu He
  • Qianyu Zhou
  • Xiangtai Li
  • Li Niu
  • Guangliang Cheng
  • Xiao Li
  • Wenxuan Liu
  • Yunhai Tong
  • Lizhuang Ma
  • Liqing Zhang

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many
hand-designed components in object detection while demonstrating performance as good
as previous complex hand-crafted detectors. However, their performance on Video Object
Detection (VOD) has not been well explored. In this paper, we present TransVOD, an
end-to-end video object detection model based on a spatial-temporal Transformer architecture.
The goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g., optical flow,
and relation networks. Besides, benefiting from the object query
design in DETR, our method does not need complicated post-processing methods such
as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular,
we present a temporal Transformer to aggregate both the spatial object queries and the
feature memories of each frame. Our temporal Transformer consists of three components:
Temporal Deformable Transformer Encoder (TDTE) to encode the multi-frame spatial
details, Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable
Transformer Decoder (TDTD) to obtain current frame detection results. These designs
boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the
ImageNet VID dataset. TransVOD yields comparable performance on the benchmark
of ImageNet VID. We hope our TransVOD can provide a new perspective for video object

Joint-teaching: Learning to Refine Knowledge for Resource-constrained Unsupervised
Cross-modal Retrieval

  • Peng-Fei Zhang
  • Jiasheng Duan
  • Zi Huang
  • Hongzhi Yin

Cross-modal retrieval has received considerable attention because it enables
users to search for desired information in diverse forms. Existing retrieval
methods achieve good performance mainly by relying on complex deep neural networks and
high-quality supervision signals, which hinders their development and deployment in
real-world resource-constrained settings. In this paper, we propose an effective unsupervised learning
framework named JOint-teachinG (JOG) to pursue a high-performance yet light-weight
cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained
model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student")
with strong feature learning ability and predictive power. Considering that a teacher
model serving the same task as the student is not always available, we resort to a
cross-task teacher to leverage transferrable knowledge to guide student learning.
To eliminate the inevitable noise in the distilled knowledge resulting from the task
discrepancy, an online knowledge-refinement strategy is designed to progressively
improve the quality of the cross-task knowledge in a joint-teaching manner, where
a peer student is engaged. In addition, the proposed JOG learns to represent the original
high-dimensional data with compact binary codes to accelerate the query processing,
further facilitating resource-limited retrieval. Through extensive experiments, we
demonstrate that in various network structures, the proposed method can yield promising
learning results on widely-used benchmarks. The proposed research is a pioneering
work for resource-constrained cross-modal retrieval, which has strong potential to
be applied to on-device deployment and is hoped to pave the way for further study.
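
To illustrate why compact binary codes speed up query processing, the sketch below packs a real-valued embedding into an integer by taking signs and ranks database items by Hamming distance, which reduces to an XOR and a bit count. Sign thresholding and all function names here are assumptions for the example; the abstract does not specify how JOG learns its codes.

```python
def binarize(vector):
    """Pack the signs of a real-valued vector into an int (bit i is 1 if vector[i] > 0)."""
    code = 0
    for i, v in enumerate(vector):
        if v > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query_code, database_codes):
    """Indices of database items sorted from nearest to farthest in Hamming distance."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
```

In a cross-modal setting the query code would come from one modality (for example text) and the database codes from another (for example images), with both mapped into the same Hamming space.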

AggNet for Self-supervised Monocular Depth Estimation: Go An Aggressive Step Further

  • Zhi Chen
  • Xiaoqing Ye
  • Liang Du
  • Wei Yang
  • Liusheng Huang
  • Xiao Tan
  • Zhenbo Shi
  • Fumin Shen
  • Errui Ding

Without appealing to exhaustive labeled data, self-supervised monocular depth estimation
(MDE) plays a fundamental role in computer vision. Previous methods usually adopt
a one-stage MDE network, which is insufficient to achieve high performance. In this
paper, we dig deep into this task to propose an aggressive framework termed AggNet.
The framework is based on a training-only progressive two-stage module to perform
pseudo counter-surveillance as well as a simple yet effective dual-warp loss function
between image pairs. In particular, we first propose a residual module, which follows
the MDE network to learn a refined depth. The residual module takes both the initial
depth generated from MDE and the initial color image as input to generate refined
depth with residual depth learning. Then, the refined depth is leveraged to supervise
the initial depth simultaneously during the training period. For inference, only the
MDE network is retained to regress depth from a single image, which gains better performance
without introducing extra computation. In addition to self-distillation loss, a simple
yet effective dual-warp consistency loss is introduced to encourage the MDE network
to keep depth consistency between stereo image pairs. Extensive experiments show that
our AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.

Boosting Lightweight Single Image Super-resolution via Joint-distillation

  • Xiaotong Luo
  • Qiuyuan Liang
  • Ding Liu
  • Yanyun Qu

The rise of deep learning has facilitated the development of single image super-resolution
(SISR). However, growing model complexity and memory occupation severely
hinder its practical deployment on resource-limited devices. In this paper, we propose
a novel joint-distillation (JDSR) framework to boost the representation of various
off-the-shelf lightweight SR models. The framework includes two stages: the superior
LR generation and the joint-distillation learning. The superior LR is obtained from
the HR image itself. With fewer than 300K parameters, the peer network using the superior
LR as input can achieve SR performance comparable to that of large models, e.g., RCAN with
15M parameters, which makes the superior LR suitable as the peer network's input and saves
training expense. The joint-distillation learning consists of internal self-distillation and
external mutual learning. The internal self-distillation aims to achieve model self-boosting
by transferring the knowledge from the deeper SR output to the shallower one. Specifically,
each intermediate SR output is supervised by the HR image and the soft label from
subsequent deeper outputs. To shrink the capacity gap between shallow and deep layers,
a soft label generator is designed in a progressive backward fusion way with meta-learning
for adaptive weight fine-tuning. The external mutual learning focuses on obtaining
interaction information from a peer network in the process. Moreover, a curriculum
learning strategy and a performance gap threshold are introduced for balancing the
convergence rate of the original SR model and its peer network. Comprehensive experiments
on benchmark datasets demonstrate that our proposal improves the performance of recent
lightweight SR models by a large margin, with the same model architecture and inference

Discriminator-free Generative Adversarial Attack

  • Shaohao Lu
  • Yuqiao Xian
  • Ke Yan
  • Yi Hu
  • Xing Sun
  • Xiaowei Guo
  • Feiyue Huang
  • Wei-Shi Zheng

Deep Neural Networks (DNNs) are vulnerable to adversarial examples (Figure 1): adding
inconspicuous perturbations to the images can make DNN-based systems collapse.
Most existing adversarial attack methods are gradient-based and suffer
from high latency and a heavy load on GPU memory. Generative adversarial
attacks can get rid of this limitation, and some related works propose approaches
based on GANs. However, because training a GAN is difficult to converge,
the resulting adversarial examples have either poor attack ability or poor visual quality.
In this work, we find that the discriminator may not be necessary for generative-based
adversarial attack, and propose the Symmetric Saliency-based Auto-Encoder (SSAE) to
generate the perturbations, which is composed of the saliency map module and the angle-norm
disentanglement of the features module. The advantage of our proposed method is
that it does not depend on a discriminator and uses the generative saliency map
to pay more attention to label-relevant regions. Extensive experiments across
various tasks, datasets, and models demonstrate that the adversarial examples generated
by SSAE not only make widely used models collapse, but also achieve good visual
quality. The code is available at:

Former-DFER: Dynamic Facial Expression Recognition Transformer

  • Zengqun Zhao
  • Qingshan Liu

This paper proposes a dynamic facial expression recognition transformer (Former-DFER)
for the in-the-wild scenario. Specifically, the proposed Former-DFER mainly consists
of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former).
The CS-Former consists of five convolution blocks and N spatial encoders, which is
designed to guide the network to learn occlusion- and pose-robust facial features from
the spatial perspective. The temporal transformer consists of M temporal encoders,
which are designed to allow the network to learn contextual facial features from the
temporal perspective. The heatmaps of the learned facial features demonstrate that
the proposed Former-DFER is capable of handling issues such as occlusion, non-frontal
pose, and head motion. And the visualization of the feature distribution shows that
the proposed method can learn more discriminative facial features. Moreover, our Former-DFER
also achieves state-of-the-art results on the DFEW and AFEW benchmarks.

Discovering Density-Preserving Latent Space Walks in GANs for Semantic Image Transformations

  • Guanyue Li
  • Yi Liu
  • Xiwen Wei
  • Yang Zhang
  • Si Wu
  • Yong Xu
  • Hau-San Wong

Generative adversarial network (GAN)-based models possess superior capability of high-fidelity
image synthesis. There is a wide range of semantically meaningful directions in the
latent representation space of well-trained GANs, and the corresponding latent space
walks are meaningful for semantic controllability in the synthesized images. To explore
the underlying organization of a latent space, we propose an unsupervised Density-Preserving
Latent Semantics Exploration model (DP-LaSE). The important latent directions are
determined by maximizing the variations in intermediate features, while the correlation
between the directions is minimized. Considering that latent codes are sampled from
a prior distribution, we adopt a density-preserving regularization approach to ensure
latent space walks are maintained in iso-density regions, since moving to a higher/lower
density region tends to cause unexpected transformations. To further refine semantics-specific
transformations, we perform subspace learning over intermediate feature channels,
such that the transformations are limited to the most relevant subspaces. Extensive
experiments on a variety of benchmark datasets demonstrate that DP-LaSE is able to
discover interpretable latent space walks, and specific properties of synthesized
images can thus be precisely controlled.

MGH: Metadata Guided Hypergraph Modeling for Unsupervised Person Re-identification

  • Yiming Wu
  • Xintian Wu
  • Xi Li
  • Jian Tian

As a challenging task, unsupervised person ReID aims to match query images to the
same identity without requiring any labeled information. In general, most existing
approaches focus on the visual cues only, leaving potentially valuable auxiliary metadata
information (e.g., spatio-temporal context) unexplored. In the real world, such metadata
is normally available alongside captured images, and thus plays an important role
in separating several hard ReID matches. With this motivation in mind, we propose
MGH, a novel unsupervised person ReID approach that uses meta information to construct
a hypergraph for feature learning and label refinement. In principle, the hypergraph
is composed of camera-topology-aware hyperedges, which can model the heterogeneous
data correlations across cameras. Taking advantage of label propagation on the hypergraph,
the proposed approach is able to effectively refine the ReID results, such as correcting
the wrong labels or smoothing the noisy labels. Given the refined results, we further
present a memory-based listwise loss to directly optimize the average precision in
an approximate manner. Extensive experiments on three benchmarks demonstrate the effectiveness
of the proposed approach against the state-of-the-art.

Recovering the Unbiased Scene Graphs from the Biased Ones

  • Meng-Jiun Chiou
  • Henghui Ding
  • Hanshu Yan
  • Changhu Wang
  • Roger Zimmermann
  • Jiashi Feng

Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical
representations describing visual relationships among salient objects. Recently, more
effort has been devoted to the long tail problem in SGG; however, the imbalance in
the fraction of missing labels across classes, or reporting bias, which exacerbates
the long tail, is rarely considered and cannot be solved by the existing debiasing
methods. In this paper we show that, due to the missing labels, SGG can be viewed
as a "Learning from Positive and Unlabeled data" (PU learning) problem, where the
reporting bias can be removed by recovering the unbiased probabilities from the biased
ones by utilizing label frequencies, i.e., the per-class fraction of labeled, positive
examples in all the positive examples. To obtain accurate label frequency estimates,
we propose Dynamic Label Frequency Estimation (DLFE) to take advantage of training-time
data augmentation and average over multiple training iterations to introduce more
valid examples. Extensive experiments show that DLFE is more effective in estimating
label frequencies than a naive variant of the traditional estimate, and DLFE significantly
alleviates the long tail and achieves state-of-the-art debiasing performance on the
VG dataset. We also show qualitatively that SGG models with DLFE produce markedly
more balanced and unbiased scene graphs. The source code is publicly available.
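
The recovery step described in this abstract can be sketched under the standard PU-learning assumption that the biased (labeled) probability equals the label frequency times the unbiased probability, so the unbiased score is the biased score divided by the per-class label frequency. The predicate names and numbers below are invented for illustration, and the simple averaging estimator is a generic stand-in, not DLFE itself.

```python
def estimate_label_frequency(biased_probs_on_labeled_positives):
    """Generic estimate of a class's label frequency: the mean biased probability
    that the model assigns to examples known to be labeled positive for that class."""
    probs = list(biased_probs_on_labeled_positives)
    return sum(probs) / len(probs)

def recover_unbiased(biased_probs, label_freqs):
    """Divide each class's biased probability by its label frequency.
    The results are scores for ranking and may exceed 1 before any renormalisation."""
    return {cls: p / label_freqs[cls] for cls, p in biased_probs.items()}

# Hypothetical predicate scores for one subject-object pair.
biased = {"on": 0.6, "standing on": 0.2}
# Hypothetical label frequencies: "standing on" is annotated far less often.
freqs = {"on": 0.9, "standing on": 0.25}
unbiased = recover_unbiased(biased, freqs)
```

In this made-up case the rarely annotated predicate "standing on" overtakes the frequently annotated "on" after the correction, which is the kind of re-ranking that counteracts reporting bias.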

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

  • Fa-Ting Hong
  • Jia-Chang Feng
  • Dan Xu
  • Ying Shan
  • Wei-Shi Zheng

Weakly supervised temporal action localization (WS-TAL) is a challenging task that
aims to localize action instances in the given video with video-level categorical
supervision. Previous works directly use the appearance and motion features extracted from
a pre-trained feature encoder, e.g., via feature concatenation or score-level fusion.
In this work, we argue that pre-trained extractors, e.g.,
I3D, are trained for trimmed video action classification rather than specifically for
the WS-TAL task, so the extracted features suffer from inevitable redundancy and sub-optimization. Therefore,
feature re-calibration is needed to reduce the task-irrelevant information redundancy.
Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem.
In CO2-Net, we mainly introduce two identical cross-modal consensus modules
(CCM) that use a cross-modal attention mechanism to filter out the task-irrelevant
information redundancy using the global information from the main modality and the
cross-modal local information from the auxiliary modality. Moreover, we further explore
inter-modality consistency, where we treat the attention weights derived from each
CCM as the pseudo targets of the attention weights derived from the other CCM to maintain
the consistency between the predictions derived from the two CCMs, in a mutual learning
manner. Finally, we conduct extensive experiments on two commonly used temporal action
localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, on which we
achieve state-of-the-art results. The experimental results show that our proposed
cross-modal consensus module can produce more representative features for temporal
action localization.

Searching a Hierarchically Aggregated Fusion Architecture for Fast Multi-Modality
Image Fusion

  • Risheng Liu
  • Zhu Liu
  • Jinyuan Liu
  • Xin Fan

Multi-modality image fusion refers to generating a complementary image that integrates
typical characteristics from source images. In recent years, we have witnessed the
remarkable progress of deep learning models for multi-modality fusion. Existing CNN-based
approaches spare no effort in designing various architectures for realizing these
tasks in an end-to-end manner. However, these handcrafted designs are unable to cope
with highly demanding fusion tasks, resulting in blurred targets and lost textural
details. To alleviate these issues, in this paper, we propose a novel approach, aiming
at searching effective architectures according to various modality principles and
fusion mechanisms.

Specifically, we construct a hierarchically aggregated fusion architecture to extract
and refine fused features from feature-level and object-level fusion perspectives,
which is responsible for obtaining complementary target/detail representations. Then
by investigating diverse effective practices, we compose a more flexible fusion-specific
search space. Motivated by the collaborative principle, we employ a new search strategy
with different principled losses and hardware constraints for sufficient discovery
of components. As a result, we can obtain a task-specific architecture with fast inference
time. Extensive quantitative and qualitative results demonstrate the superiority and
versatility of our method against state-of-the-art methods.

SuperFront: From Low-resolution to High-resolution Frontal Face Synthesis

  • Yu Yin
  • Joseph P. Robinson
  • Songyao Jiang
  • Yue Bai
  • Can Qin
  • Yun Fu

Even the most impressive achievement in frontal face synthesis is challenged by large
poses and low-quality data given one single side-view face. We propose a synthesizer
called SuperFront GAN (SF-GAN) that accepts one or more low-resolution (LR) faces with
various poses as input and outputs a high-resolution (HR) frontal face while
preserving identity information. SF-GAN includes intra-class and inter-class
constraints, which allow it to learn an identity-preserving representation from multiple
LR faces in an improved, comprehensive manner. We adopt an orthogonal loss as the
intra-class constraint that diversifies the learned feature-space per subject. Hence,
each sample is made to complement the others as much as possible. Additionally, a triplet
loss is used as the inter-class constraint: it improves the discriminative power of
the new representation, which, hence, maintains the identity information. Furthermore,
we integrate a super-resolution (SR) side-view module as part of the SF-GAN to help
preserve the finer details of HR side-views. This helps the model reconstruct the
high-frequency parts of the face (i.e., the periocular, nose, and mouth regions).
Quantitative and qualitative results demonstrate the superiority of SF-GAN. SF-GAN
holds promise as a pre-processing step to normalize and align faces before passing
them to a CV system for processing.

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

  • Chen Jiang
  • Kaiming Huang
  • Sifeng He
  • Xudong Yang
  • Wei Zhang
  • Xiaobo Zhang
  • Yuan Cheng
  • Lei Yang
  • Qing Wang
  • Furong Xu
  • Tan Pan
  • Wei Chu

With the explosive growth of web videos in recent years, large-scale Content-Based
Video Retrieval (CBVR) has become increasingly essential in video filtering, recommendation,
and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end time
of similar segments in finer granularity, which is beneficial for user browsing efficiency
and infringement detection especially in long video scenarios. The challenge of S-CBVR
task is how to achieve high temporal alignment accuracy with efficient computation
and low storage consumption. In this paper, we propose a Segment Similarity and Alignment
Network (SSAN) to deal with this challenge; it is the first to be trained end-to-end for
S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) an efficient
Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, and
(2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. In
comparison with uniform frame extraction, SKE not only saves feature storage and search
time, but also achieves comparable accuracy with limited extra computation time.
In terms of temporal alignment, SPD localizes similar segments with higher accuracy
and efficiency than existing deep learning methods. Furthermore, we jointly train
SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key
modules SKE and SPD can also be effectively inserted into other video retrieval pipelines
and gain considerable performance improvements. Experimental results on public datasets
show that SSAN can obtain higher alignment accuracy while saving storage and online
query computational cost compared to existing methods.

Cut-Thumbnail: A Novel Data Augmentation for Convolutional Neural Network

  • Tianshu Xie
  • Xuan Cheng
  • Xiaomin Wang
  • Minghui Liu
  • Jiali Deng
  • Tao Zhou
  • Ming Liu

In this paper, we propose a novel data augmentation strategy named Cut-Thumbnail
that aims to improve the shape bias of the network. We reduce an image to a certain
size and replace a random region of the original image with the reduced image. The
generated image not only retains most of the original image information but also carries
global information in the reduced image. We call the reduced image a thumbnail. Furthermore,
we find that the idea of the thumbnail integrates well with Mixed Sample Data
Augmentation, so we put one image's thumbnail on another image while also mixing the ground truth
labels, which yields strong results on various computer vision tasks.
Extensive experiments show that Cut-Thumbnail works better than state-of-the-art augmentation
strategies across classification, fine-grained image classification, and object detection.
On ImageNet classification, ResNet-50 architecture with our method achieves 79.21%
accuracy, which is more than 2.8% improvement on the baseline.
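
The basic operation described above (shrink the image, then paste the shrunken copy over a random region of the original) can be sketched as follows. The image is modelled as a 2-D list of pixel values, the resizing is nearest-neighbour, and the function names and thumbnail size are assumptions for the example; the abstract does not state which resizing method or sizes the authors use.

```python
import random

def make_thumbnail(image, th, tw):
    """Nearest-neighbour downsample of a 2-D grid (list of rows) to th x tw."""
    h, w = len(image), len(image[0])
    return [
        [image[(i * h) // th][(j * w) // tw] for j in range(tw)]
        for i in range(th)
    ]

def cut_thumbnail(image, th, tw, rng=random):
    """Return a copy of image with a th x tw thumbnail of itself pasted at a
    uniformly random position, plus the (top, left) corner that was chosen."""
    h, w = len(image), len(image[0])
    thumb = make_thumbnail(image, th, tw)
    top = rng.randint(0, h - th)    # inclusive bounds keep the patch inside
    left = rng.randint(0, w - tw)
    out = [row[:] for row in image]  # leave the input untouched
    for i in range(th):
        out[top + i][left:left + tw] = thumb[i]
    return out, (top, left)
```

For the mixed-sample variant, the thumbnail would be taken from a second image and the two labels combined; the mixing ratio is not given in the abstract, so it is left out here.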

Diffusing the Liveness Cues for Face Anti-spoofing

  • Sheng Li
  • Xun Zhu
  • Guorui Feng
  • Xinpeng Zhang
  • Zhenxing Qian

Face anti-spoofing is an important step for secure face recognition. One of the main
challenges is how to learn and build a general classifier that is able to resist various
presentation attacks. Recently, patch-based face anti-spoofing schemes have been shown
to improve the robustness of the classifier. These schemes extract subtle
liveness cues from small local patches independently and thus do not fully exploit the
correlations among the patches. In this paper, we propose a Patch-based Compact Graph
Network (PCGN) to diffuse the subtle liveness cues from all the patches. Firstly,
the image is encoded into a compact graph by connecting each node with its backward
neighbors. We then propose an asymmetrical updating strategy to update the compact
graph. Such a strategy aggregates the node based on whether it is a sender or receiver,
which leads to better message-passing. The updated graph is eventually decoded for
making the final decision. We conduct the experiments on four public databases with
four intra-database protocols and eight cross-database protocols, the results of which
demonstrate the effectiveness of our PCGN for face anti-spoofing.

Co-Transport for Class-Incremental Learning

  • Da-Wei Zhou
  • Han-Jia Ye
  • De-Chuan Zhan

Traditional learning systems are trained in a closed world for a fixed number of classes
and need datasets collected in advance. However, new classes often emerge in real-world
applications and should be learned incrementally. For example, in electronic commerce,
new types of products appear daily, and in a social media community, new topics emerge
frequently. Under such circumstances, incremental models should learn several new
classes at a time without forgetting. We find a strong correlation between old and
new classes in incremental learning, which can be applied to relate and facilitate
different learning stages mutually. As a result, we propose CO-transport for class
Incremental Learning (COIL), which learns to relate across incremental tasks with
the class-wise semantic relationship. In detail, co-transport has two aspects: prospective
transport augments the old classifier with optimally transported knowledge for
fast model adaptation, while retrospective transport transports new class classifiers
backward as old ones to overcome forgetting. With these transports, COIL efficiently
adapts to new tasks, and stably resists forgetting. Experiments on benchmark and real-world
multimedia datasets validate the effectiveness of our proposed method.

Skeleton-Contrastive 3D Action Representation Learning

  • Fida Mohammad Thoker
  • Hazel Doughty
  • Cees G. M. Snoek

This paper strives for self-supervised learning of a feature space suitable for skeleton-based
action recognition. Our proposal is built upon learning invariances to input skeleton
representations and various skeleton augmentations via a noise contrastive estimation.
In particular, we propose inter-skeleton contrastive learning, which learns from multiple
different input skeleton representations in a cross-contrastive manner. In addition,
we contribute several skeleton-specific spatial and temporal augmentations which further
encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning
similarities between different skeleton representations as well as augmented views
of the same sequence, the network is encouraged to learn higher-level semantics of
the skeleton data than when only using the augmented views. Our approach achieves
state-of-the-art performance for self-supervised learning from skeleton data on the
challenging PKU and NTU datasets with multiple downstream tasks, including action
recognition, action retrieval and semi-supervised learning. Code is available at
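
For readers unfamiliar with noise contrastive estimation in this setting, the sketch below computes an InfoNCE-style loss for one query embedding, one positive (for example, the same sequence in a different skeleton representation or augmentation), and a set of negatives. Cosine similarity, the temperature value, and the function names are assumptions for the example and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among positive + negatives,
    where each candidate is scored by cosine similarity divided by temperature."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the query is closer to its positive than to every negative and grows as a negative becomes more similar than the positive.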

Fast-forwarding, Rewinding, and Path Exploration in Interactive Branched Video Streaming

  • Albin Vogel
  • Erik Kronberg
  • Niklas Carlsson

With interactive branched video, the storyline is typically determined by branch choices
made by the user during playback. Although this puts users in control of their viewing
experiences, prior work has not considered how to best help users who may want to
quickly navigate, explore, or skip parts of the branched video. Such functionalities
are important for both impatient users and those rewatching the video. To address
this void, we present the design, implementation and evaluation of interface solutions
that help users effectively navigate the video and identify and explore previously
unviewed storylines. Our solutions work with large, general video structures and allow
users to effectively forward/rewind the branched structures. Our user study demonstrates
the added value of our novel designs, presents promising tradeoffs, provides insights
into the pros/cons of different design alternatives, and highlights the features that
best address specific tasks and design aspects.

Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

  • Yunzhong Hou
  • Liang Zheng

Multiview detection incorporates multiple camera views to deal with occlusions, and
its central problem is multiview aggregation. Given feature map projections from multiple
views onto a common ground plane, the state-of-the-art method addresses this problem
via convolution, which applies the same calculation regardless of object locations.
However, such translation-invariant behaviors might not be the best choice, as object
features undergo various projection distortions according to their positions and cameras.
In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly
introduced shadow transformer to aggregate multiview information. Unlike convolutions,
shadow transformer attends differently at different positions and cameras to deal
with various shadow-like distortions. We propose an effective training scheme that
includes a new view-coherent data augmentation method, which applies random augmentations
while maintaining multiview consistency. On two multiview detection benchmarks, we
report new state-of-the-art accuracy with the proposed system. Code is available at

Domain Generalization via Feature Variation Decorrelation

  • Chang Liu
  • Lichen Wang
  • Kai Li
  • Yun Fu

Domain generalization aims to learn a model that generalizes to unseen target domains
from multiple source domains. Various approaches have been proposed to address this
problem by adversarial learning, meta-learning, and data augmentation. However, those
methods have no guarantee for target domain generalization. Motivated by an observation
that the class-irrelevant information of a sample in the form of semantic variation
would lead to negative transfer, we propose to linearly disentangle the variation
out of the sample in feature space and impose a novel class decorrelation regularization
on the feature variation. By doing so, the model would focus on the high-level categorical
concept for model prediction while ignoring the misleading clue from other variations
(including domain changes). As a result, we achieve state-of-the-art performance
on all of the widely used domain generalization benchmarks, namely PACS, VLCS, Office-Home,
and Digits-DG, by large margins. Further analysis reveals our method could learn
a better domain-invariant representation, and decorrelated feature variation could
successfully capture semantic meaning.

Occlusion-aware Bi-directional Guided Network for Light Field Salient Object Detection

  • Dong Jing
  • Shuo Zhang
  • Runmin Cong
  • Youfang Lin

Existing light field based works utilize either views or focal stacks for saliency
detection. However, since depth information exists implicitly in adjacent views or
different focal slices, it is difficult to exploit scene depth information from both.
By comparison, Epipolar Plane Images (EPIs) provide explicit accurate scene depth
and occlusion information by projected pixel lines. Due to the fact that the depth
of an object is often continuous, the distribution of occlusion edges concentrates
more on object boundaries compared with traditional color edges, which is more beneficial
for improving accuracy and completeness of saliency detection. In this paper, we propose
a learning-based network to exploit occlusion features from EPIs and integrate high-level
features from the central view for accurate salient object detection. Specifically,
a novel Occlusion Extraction Module is proposed to extract occlusion boundary features
from horizontal and vertical EPIs. In order to naturally combine occlusion features
in EPIs and high-level features in central view, we design a concise Bi-directional
Guiding Flow based on cascaded decoders. The flow leverages generated salient edge
predictions and salient object predictions to refine features in mutual encoding processes.
Experimental results demonstrate that our approach achieves state-of-the-art performance
in both segmentation accuracy and edge clarity.

One-Stage Visual Grounding via Semantic-Aware Feature Filter

  • Jiabo Ye
  • Xin Lin
  • Liang He
  • Dingbang Li
  • Qin Chen

Visual grounding has attracted much attention with the popularity of vision language.
Existing one-stage methods are far ahead of two-stage methods in speed. However, these
methods fuse the textual feature and visual feature map by simple concatenation, which
ignores the textual semantics and limits these models' ability in cross-modal understanding.
To overcome this weakness, we propose a semantic-aware framework that utilizes both
queries' structured knowledge and context-sensitive representations to filter the
visual feature maps to localize the referents more accurately. Our framework contains
an entity filter, an attribute filter, and a location filter. These three filters
are applied to the input visual feature map step by step, each according to the corresponding
aspect of the query. A grounding module further regresses the bounding boxes to localize
the referential object. Experiments on various commonly used datasets show that our
framework achieves a real-time inference speed and outperforms all state-of-the-art

Few-Shot Multi-Agent Perception

  • Chenyou Fan
  • Junjie Hu
  • Jianwei Huang

We study few-shot learning (FSL) under multi-agent scenarios, in which participating
agents only have local scarce labeled data and need to collaborate to predict query
data labels. Though each of the agents, such as drones and robots, has minimal communication
and computation capability, we aim at designing coordination schemes such that they
can collectively perceive the environment accurately and efficiently. We propose a
novel metric-based multi-agent FSL framework which has three main components: an efficient
communication mechanism that propagates compact and fine-grained query feature maps
from query agents to support agents; an asymmetric attention mechanism that computes
region-level attention weights between query and support feature maps; and a metric-learning
module which calculates the image-level relevance between query and support data quickly
and accurately. Through analysis and extensive numerical studies, we demonstrate that
our approach can save communication and computation costs and significantly improve
performance in both visual and acoustic perception tasks such as face identification,
semantic segmentation, and sound genre recognition.

SI3DP: Source Identification Challenges and Benchmark for Consumer-Level 3D Printer

  • Bo Seok Shim
  • Yoo Seung Shin
  • Seong Wook Park
  • Jong-Uk Hou

This paper lays the foundation for a new 3D content market by establishing a content
security framework using databases and benchmarks for in-depth research on source
identification of 3D printed objects. The proposed benchmark, the SI3DP dataset, offers
a more generalized multimedia forensic technique. Assuming that the source of a 3D
printed object can be identified from various invisible traces left by the printing
process, we obtain close-up images and full-object images of 252 printed objects from
18 different printing setups. We then propose a benchmark with five challenging tasks
such as device-level identification and scan-and-reprint detection using the provided
dataset. Our baseline shows that the printer type and its attributes can be identified
based on the microscopic difference of surface texture. Contrary to the conventional
belief that only microscopic views such as close-up images are useful for identifying
the printer model, we also achieve a certain level of performance from a relatively
macroscopic point of view. We then propose a multitask-multimodal architecture for
the device-level identification task to exploit rich knowledge from different image modalities
and tasks. The SI3DP dataset can promote future in-depth research studies related to
digital forensics and intellectual property protection.

Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

  • Wen Wang
  • Yang Cao
  • Jing Zhang
  • Fengxiang He
  • Zheng-Jun Zha
  • Yonggang Wen
  • Dacheng Tao

Detection transformers have recently shown promising object detection results and
attracted increasing attention. However, how to develop effective domain adaptation
techniques to improve their cross-domain performance remains unexplored and unclear.
In this paper, we delve into this topic and empirically find that direct feature distribution
alignment on the CNN backbone only brings limited improvements, as it does not guarantee
domain-invariant sequence features in the transformer for prediction. To address this
issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially
designed for the adaptation of detection transformers. Technically, SFA consists of
a domain query-based feature alignment (DQFA) module and a token-wise feature alignment
(TDA) module. In DQFA, a novel domain query is used to aggregate and align global
context from the token sequence of both domains. DQFA reduces the domain discrepancy
in global feature representations and object relations when deploying in the transformer
encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence
from both domains, which reduces the domain gaps in local and instance-level feature
representations in the transformer encoder and decoder, respectively. Besides, a novel
bipartite matching consistency loss is proposed to enhance the feature discriminability
for robust object detection. Experiments on three challenging benchmarks show that
SFA outperforms state-of-the-art domain adaptive object detection methods. Code has
been made available at:

Towards Realistic Visual Dubbing with Heterogeneous Sources

  • Tianyi Xie
  • Liucheng Liao
  • Cheng Bi
  • Benlai Tang
  • Xiang Yin
  • Jianfei Yang
  • Mingjie Wang
  • Jiali Yao
  • Yang Zhang
  • Zejun Ma

The task of few-shot visual dubbing focuses on synchronizing the lip movements with
arbitrary speech input for any talking head video. Although current approaches achieve
moderate improvements, they commonly require high-quality homologous video and audio
data sources, and thus fail to leverage heterogeneous data sufficiently.
In practice, it may be intractable to collect perfect homologous data in some
cases, for example, audio-corrupted or picture-blurry videos. To explore this kind
of data and support high-fidelity few-shot visual dubbing, in this paper, we
propose a simple yet efficient two-stage framework with a higher flexibility of mining
heterogeneous data. Specifically, our two-stage paradigm employs facial landmarks
as intermediate prior of latent representations and disentangles the lip movements
prediction from the core task of realistic talking head generation. By this means,
our method makes it possible to independently utilize the training corpus for two-stage
sub-networks using more available heterogeneous data easily acquired. Besides, thanks
to the disentanglement, our framework allows a further fine-tuning for a given talking
head, thereby leading to better speaker-identity preserving in the final synthesized
results. Moreover, the proposed method can also transfer appearance features from
others to the target speaker. Extensive experimental results demonstrate the superiority
of our proposed method in generating highly realistic videos synchronized with the
speech over the state-of-the-art.

Deep Self-Supervised t-SNE for Multi-modal Subspace Clustering

  • Qianqian Wang
  • Wei Xia
  • Zhiqiang Tao
  • Quanxue Gao
  • Xiaochun Cao

Existing multi-modal subspace clustering methods, aiming to exploit the correlation
information between different modalities, have achieved promising preliminary results.
However, these methods might be incapable of handling real problems with complex heterogeneous
structures between different modalities, since the large heterogeneous structure makes
it difficult to directly learn a discriminative shared self-representation for multi-modal
clustering. To tackle this problem, in this paper, we propose a deep Self-supervised
t-SNE method (StSNE) for multi-modal subspace clustering, which learns soft label
features by multi-modal encoders and utilizes the common label feature to supervise
the soft label feature of each modality by adversarial training and reconstruction networks.
Specifically, the proposed StSNE consists of four components: 1) multi-modal convolutional
encoders; 2) a self-supervised t-SNE module; 3) a self-expressive layer; 4) multi-modal
convolutional decoders. Multi-modal data are fed to encoders to obtain soft label
features, for which the self-supervised t-SNE module is added to make full use of
the label information among different modalities. Simultaneously, the latent representations
given by encoders are constrained by a self-expressive layer to capture the hierarchical
information of each modality, followed by decoders reconstructing the encoded features
to preserve the structure of the original data. Experimental results on several public
datasets demonstrate the superior clustering performance of the proposed method over
state-of-the-art methods.

Multimodal Video Summarization via Time-Aware Transformers

  • Xindi Shang
  • Zehuan Yuan
  • Anran Wang
  • Changhu Wang

With the growing number of videos on video sharing platforms, how to facilitate the
searching and browsing of user-generated video has attracted intense attention
from the multimedia community. To help people efficiently search and browse relevant videos,
summaries of videos become important. Prior works in multimodal video summarization
mainly explore visual and ASR tokens as two separate sources and struggle to fuse
the multimodal information for generating the summaries. However, the time information
inside videos is commonly ignored. In this paper, we find that it is important to
leverage the timestamps to accurately incorporate multimodal signals for the task.
We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive
attention mechanism. The attention mechanism can attend to the inputs differently based
on time difference to explore the time information inherent inside video more thoroughly.
As such, TAMT can fuse the different modalities better for summarizing the videos.
Experiments show that our proposed approach is effective and achieves state-of-the-art
performance on both the YouCookII and open-domain How2 datasets.
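
To make the idea of attending differently based on time difference concrete, the following
Python sketch adds a time-dependent bias to scaled dot-product attention. The abstract does
not give the actual form of TAMT's short-term order-sensitive attention, so the function
name `time_aware_attention`, the linear penalty, and the separate `decay_past` and
`decay_future` rates are illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_aware_attention(query, keys, values, q_time, k_times,
                         decay_past=0.1, decay_future=0.2):
    """Attention whose scores are penalized by the time gap to each key.

    Assumption: a linear penalty with different rates for keys that come
    before or after the query, which makes the weights order-sensitive.
    """
    dim = len(query)
    scores = []
    for key, k_time in zip(keys, k_times):
        # Standard scaled dot-product content score.
        content = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        delta = q_time - k_time  # positive when the key precedes the query
        rate = decay_past if delta >= 0 else decay_future
        scores.append(content - rate * abs(delta))
    weights = softmax(scores)
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return output, weights
```

With identical keys, a token that is closer in time to the query receives a larger weight,
and two tokens at the same distance are weighted differently depending on whether they
precede or follow the query.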

State-aware Video Procedural Captioning

  • Taichi Nishimura
  • Atsushi Hashimoto
  • Yoshitaka Ushiku
  • Hirotaka Kameko
  • Shinsuke Mori

Video procedural captioning (VPC), which generates procedural text from instructional
videos, is an essential task for scene understanding and real-world applications.
The main challenge of VPC is to describe how to manipulate materials accurately. This
paper focuses on this challenge by designing a new VPC task, generating a procedural
text from the clip sequence of an instructional video and material list. In this task,
the state of materials is sequentially changed by manipulations, yielding their state-aware
visual representations (e.g., eggs are transformed into cracked, stirred, then fried
forms). The essential difficulty is to convert such visual representations into textual
representations; that is, a model should track the material states after manipulations
to better associate the cross-modal relations. To achieve this, we propose a novel
VPC method, which modifies an existing textual simulator for tracking material states
as a visual simulator and incorporates it into a video captioning model. Our experimental
results show the effectiveness of the proposed method, which outperforms state-of-the-art
video captioning models. We further analyze the learned embedding of materials to
demonstrate that the simulators capture their state transition. The code and dataset
are available from

AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

  • Woosung Choi
  • Minseok Kim
  • Marco A. Martínez Ramírez
  • Jaehwa Chung
  • Soonyoung Jung

This paper proposes a neural network that performs audio transformations to user-specified
sources (e.g., vocals) of a given audio track according to a given description while
preserving other sources not mentioned in the description. Audio Manipulation on a
Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample
or frequency bin) is 'transparent'; it usually carries information from multiple sources,
in contrast to a pixel in an image. To address this challenging problem, we propose
AMSS-Net, which extracts latent sources and selectively manipulates them while preserving
irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks,
and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective
metrics and empirical verification.

Fully Functional Image Manipulation Using Scene Graphs in A Bounding-Box Free Way

  • Sitong Su
  • Lianli Gao
  • Junchen Zhu
  • Jie Shao
  • Jingkuan Song

Recently, performing semantic editing of an image by modifying a scene graph has been
proposed to support high-level image manipulation, and plays an important role for
image generation. However, existing methods are all based on bounding boxes, and they
suffer from the bounding box constraint. First, a bounding box often involves other
instances (e.g., objects or environments) which do not need to be modified, but existing
methods manipulate all the contents included in the bounding box. Secondly, prior
methods fail to support adding instances when the bounding box of the target instance
cannot be provided. To address the two issues above, we propose a novel bounding box
free approach, which consists of two parts: a Local Bounding Box Free (Local-BBox-Free)
Mask Generation and a Global Bounding Box Free (Global-BBox-Free) Instance Generation.
The first part relieves the model of reliance on bounding boxes by generating the
mask of the target instance to be manipulated without using the target instance bounding
box. This enables our method to be the first to support fully functional image manipulation
using scene graphs, including adding, removing, replacing and repositioning instances.
The second part is designed to synthesize the target instance directly from the generated
mask and then paste it back to the inpainted original image using the generated mask,
which preserves the unchanged part to the largest extent and precisely controls the
target instance generation. Extensive experiments on Visual Genome and COCO-Stuff
demonstrate that our model significantly surpasses the state-of-the-art both quantitatively
and qualitatively.

Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

  • Xi Zhang
  • Feifei Zhang
  • Changsheng Xu

Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs
to provide not only a correct answer, but also a rationale to justify the answer.
It is a challenging task due to the requirements of diverse visual content understanding,
abstract language comprehending, and complicated inter-modality relationship reasoning.
To solve the above challenges, previous methods either resort to holistic attention mechanisms
or explore transformer-based models with pre-training, which, however, cannot perform
comprehensive understanding and usually suffer from a heavy computing burden. In this
paper, we propose a novel multi-level counterfactual contrastive learning network
for VCR by jointly modeling the hierarchical visual contents and the inter-modality
relationships between the visual and linguistic domains. The proposed method enjoys
several merits. First, with sufficient instance-level, image-level, and semantic-level
contrastive learning, our model can extract discriminative features and perform comprehensive
understanding for the image and linguistic expressions. Second, taking advantage of
counterfactual thinking, we can generate informative factual and counterfactual samples
for contrastive learning, resulting in stronger perception ability of our model. Third,
an auxiliary contrast module is incorporated into our method to directly optimize
the answer prediction in VCR, which further facilitates the representation learning.
Extensive experiments on the VCR dataset demonstrate that our approach performs favorably
against state-of-the-art methods.

Data-Free Ensemble Knowledge Distillation for Privacy-conscious Multimedia Model Compression

  • Zhiwei Hao
  • Yong Luo
  • Han Hu
  • Jianping An
  • Yonggang Wen

Recent advances in deep learning bring impressive performance for multimedia applications.
Hence, compressing and deploying these applications on resource-limited edge devices
via model compression becomes attractive. Knowledge distillation (KD) is one of the
most popular model compression techniques. However, most well-behaved KD approaches
require the original dataset, which is usually unavailable due to privacy issues,
while existing data-free KD methods perform much worse than data-required counterparts.
In this paper, we analyze previous data-free KD methods from the data perspective
and point out that using a single pre-trained model limits the performance of these
approaches. We then propose a Data-Free Ensemble knowledge Distillation (DFED) framework,
which contains a student network, a generator network, and multiple pre-trained teacher
networks. During training, the student mimics behaviors of the ensemble of teachers
using samples synthesized by a generator, which aims to enlarge the prediction discrepancy
between the student and teachers. A moment matching loss term assists the generator
training by minimizing the distance between activations of synthesized samples and
real samples. We evaluate DFED on three popular image classification datasets. Results
demonstrate that our method achieves significant performance improvements compared
with previous works. We also design an ablation study to verify the effectiveness
of each component of the proposed framework.
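
The student-side objective described above can be illustrated with a small Python sketch:
the teachers' softmax outputs are averaged and the student is scored by its KL divergence
from that average. The abstract does not specify how DFED combines the teachers or which
divergence it uses, so the plain averaging, the KL loss, and the names `ensemble_probs`
and `distillation_loss` are assumptions for illustration only. The generator and the
moment matching term are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(teacher_logits):
    """Average the class probabilities of several teachers (assumed combination rule)."""
    probs = [softmax(logits) for logits in teacher_logits]
    count = len(probs)
    return [sum(p[c] for p in probs) / count for c in range(len(probs[0]))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits):
    """Discrepancy between the student and the teacher ensemble on one sample."""
    return kl_divergence(ensemble_probs(teacher_logits), softmax(student_logits))
```

In the setting the abstract describes, the student would minimize such a discrepancy on
synthesized samples while the generator is trained to enlarge it.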

SM-SGE: A Self-Supervised Multi-Scale Skeleton Graph Encoding Framework for Person

  • Haocong Rao
  • Xiping Hu
  • Jun Cheng
  • Bin Hu

Person re-identification via 3D skeletons is an emerging topic with great potential
in security-critical applications. Existing methods typically learn body and motion
features from the body-joint trajectory, whereas they lack a systematic way to model
body structure and underlying relations of body components beyond the scale of body
joints. In this paper, we for the first time propose a Self-supervised Multi-scale
Skeleton Graph Encoding (SM-SGE) framework that comprehensively models human body,
component relations, and skeleton dynamics from unlabeled skeleton graphs of various
scales to learn an effective skeleton representation for person Re-ID. Specifically,
we first devise multi-scale skeleton graphs with coarse-to-fine human body partitions,
which enables us to model body structure and skeleton dynamics at multiple levels.
Second, to mine inherent correlations between body components in skeletal motion,
we propose a multi-scale graph relation network to learn structural relations between
adjacent body-component nodes and collaborative relations among nodes of different
scales, so as to capture more discriminative skeleton graph features. Last, we propose
a novel multi-scale skeleton reconstruction mechanism to enable our framework to encode
skeleton dynamics and high-level semantics from unlabeled skeleton graphs, which encourages
learning a discriminative skeleton representation for person Re-ID. Extensive experiments
show that SM-SGE outperforms most state-of-the-art skeleton-based methods. We further
demonstrate its effectiveness on 3D skeleton data estimated from large-scale RGB videos.
Our codes are open at

Video Transformer for Deepfake Detection with Incremental Learning

  • Sohail Ahmed Khan
  • Hang Dai

Face forgery by deepfake is widely spread over the internet and this raises severe
societal concerns. In this paper, we propose a novel video transformer with incremental
learning for detecting deepfake videos. To better align the input face images, we
use a 3D face reconstruction method to generate UV texture from a single input face
image. The aligned face image can also provide pose, eye blink and mouth movement
information that cannot be perceived in the UV texture image, so we use both face
images and their UV texture maps to extract the image features. We present an incremental
learning strategy to fine-tune the proposed model on a smaller amount of data and
achieve better deepfake detection performance. Comprehensive experiments on various
public deepfake datasets demonstrate that the proposed video transformer model with
incremental learning achieves state-of-the-art performance in the deepfake video detection
task with enhanced feature learning from the sequenced data.

Chinese Character Inpainting with Contextual Semantic Constraints

  • Jiahao Wang
  • Gang Pan
  • Di Sun
  • Jiawan Zhang

Chinese character inpainting is a challenging task where large missing regions have
to be filled with both visually and semantically realistic content. Existing methods
generally produce pseudo or ambiguous characters due to a lack of semantic information.
Given the key observation that Chinese characters contain visual glyph representations
and intrinsic contextual semantics, we tackle the challenge of similar Chinese characters
by modeling the underlying regularities among glyph and semantic information. We propose
a semantics enhanced generative framework for Chinese character inpainting, where
a global semantic supervising module (GSSM) is introduced to constrain contextual
semantics. In particular, sentence embedding is used to guide the encoding of continuous
contextual characters. The method can not only generate realistic Chinese characters,
but also explicitly utilize context as reference during network training to eliminate
ambiguity. The proposed method is evaluated on both handwritten and printed Chinese
characters with various masks. The experiments show that the method successfully predicts
missing character information without any mask input, and achieves significant sentence-level
results benefiting from global semantic supervising in a wide variety of scenes.

Curriculum-Based Meta-learning

  • Ji Zhang
  • Jingkuan Song
  • Yazhou Yao
  • Lianli Gao

Meta-learning offers an effective solution to learn new concepts with scarce supervision
through an episodic training scheme: a series of target-like tasks sampled from base
classes are sequentially fed into a meta-learner to extract common knowledge across
tasks, which can facilitate the quick acquisition of task-specific knowledge of the
target task with few samples. Despite its noticeable improvements, the episodic training
strategy samples tasks randomly and uniformly, without considering their hardness
and quality, which may not progressively improve the meta-learner's generalization
ability. In this paper, we present a Curriculum-Based Meta-learning (CubMeta) method
to train the meta-learner using tasks from easy to hard. Specifically, CubMeta
proceeds progressively, and in each step, we design a module named BrotherNet
to establish harder tasks and an effective learning scheme for obtaining an ensemble
of stronger meta-learners. In this way, the meta-learner's generalization ability
can be progressively improved, and better performance can be obtained even with fewer
training tasks. We evaluate our method for few-shot classification on two benchmarks
- mini-ImageNet and tiered-ImageNet, where it achieves consistent performance improvements
on various meta-learning paradigms.
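
The easy-to-hard ordering can be sketched in Python as follows. This is only a generic
curriculum schedule: the `hardness` scoring function and the fixed number of stages are
hypothetical, and the sketch does not model BrotherNet, which the abstract says is used to
establish the harder tasks.

```python
def curriculum_stages(tasks, hardness, num_stages):
    """Split tasks into consecutive stages ordered from easy to hard.

    `hardness` is a caller-supplied function mapping a task to a number;
    how hardness is actually measured is an assumption left to the caller.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not tasks:
        return []
    ordered = sorted(tasks, key=hardness)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Usage with made-up hardness scores for four hypothetical tasks.
scores = {"a": 0.1, "b": 0.4, "c": 0.7, "d": 0.9}
stages = curriculum_stages(["c", "a", "d", "b"], scores.get, num_stages=2)
```

A meta-learner would then be trained on the stages in order, instead of on tasks sampled
randomly and uniformly.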

Ego-Deliver: A Large-Scale Dataset For Egocentric Video Analysis

  • Haonan Qiu
  • Pan He
  • Shuchun Liu
  • Weiyuan Shao
  • Feiyun Zhang
  • Jiajun Wang
  • Liang He
  • Feng Wang

The egocentric video provides a unique view of event participants to show their attention,
vision, and interaction with objects. In this paper, we introduce Ego-Deliver, a new
large-scale egocentric video benchmark recorded by takeaway riders about their daily
work. To the best of our knowledge, Ego-Deliver presents the first attempt in understanding
activities from the takeaway delivery process while being one of the largest egocentric
video action datasets to date. Our dataset provides a total of 5,360 videos with more
than 139,000 multi-track annotations and 45 different attributes, which we believe
is pivotal to future research in this area. We introduce the FS-Net architecture,
a new anchor-free action detection approach handling extreme variations of action
durations. We partition videos into fragments and build dynamic graphs over fragments,
where multi-fragment context information is aggregated to boost fragment classification.
A splicing and scoring module is applied to obtain final action proposals. Our experimental
evaluation confirms that the proposed framework outperforms existing approaches on
the proposed Ego-Deliver benchmark and is competitive on other popular benchmarks.
In our current version, Ego-Deliver is used to make a comprehensive comparison between
algorithms for activity detection. We also show its application to action recognition
with promising results. The dataset, toolkits and baseline results will be made available

Adversarial Pixel Masking: A Defense against Physical Attacks for Pre-trained Object Detectors

  • Ping-Han Chiang
  • Chi-Shen Chan
  • Shan-Hung Wu

Object detection based on pre-trained deep neural networks (DNNs) has achieved impressive
performance and enabled many applications. However, DNN-based object detectors are
shown to be vulnerable to physical adversarial attacks. Although recent efforts
have been made to defend against these attacks, they either rely on strong assumptions
or become less effective with pre-trained object detectors. In this paper, we propose
adversarial pixel masking (APM), a defense against physical attacks, which is designed
specifically for pre-trained object detectors. APM does not require any assumptions
beyond the "patch-like" nature of a physical attack and can work with different pre-trained
object detectors of different architectures and weights, making it a practical solution
in many applications. We conduct extensive experiments, and the empirical results
show that APM can significantly improve model robustness without significantly degrading
clean performance.

Knowledge-Supervised Learning: Knowledge Consensus Constraints for Person Re-Identification

  • Li Wang
  • Baoyu Fan
  • Zhenhua Guo
  • Yaqian Zhao
  • Runze Zhang
  • Rengang Li
  • Weifeng Gong
  • Endong Wang

The consensus of multiple views on the same data provides extra regularization,
thereby improving accuracy. Based on this idea, we propose a novel Knowledge-Supervised
Learning (KSL) method for person re-identification (Re-ID), which can improve
performance without introducing extra inference cost. First, we introduce an isomorphic
auxiliary training strategy to construct basic multiple views by simultaneously training
multiple classifier heads of the same network on the same training data. The consensus
constraints aim to maximize the agreement among multiple views. To introduce this
regularization constraint, we draw inspiration from knowledge distillation, in which paired
branches can be trained collaboratively through mutual imitation learning. Three novel constraint
losses are proposed to distill the knowledge that needs to be transferred across different
branches: similarity of predicted classification probability for cosine space constraints,
distance of embedding features for Euclidean space constraints, and hard sample mutual
mining for hard sample space constraints. From different perspectives, these losses
complement each other. Experiments on four mainstream Re-ID datasets show that a standard
model trained from scratch with the KSL method outperforms its ImageNet pre-trained counterpart
by a clear margin. With the KSL method, a lightweight model without ImageNet pre-training
outperforms most large models. We expect that these discoveries can shift some attention
from the current de facto paradigm of "pre-training and fine-tuning" in the Re-ID task
to knowledge discovery during model training.
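
Two of the three consensus constraints can be illustrated with a minimal Python sketch
that compares the outputs of two classifier heads on the same sample. The abstract names
the spaces (cosine and Euclidean) but not the exact loss formulas, so `1 - cosine
similarity` and the plain L2 distance below, along with both function names, are
assumptions. The hard sample mutual mining term is omitted.

```python
import math


def cosine_consensus_loss(probs_a, probs_b):
    """1 - cosine similarity between two heads' class-probability vectors."""
    dot = sum(a * b for a, b in zip(probs_a, probs_b))
    norm_a = math.sqrt(sum(a * a for a in probs_a))
    norm_b = math.sqrt(sum(b * b for b in probs_b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_consensus_loss(feat_a, feat_b):
    """L2 distance between two heads' embedding features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
```

Both terms are zero when the two heads agree exactly and grow as their predictions or
embeddings diverge, which is the sense in which they push the views toward consensus.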

View-normalized Skeleton Generation for Action Recognition

  • Qingzhe Pan
  • Zhifu Zhao
  • Xuemei Xie
  • Jianan Li
  • Yuhan Cao
  • Guangming Shi

Skeleton-based action recognition has attracted great interest due to the low cost of
skeleton data acquisition and its high robustness to external conditions. A challenging
problem in skeleton-based action recognition is the large intra-class gap caused by
the various viewpoints of skeleton data, which makes action modeling difficult for
the network. To alleviate this problem, a feasible solution is to utilize label-supervised
methods to learn a view-normalization model. However, since the skeleton data in real
scenes is acquired from diverse viewpoints, it is difficult to obtain the corresponding
view-normalized skeleton as a label. Therefore, how to learn a view-normalization model
without the supervised label is the key to solving the view-variance problem. To this
end, we propose a view normalization-based action recognition framework, which is
composed of a view-normalization generative adversarial network (VN-GAN) and a classification
network. For VN-GAN, the model is designed to learn the mapping from the diverse-view
distribution to the normalized-view distribution. In detail, it is implemented by graph
convolution, where the generator predicts the transformation angles for view normalization
and the discriminator distinguishes the real input samples from the generated ones. For the
classification network, view-normalized data is processed to predict the action class. Without the
interference of view variances, the classification network can extract more discriminative
action features. Furthermore, by combining the joint and bone modalities, the proposed
method reaches state-of-the-art performance on the NTU RGB+D and NTU-120 RGB+D datasets.
In particular, on NTU-120 RGB+D, the accuracy is improved by 3.2% and 2.3% under the cross-subject
and cross-set criteria, respectively.
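
As an illustration of what applying predicted transformation angles to a skeleton could
look like, the Python sketch below rotates 3D joints by three angles. The abstract does
not state the angle parameterization, so the Euler-angle convention (rotation about x,
then y, then z) and the names `rotation_matrix` and `normalize_view` are assumptions.
In VN-GAN the angles would come from the generator, whereas here they are passed in directly.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(alpha, beta, gamma):
    """R = Rz(gamma) * Ry(beta) * Rx(alpha), angles in radians (assumed convention)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    rx = [[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]]
    ry = [[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]]
    rz = [[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def normalize_view(joints, angles):
    """Rotate every (x, y, z) joint of one skeleton frame by the given angles."""
    r = rotation_matrix(*angles)
    return [[sum(r[i][k] * joint[k] for k in range(3)) for i in range(3)]
            for joint in joints]
```

Because a rotation preserves distances between joints, bone lengths are unchanged and only
the viewpoint of the skeleton differs after the transformation.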

Learning Hierarchical Embedding for Video Instance Segmentation

  • Zheyun Qin
  • Xiankai Lu
  • Xiushan Nie
  • Xiantong Zhen
  • Yilong Yin

In this paper, we address video instance segmentation using a new generative model
that learns effective representations of the target and background appearance. We
propose to exploit hierarchical structural embedding over spatio-temporal space, which
is compact, powerful, and flexible in contrast to current tracking-by-detection methods.
Specifically, our model segments and tracks instances across space and time in a single
forward pass, which is formulated as hierarchical embedding learning. The model is
trained to locate the pixels belonging to specific instances over a video clip. We
first take advantage of a novel mixing function to better fuse spatio-temporal embeddings.
Moreover, we introduce normalizing flows to further improve the robustness of the
learned appearance embedding, which theoretically extends conventional generative
flows to a factorized conditional scheme. Comprehensive experiments on the video instance
segmentation benchmark, i.e., YouTube-VIS, demonstrate the effectiveness of the proposed
approach. Furthermore, we evaluate our method on an unsupervised video object segmentation
dataset to demonstrate its generalizability.

Text as Neural Operator: Image Manipulation by Text Instruction

  • Tianhao Zhang
  • Hung-Yu Tseng
  • Lu Jiang
  • Weilong Yang
  • Honglak Lee
  • Irfan Essa

In recent years, text-guided image manipulation has gained increasing attention in
the multimedia and computer vision community. The input to conditional image generation
has evolved from image-only to multimodality. In this paper, we study a setting that
allows users to edit an image with multiple objects using complex text instructions
to add, remove, or change the objects. The inputs of the task are multimodal including
(1) a reference image and (2) an instruction in natural language that describes desired
modifications to the image. We propose a GAN-based method to tackle this problem.
The key idea is to treat text as neural operators to locally modify the image feature.
We show that the proposed model performs favorably against recent strong baselines
on three public datasets. Specifically, it generates images of greater fidelity and
semantic relevance, and when used as an image query, leads to better retrieval performance.

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

  • Wenhao Wu
  • Yuxiang Zhao
  • Yanwu Xu
  • Xiao Tan
  • Dongliang He
  • Zhikang Zou
  • Jin Ye
  • Yingying Li
  • Mingde Yao
  • Zichao Dong
  • Yifeng Shi

Long-range and short-range temporal modeling are two complementary and crucial aspects
of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal
modeling and then average multiple snippet-level predictions to yield the final video-level
prediction. Thus, their video-level prediction does not consider spatio-temporal features
of how video evolves along the temporal dimension. In this paper, we introduce a novel
Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. To
be more specific, we attempt to generate a dynamic kernel for a convolutional operation
to aggregate long-range temporal information among adjacent snippets adaptively. The
DSA module is an efficient plug-and-play module and can be combined with the off-the-shelf
clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal
overhead. The final video architecture is coined DSANet. We conduct extensive experiments
on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something
V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit
various video recognition models significantly. For example, equipped with DSA modules,
the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400.
Codes are available at

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

  • Yulin Li
  • Yuxi Qian
  • Yuechen Yu
  • Xiameng Qin
  • Chengquan Zhang
  • Yan Liu
  • Kun Yao
  • Junyu Han
  • Jingtuo Liu
  • Errui Ding

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part
of Document Intelligence. Due to the complexity of content and layout in VRDs, structured
text understanding has been a challenging task. Most existing studies decouple this
problem into two sub-tasks: entity labeling and entity linking, which require a thorough
understanding of the context of documents at both token and segment levels. However,
little work has been concerned with the solutions that efficiently extract the structured
data from different levels. This paper proposes a unified framework named StrucTexT,
which is flexible and effective for handling both sub-tasks. Specifically, based on
the transformer, we introduce a segment-token aligned encoder to deal with the entity
labeling and entity linking tasks at different levels of granularity. Moreover, we
design a novel pre-training strategy with three self-supervised tasks to learn a richer
representation. StrucTexT uses the existing Masked Visual Language Modeling task and
the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate
the multi-modal information across text, image, and layout. We evaluate our method
for structured text understanding at segment-level and token-level and show it outperforms
the state-of-the-art counterparts with significantly superior performance on the FUNSD,
SROIE, and EPHOIE datasets.

Local Graph Convolutional Networks for Cross-Modal Hashing

  • Yudong Chen
  • Sen Wang
  • Jianglin Lu
  • Zhi Chen
  • Zheng Zhang
  • Zi Huang

Cross-modal hashing aims to map the data of different modalities into a common binary
space to accelerate the retrieval speed. Recently, deep cross-modal hashing methods
have shown promising performance by applying deep neural networks to facilitate feature
learning. However, the known supervised deep methods mainly rely on the labeled information
of datasets, which is insufficient to characterize the latent structures that exist
among different modalities. To mitigate this problem, in this paper, we propose to
use Graph Convolutional Networks (GCNs) to exploit the local structure information
of datasets for cross-modal hash learning. Specifically, a local graph is constructed
according to the neighborhood relationships between samples in deep feature spaces
and fed into GCNs to generate graph embeddings. Then, a within-modality loss is designed
to measure the inner products between deep features and graph embeddings so that hashing
networks and GCNs can be jointly optimized. By taking advantage of GCNs to assist
model training, the performance of hashing networks can be improved. Extensive experiments
on benchmarks verify the effectiveness of the proposed method.
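
As a rough illustration of the local-graph idea described above, the following minimal sketch builds a k-nearest-neighbor graph from feature vectors and applies one neighborhood-averaging propagation step. It is not the authors' implementation: the function names `knn_graph` and `gcn_layer`, the Euclidean distance, and the plain mean aggregation (with no learned weights or nonlinearity) are all assumptions made for illustration.

```python
import math

def knn_graph(features, k):
    """Link each sample to its k nearest neighbors (Euclidean distance is an assumption)."""
    n = len(features)
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def gcn_layer(features, neighbors):
    """One propagation step: average each node with its neighbors (self-loop included).
    A trained GCN would also apply a learned weight matrix and a nonlinearity."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        dim = len(features[i])
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out

# Toy "deep features": two tight clusters of two samples each.
feats = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
graph = knn_graph(feats, k=1)
embeddings = gcn_layer(feats, graph)
```

In this toy case each sample is linked only to its cluster mate, so the graph embeddings of both members of a cluster coincide, which is the kind of local structure that labels alone may not expose.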

Metric Learning for Anti-Compression Facial Forgery Detection

  • Shenhao Cao
  • Qin Zou
  • Xiuqing Mao
  • Dengpan Ye
  • Zhongyuan Wang

Detecting facial forgery images and videos is an increasingly important topic in multimedia
forensics. As forgery images and videos are usually compressed into different formats
such as JPEG and H.264 when circulating on the Internet, existing forgery-detection
methods trained on uncompressed data often suffer from significant performance degradation
in identifying them. To solve this problem, we propose a novel anti-compression facial
forgery detection framework, which learns a compression-insensitive embedding feature
space utilizing both original and compressed forgeries. Specifically, our approach
consists of three ideas: (i) extracting compression-insensitive features from both
uncompressed and compressed forgeries using an adversarial learning strategy; (ii)
learning a robust partition by constructing a metric loss that can reduce the distance
of the paired original and compressed images in the embedding space; (iii) improving
the accuracy of tampering localization with an attention-transfer module. Experimental
results demonstrate that the proposed method is highly effective in handling both
compressed and uncompressed facial forgery images.
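
To make idea (ii) concrete, here is a minimal sketch of a paired metric loss that shrinks as each original embedding moves closer to the embedding of its compressed counterpart. The abstract does not give the exact loss, so the mean squared Euclidean form and the name `paired_distance_loss` are assumptions for illustration only.

```python
def paired_distance_loss(orig_embeds, comp_embeds):
    """Mean squared Euclidean distance between each original embedding and
    the embedding of its compressed version (assumed form of the metric loss)."""
    if len(orig_embeds) != len(comp_embeds):
        raise ValueError("embeddings must come in original/compressed pairs")
    total = 0.0
    for o, c in zip(orig_embeds, comp_embeds):
        total += sum((x - y) ** 2 for x, y in zip(o, c))
    return total / len(orig_embeds)

# Toy embeddings: the first pair already agrees, the second pair is 2 units apart.
original = [[1.0, 0.0], [0.0, 1.0]]
compressed = [[1.0, 0.0], [0.0, 3.0]]
loss = paired_distance_loss(original, compressed)  # (0 + 4) / 2 = 2.0
```

Minimizing such a term pulls paired samples together, so a classifier operating on the embedding sees similar inputs regardless of compression.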

ASFM-Net: Asymmetrical Siamese Feature Matching Network for Point Completion

  • Yaqi Xia
  • Yan Xia
  • Wei Li
  • Rui Song
  • Kailang Cao
  • Uwe Stilla

We tackle the problem of object completion from point clouds and propose a novel point
cloud completion network employing an Asymmetrical Siamese Feature Matching strategy,
termed ASFM-Net. Specifically, a Siamese auto-encoder neural network is adopted
to map the partial and complete input point clouds into a shared latent space, which
can capture a detailed shape prior. Then we design an iterative refinement unit to generate
complete shapes with fine-grained details by integrating prior information. Experiments
are conducted on the PCN dataset and the Completion3D benchmark, demonstrating the
state-of-the-art performance of the proposed ASFM-Net. Our method achieves 1st
place on the Completion3D leaderboard and outperforms existing methods by a large
margin, about 12%. The codes and trained models are released publicly at

Capsule-based Object Tracking with Natural Language Specification

  • Ding Ma
  • Xiangqian Wu

Tracking with Natural-Language Specification (TNL) is a joint topic of understanding
the vision and natural language with a wide range of applications. In previous works,
the communication between two heterogeneous features of vision and language is mainly
through a simple dynamic convolution. However, the performance of prior works is capped
by the linguistic variation of natural language, which makes it difficult to model the
dynamically changing target and its surroundings. Meanwhile, natural language and vision
are first fused and then used for tracking, which makes it hard to model the query-focused
context. Query-focused tracking should pay more attention to context modeling to promote the
correlation between these two features. To address these issues, we propose a capsule-based
network, referred to as CapsuleTNL, which performs regression tracking with natural
language query. First, the visual and textual inputs are encoded with capsules,
which can establish not only the relationship between entities but also the relationship
between the parts of the entity itself. Then, we devise two interaction routing modules:
a visual-textual routing module to reduce the linguistic variation
of the input query and a textual-visual routing module to precisely incorporate query-based
visual cues simultaneously. To validate the potential of the proposed network for
visual object tracking, we evaluate our method on two large tracking benchmarks. The
experimental evaluation demonstrates the effectiveness of our capsule-based network.

Faster-PPN: Towards Real-Time Semantic Segmentation with Dual Mutual Learning for
Ultra-High Resolution Images

  • Bicheng Dai
  • Kaisheng Wu
  • Tong Wu
  • Kai Li
  • Yanyun Qu
  • Yuan Xie
  • Yun Fu

Despite recent progress on semantic segmentation, there still exist huge challenges
in semantic segmentation of high or ultra-high resolution images. Although the latest
collaborative global-local semantic segmentation methods such as GLNet [4] and PPN
[18] have achieved impressive results, they are inefficient and not fit for practical
applications. Thus, in this paper, we propose a novel and efficient collaborative
global-local framework on the basis of PPN, named Faster-PPN, for semantic segmentation
of high or ultra-high resolution images, which makes a better trade-off between
efficiency and effectiveness towards real-time speed. Specifically, we propose Dual
Mutual Learning to improve the feature representation of global and local branches,
which conducts knowledge distillation mutually between the global and local branches.
Furthermore, we design the Pixel Proposal Fusion Module to conduct a fine-grained
selection mechanism that further reduces the redundant pixels for fusion, resulting
in improved inference speed. The experimental results on three challenging
high or ultra-high resolution datasets, DeepGlobe, ISIC and BACH, demonstrate that Faster-PPN
achieves the best performance on accuracy, inference speed and memory usage compared
with state-of-the-art approaches. In particular, our method achieves real-time and near
real-time speed with 36 FPS and 17.7 FPS on ISIC and DeepGlobe, respectively.
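
The mutual distillation between the global and local branches can be pictured with the small sketch below, in which each branch is penalized for diverging from the other's predicted class distribution. The abstract does not specify the loss, so the KL-divergence form (borrowed from common mutual-learning practice) and the names `kl_divergence` and `mutual_distillation_losses` are assumptions, not the paper's definition.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_distillation_losses(global_probs, local_probs):
    """Return (loss for the global branch, loss for the local branch).
    Each branch treats the other's prediction as its soft target (assumed KL form)."""
    loss_for_global = kl_divergence(local_probs, global_probs)
    loss_for_local = kl_divergence(global_probs, local_probs)
    return loss_for_global, loss_for_local

# Toy per-pixel class probabilities predicted by the two branches.
global_probs = [0.7, 0.2, 0.1]
local_probs = [0.5, 0.3, 0.2]
losses = mutual_distillation_losses(global_probs, local_probs)
```

Both losses are positive while the branches disagree and drop to zero once their predictions coincide.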

Distributed Attention for Grounded Image Captioning

  • Nenglun Chen
  • Xingjia Pan
  • Runnan Chen
  • Lei Yang
  • Zhiwen Lin
  • Yuqiang Ren
  • Haolei Yuan
  • Xiaowei Guo
  • Feiyue Huang
  • Wenping Wang

We study the problem of weakly supervised grounded image captioning. That is, given
an image, the goal is to automatically generate a sentence describing the context
of the image with each noun word grounded to the corresponding region in the image.
This task is challenging due to the lack of explicit fine-grained region-word alignments
as supervision. Previous weakly supervised methods mainly explore various kinds of
regularization schemes to improve attention accuracy. However, their performance
is still far from that of the fully supervised ones. One main issue that has been ignored
is that the attention for generating visually groundable words may only focus on the
most discriminative parts and cannot cover the whole object. To this end, we propose
a simple yet effective method to alleviate the issue, termed the partial grounding
problem in our paper. Specifically, we design a distributed attention mechanism to
enforce the network to aggregate information from multiple spatially different regions
with consistent semantics while generating the words. Therefore, the union of the
focused region proposals should form a visual region that encloses the object of interest
completely. Extensive experiments have demonstrated the superiority of our proposed
method compared with the state of the art.

Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation

  • Zhiwei Liu
  • Xiangyu Zhu
  • Lu Yang
  • Xiang Yan
  • Ming Tang
  • Zhen Lei
  • Guibo Zhu
  • Xuetao Feng
  • Yan Wang
  • Jinqiao Wang

3D human pose and shape recovery from a monocular RGB image is a challenging task.
Existing learning-based methods depend heavily on weak supervision signals, e.g., 2D
and 3D joint locations, due to the lack of in-the-wild paired 3D supervision. However,
considering the 2D-to-3D ambiguities in these weak supervision labels, the
network easily gets stuck in local optima when trained with such labels. In this
paper, we reduce the ambiguity by optimizing multiple initializations. Specifically,
we propose a three-stage framework named Multi-Initialization Optimization Network
(MION). In the first stage, we strategically select different coarse 3D reconstruction
candidates which are compatible with the 2D keypoints of the input sample. Each coarse
reconstruction can be regarded as an initialization that leads to one optimization branch.
In the second stage, we design a mesh refinement transformer (MRT) to separately
refine each coarse reconstruction result via a self-attention mechanism. Finally,
a Consistency Estimation Network (CEN) is proposed to find the best result from multiple
candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
Experiments demonstrate that our Multi-Initialization Optimization Network outperforms
existing 3D mesh based methods on multiple public benchmarks.

Feedback Network for Mutually Boosted Stereo Image Super-Resolution and Disparity

  • Qinyan Dai
  • Juncheng Li
  • Qiaosi Yi
  • Faming Fang
  • Guixu Zhang

Under stereo settings, the problems of image super-resolution (SR) and disparity estimation
are interrelated in that the result of each problem could help to solve the other. The
effective exploitation of correspondence between different views facilitates the SR
performance, while the high-resolution (HR) features with richer details benefit the
correspondence estimation. Motivated by this, we propose a Stereo Super-Resolution
and Disparity Estimation Feedback Network (SSRDE-FNet), which simultaneously handles
stereo image super-resolution and disparity estimation in a unified framework
and lets them interact with each other to further improve their performance. Specifically,
the SSRDE-FNet is composed of two dual recursive sub-networks for left and right views.
Besides the cross-view information exploitation in the low-resolution (LR) space,
HR representations produced by the SR process are utilized to perform HR disparity
estimation with higher accuracy, through which the HR features can be aggregated to
generate a finer SR result. Afterward, the proposed HR Disparity Information Feedback
(HRDIF) mechanism delivers information carried by HR disparity back to previous layers
to further refine the SR image reconstruction. Extensive experiments demonstrate the
effectiveness and advancement of SSRDE-FNet.

Merging Multiple Template Matching Predictions in Intra Coding with Attentive Convolutional
Neural Network

  • Qijun Wang
  • Guodong Zheng

In intra coding, template matching prediction is an effective method to reduce the
non-local redundancy inside image content. However, the prediction indicated by the
best template matching is not always the actual best prediction. To solve this problem,
we propose a method that merges multiple template matching predictions through a
convolutional neural network with an attention module. The convolutional neural network
aims at exploring different combinations of the candidate template matching predictions,
and the attention module focuses on determining the most significant prediction candidate.
Besides, the spatial module in the attention mechanism can be utilized to model the relationship
between the original pixels in the current block and the reconstructed pixels in adjacent
regions (template). Compared to the directional intra prediction and traditional template
matching prediction, our method can provide a unified framework to generate prediction
with high accuracy. The experimental results show that, compared to the averaging strategy,
the BD-rate reductions can reach up to 4.7%, 5.5% and 18.3% on the classic standard
sequences (classB-classF), SIQAD dataset (screen content), and Urban100 dataset (natural
scenes) respectively, while the average bit rate savings are 0.5%, 2.7% and 1.8%, respectively.
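
The contrast between plain averaging and attention-weighted merging of candidate predictions can be sketched as follows. In the paper the weights come from a trained network; here the scores are hand-picked placeholders, and the names `softmax` and `merge_predictions` as well as the softmax weighting itself are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def merge_predictions(candidates, scores):
    """Blend candidate prediction blocks (flat pixel lists) using softmax weights.
    Equal scores reduce to the averaging strategy used as the baseline."""
    weights = softmax(scores)
    n_pixels = len(candidates[0])
    return [
        sum(w * cand[p] for w, cand in zip(weights, candidates))
        for p in range(n_pixels)
    ]

# Three hypothetical template-matching candidates for a 2-pixel block.
candidates = [[100.0, 102.0], [110.0, 108.0], [96.0, 94.0]]
averaged = merge_predictions(candidates, [0.0, 0.0, 0.0])   # plain average
focused = merge_predictions(candidates, [10.0, 0.0, 0.0])   # favors candidate 0
```

With equal scores the result is the per-pixel mean of the candidates, while a strongly peaked score vector yields a block that is almost identical to the favored candidate.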

Camera-Agnostic Person Re-Identification via Adversarial Disentangling Learning

  • Hao Ni
  • Jingkuan Song
  • Xiaosu Zhu
  • Feng Zheng
  • Lianli Gao

Despite the success of single-domain person re-identification (ReID), current supervised
models degrade dramatically when deployed to unseen domains, mainly due to the discrepancy
across cameras. To tackle this issue, we propose an Adversarial Disentangling Learning
(ADL) framework to decouple camera-related and ID-related features, which can be readily
used for camera-agnostic person ReID. ADL adopts a discriminative approach instead of the
mainstream generative styles in disentangling methods, e.g., GAN or VAE based, because
for the person ReID task only the information to discriminate IDs is needed, and the extra
information needed to generate images is redundant and may be noisy. Specifically, our model
involves a feature separation module that encodes images into two separate feature
spaces and a disentangled feature learning module that performs adversarial training
to minimize mutual information. We design an effective solution to approximate and
minimize mutual information by transforming it into a discrimination problem. The
two modules are co-designed to obtain strong generalization ability using only the
source dataset. Extensive experiments on three public benchmarks show that our method
outperforms the state-of-the-art generalizable person ReID model by a large margin.
Our code is publicly available at

SESSION: Session 15: Best Paper Session

Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial
Affective Expression Learning

  • Uttaran Bhattacharya
  • Elizabeth Childs
  • Nicholas Rewkowski
  • Dinesh Manocha

We present a generative adversarial network to synthesize 3D pose sequences of co-speech
upper-body gestures with appropriate affective expressions. Our network consists of
two components: a generator to synthesize gestures from a joint embedding space of
features encoded from the input speech and the seed poses, and a discriminator to
distinguish between the synthesized pose sequences and real 3D pose sequences. We
leverage the Mel-frequency cepstral coefficients and the text transcript computed
from the input speech in separate encoders in our generator to learn the desired sentiments
and the associated affective cues. We design an affective encoder using multi-scale
spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based
affective features. We use our affective encoder in both our generator, where it learns
affective features from the seed poses to guide the gesture synthesis, and our discriminator,
where it enforces the synthesized gestures to contain the appropriate affective expressions.
We perform extensive evaluations on two benchmark datasets for gesture synthesis from
speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared
to the best baselines, we improve the mean absolute joint error by 10-33%, the mean
acceleration difference by 8-58%, and the Fréchet Gesture Distance by 21-34%. We also
conduct a user study and observe that compared to the best current baselines, around
15.28% of participants indicated our synthesized gestures appear more plausible, and
around 16.32% of participants felt the gestures had more appropriate affective expressions
aligned with the speech.

Video Background Music Generation with Controllable Music Transformer

  • Shangzhe Di
  • Zeren Jiang
  • Si Liu
  • Zhaokai Wang
  • Leyan Zhu
  • Zexin He
  • Hongming Liu
  • Shuicheng Yan

In this work, we address the task of video background music generation. Some previous
works achieve effective music generation but are unable to generate melodious music
specifically for a given video, and none of them considers the video-music rhythmic
consistency. To generate the background music that matches the given video, we first
establish the rhythmic relationships between video and background music. In particular,
we connect timing, motion speed, and motion saliency from video with beat, simu-note
density, and simu-note strength from music, respectively. We then propose CMT, a Controllable
Music Transformer that enables the local control of the aforementioned rhythmic features,
as well as the global control of the music genre and the used instrument specified
by users. Objective and subjective evaluations show that the generated background
music has achieved satisfactory compatibility with the input videos, and at the same
time, impressive music quality.

PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition

  • Zhi Qiao
  • Yu Zhou
  • Jin Wei
  • Wei Wang
  • Yuan Zhang
  • Ning Jiang
  • Hongbin Wang
  • Weiping Wang

Nowadays, scene text recognition has attracted more and more attention due to its
various applications. Most state-of-the-art methods adopt an encoder-decoder framework
with attention mechanism, which generates text autoregressively from left to right.
Despite the convincing performance, the speed is limited because of the one-by-one
decoding strategy. As opposed to autoregressive models, non-autoregressive models
predict the results in parallel with a much shorter inference time, but the accuracy
falls behind the autoregressive counterpart considerably. In this paper, we propose
a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency.
Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster
and an iterative generation mechanism to make the predictions more accurate. In each
iteration, the context information is fully explored. To improve learning of the hidden
layer, we exploit the mimicking learning in the training phase, where an additional
autoregressive decoder is adopted and the parallel decoder mimics the autoregressive
decoder by fitting the outputs of the hidden layer. With the shared backbone between
the two decoders, the proposed PIMNet can be trained end-to-end without pre-training.
During inference, the branch of the autoregressive decoder is removed for a faster
speed. Extensive experiments on public benchmarks demonstrate the effectiveness and
efficiency of PIMNet. Our code is available in the supplementary material.

Theophany: Multimodal Speech Augmentation in Instantaneous Privacy Channels

  • Abhishek Kumar
  • Tristan Braud
  • Lik Hang Lee
  • Pan Hui

Many factors affect speech intelligibility in face-to-face conversations. These factors
lead conversation participants to speak louder and more distinctively, exposing the
content to potential eavesdroppers. To address these issues, we introduce Theophany,
a privacy-preserving framework for augmenting speech. Theophany establishes ad-hoc
social networks between conversation participants to exchange contextual information,
improving speech intelligibility in real-time. At the core of Theophany, we develop
the first privacy perception model that assesses the privacy risk of a face-to-face
conversation based on its topic, location, and participants. This framework enables
the development of privacy-preserving applications for face-to-face conversation. We implement
the framework within a prototype system that augments the speaker's speech with real-life
subtitles to overcome the loss of contextual cues brought by mask-wearing and social
distancing during the COVID-19 pandemic. We evaluate Theophany through a user survey
and a user study on 53 and 17 participants, respectively. Theophany's privacy predictions
match the participants' privacy preferences with an accuracy of 71.26%. Users considered
Theophany to be useful to protect their privacy (3.88/5), easy to use (4.71/5), and
enjoyable to use (4.24/5). We also raise the question of demographic and individual
differences in the design of privacy-preserving solutions.

aBio: Active Bi-Olfactory Display Using Subwoofers for Virtual Reality

  • You-Yang Hu
  • Yao-Fu Jan
  • Kuan-Wei Tseng
  • You-Shin Tsai
  • Hung-Ming Sung
  • Jin-Yao Lin
  • Yi-Ping Hung

Including olfactory cues in virtual reality (VR) would enhance user immersion in the
virtual environment, and precise control of smell would facilitate a more realistic
experience for users. In this paper, we present aBio, an active bi-olfactory display
system that delivers scents precisely to specific locations rather than diffusing
scented air into the atmosphere. aBio provides users with a natural olfactory experience
in free air by colliding two vortex rings launched from dual speaker-based vortex
generators, which also has the effect of cushioning the force of air impact. According
to the various requests of different applications, the collision point of the vortex
rings can be positioned anywhere in front of the user's nose. To verify the effectiveness
of our device and understand user sensations when using different parameters in our
system, we conduct a series of experiments and user studies. The results show that
the proposed system is effective in the sense that users perceive smell without sensible
haptic disturbance while the system consumes only a very small amount of fragrant
essential oil. We believe that aBio has great potential for increasing the level of
presence in VR by delivering smells with high efficiency.

SESSION: Poster Session 3

Learning to Understand Traffic Signs

  • Yunfei Guo
  • Wei Feng
  • Fei Yin
  • Tao Xue
  • Shuqi Mei
  • Cheng-Lin Liu

One of the intelligent transportation system's critical tasks is to understand traffic
signs and convey traffic information to humans. However, most related works are focused
on the detection and recognition of traffic sign texts or symbols, which is not sufficient
for understanding. Besides, there has been no public dataset for traffic sign understanding
research. Our work takes the first step towards addressing this problem. First, we
propose a "CASIA-Tencent Chinese Traffic Sign Understanding Dataset" (CTSU Dataset),
which contains 5000 images of traffic signs with rich semantic descriptions. Second,
we introduce a novel multi-task learning architecture that extracts text and symbol
information from traffic signs, reasons about the relationship between texts and symbols,
classifies signs into different categories, and finally, composes the descriptions
of the signs. Experiments show that the task of traffic sign understanding is achievable,
and our architecture demonstrates state-of-the-art and superior performance. The CTSU
Dataset is available at

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative
Adversarial Networks

  • Yanyuan Qiao
  • Qi Chen
  • Chaorui Deng
  • Ning Ding
  • Yuankai Qi
  • Mingkui Tan
  • Xincheng Ren
  • Qi Wu

Despite recent significant progress on generative models, context-rich text-to-image
synthesis depicting multiple complex objects is still non-trivial. The main challenges
lie in the ambiguous semantics of a complex description and the intricate scene of
an image with various objects, different positional relationships and diverse appearances.
To address these challenges, we propose R-GAN, which can generate reasonable images
according to the given text in a human-like way. Specifically, just like humans will
first find and settle the essential elements to create a simple sketch, we first capture
a monolithic-structural text representation by building a scene graph to find the
essential semantic elements. Then, based on this representation, we design a bounding
box generator to estimate the layout with position and size of target objects, and
a subsequent shape generator, which draws a finely detailed shape for each object. Unlike
previous work that only generates coarse shapes blindly, we introduce a coarse-to-fine
shape generator based on a shape knowledge base. Finally, to complete the image
synthesis, we propose a multi-modal geometry-aware spatially-adaptive generator conditioned
on the monolithic-structural text representation and the geometry-aware map of the
shapes. Extensive experiments on the real-world dataset MSCOCO show the superiority
of our method in terms of both quantitative and qualitative metrics.

Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection

  • Chen Zhang
  • Runmin Cong
  • Qinwei Lin
  • Lin Ma
  • Feng Li
  • Yao Zhao
  • Sam Kwong

The popularity and promotion of depth maps have brought new vigor and vitality into
salient object detection (SOD), and a large number of RGB-D SOD algorithms have been proposed,
mainly concentrating on how to better integrate cross-modality features from the RGB image
and depth map. For the cross-modality interaction in the feature encoder, existing methods
either treat RGB and depth modalities indiscriminately, or habitually utilize
depth cues only as auxiliary information for the RGB branch. In contrast, we reconsider
the status of two modalities and propose a novel Cross-modality Discrepant Interaction
Network (CDINet) for RGB-D SOD, which differentially models the dependence of two
modalities according to the feature representations of different layers. To this end,
two components are designed to implement the effective cross-modality interaction:
1) the RGB-induced Detail Enhancement (RDE) module leverages the RGB modality to enhance
the details of the depth features in the low-level encoder stage; 2) the Depth-induced
Semantic Enhancement (DSE) module transfers the object positioning and internal consistency
of depth features to the RGB branch in the high-level encoder stage. Furthermore, we also
design a Dense Decoding Reconstruction (DDR) structure, which constructs a semantic
block by combining multi-level encoder features to upgrade the skip connection in
the feature decoding. Extensive experiments on five benchmark datasets demonstrate
that our network outperforms 15 state-of-the-art methods both quantitatively and
qualitatively. Our code is publicly available at:

Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes

  • Junda Wu
  • Tong Yu
  • Shuai Li

In vision-language retrieval systems, users provide natural language feedback to find
target images. Vision-language explanations in the systems can better guide users
to provide feedback and thus improve the retrieval. However, developing explainable
vision-language retrieval systems can be challenging, due to limited labeled multimodal
data. In the retrieval of complex scenes, the issue of limited labeled data can be
more severe. With multiple objects in the complex scenes, each user query may not
exhaustively describe all objects in the desired image and thus more labeled queries
are needed. The issue of limited labeled data can cause data selection biases, and
result in spurious correlations learned by the models. When learning spurious correlations,
existing explainable models may not be able to accurately extract regions from images
and keywords from user queries.

In this paper, we discover that deconfounded learning is an important step to provide
better vision-language explanations. Thus we propose a deconfounded explainable vision-language
retrieval system. By introducing deconfounded learning to pretrain our vision-language
model, the spurious correlations in the model can be reduced through interventions
by potential confounders. This helps to train more accurate representations and further
enable better explainability. Based on explainable retrieval results, we propose novel
interactive mechanisms. In such interactions, users can better understand why the
system returns particular results and give feedback effectively improving the results.
This additional feedback is sample efficient and thus alleviates the data limitation
problem. In extensive experiments, our system achieves an improvement of about 60%
over the state of the art.

Long Short-term Convolutional Transformer for No-Reference Video Quality Assessment

  • Junyong You

No-reference video quality assessment (NR-VQA) has not widely benefited from deep learning,
mainly due to the complexity, diversity and particularity of modelling spatial and
temporal characteristics in quality assessment scenarios. Image quality assessment
(IQA) performed on video frames plays a key role in NR-VQA. A perceptual hierarchical
network (PHIQNet) with an integrated attention module is first proposed that can appropriately
simulate the visual mechanisms of contrast sensitivity and selective attention in
IQA. Subsequently, perceptual quality features of video frames derived from PHIQNet
are fed into a long short-term convolutional Transformer (LSCT) architecture to predict
the perceived video quality. LSCT consists of a CNN that formulates quality features in
video frames within short-term units, which are then fed into a Transformer to capture
the long-range dependence and attention allocation over temporal units. Such an architecture
is in line with the intrinsic properties of VQA. Experimental results on publicly
available video quality databases have demonstrated that the LSCT architecture based
on PHIQNet significantly outperforms state-of-the-art video quality models.

Automatic Channel Pruning with Hyper-parameter Search and Dynamic Masking

  • Baopu Li
  • Yanwen Fan
  • Zhihong Pan
  • Yuchen Bian
  • Gang Zhang

Modern deep neural network models tend to be large and computationally intensive.
One typical solution to this issue is model pruning. However, most current model pruning
algorithms depend on hand crafted rules or need to input the pruning ratio beforehand.
To overcome this problem, we propose a learning based automatic channel pruning algorithm
for deep neural networks, which is inspired by recent automatic machine learning (AutoML).
A two-objective pruning problem that aims for the weights and the remaining
channels for each layer is first formulated. An alternative optimization approach
is then proposed to derive the channel numbers and weights simultaneously. In the
process of pruning, we utilize a searchable hyper-parameter, remaining ratio, to denote
the number of channels in each convolution layer, and then a dynamic masking process
is proposed to describe the corresponding channel evolution. To adjust the trade-off
between accuracy of a model and the pruning ratio of floating point operations, a
new loss function is further introduced. Extensive experimental results on benchmark
datasets demonstrate that our scheme achieves competitive results for neural network

SVHAN: Sequential View Based Hierarchical Attention Network for 3D Shape Recognition

  • Yue Zhao
  • Weizhi Nie
  • An-An Liu
  • Zan Gao
  • Yuting Su

As an important field of multimedia, 3D shape recognition has attracted much research
attention in recent years. A lot of deep learning models have been proposed for effective
3D shape representation. View-based methods show superiority due to the comprehensive
exploration of the visual characteristics with the help of established 2D CNN architectures.
Generally, the current approaches have the following disadvantages: First, the
vast majority of methods do not consider the sequential information among the
multiple views, which can provide descriptive characteristics for shape representation.
Second, the incomplete exploration of the multi-view correlations directly affects
the discrimination of shape descriptors. Finally, roughly aggregating multi-view features
leads to the loss of descriptive information, which limits the shape representation
effectiveness. To handle these issues, we propose a novel sequential view based hierarchical
attention network (SVHAN) for 3D shape recognition. Specifically, we first divide
the view sequence into several view blocks. Then, we introduce a novel hierarchical
feature aggregation module (HFAM), which hierarchically exploits the view-level, block-level,
and shape-level features; the intra- and inter-view-block correlations are also captured
to improve the discrimination of learned features. Subsequently, a novel selective
fusion module (SFM) is designed for feature aggregation, considering the correlations
between different levels and preserving effective information. Finally, discriminative
and informative shape descriptors are generated for the recognition task. We validate
the effectiveness of our proposed method on two public databases. The experimental
results show the superiority of SVHAN against the current state-of-the-art approaches.
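
The view, block, and shape hierarchy described above can be illustrated with a minimal sketch. It is not the authors' implementation: softmax-weighted pooling stands in for the attention used in HFAM, the scoring of each feature vector by the sum of its entries is an assumption, and the names `split_into_blocks`, `attention_pool`, and `shape_descriptor` are hypothetical.

```python
import math

def split_into_blocks(views, block_size):
    """Divide an ordered view sequence into consecutive view blocks."""
    return [views[i:i + block_size] for i in range(0, len(views), block_size)]

def attention_pool(features):
    """Softmax-weighted average of feature vectors.

    Assumption: each vector is scored by the sum of its entries;
    a trained model would learn this scoring function instead.
    """
    scores = [sum(f) for f in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, features)) / total
        for d in range(dim)
    ]

def shape_descriptor(views, block_size):
    """Aggregate view-level features into block-level, then shape-level."""
    blocks = split_into_blocks(views, block_size)
    block_features = [attention_pool(block) for block in blocks]
    return attention_pool(block_features)

# Six toy 2-D view features, ordered along the view sequence.
views = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.2], [0.9, 0.1], [0.4, 0.6]]
descriptor = shape_descriptor(views, block_size=2)
```

Because each pooling step is a convex combination, the resulting descriptor keeps the dimensionality of the view features and stays within their value range.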

ASFD: Automatic and Scalable Face Detector

  • Jian Li
  • Bin Zhang
  • Yabiao Wang
  • Ying Tai
  • Zhenyu Zhang
  • Chengjie Wang
  • Jilin Li
  • Xiaoming Huang
  • Yili Xia

Along with current multi-scale based detectors, Feature Aggregation and Enhancement
(FAE) modules have shown superior performance gains for cutting-edge object detection.
However, these hand-crafted FAE modules show inconsistent improvements on face detection,
which is mainly due to the significant distribution difference between their training
and application corpora, i.e. COCO vs. WIDER Face. To tackle this problem, we
analyse the effect of data distribution, and consequently propose to search for an effective
FAE architecture, termed AutoFAE, by differentiable architecture search, which outperforms
all existing FAE modules in face detection by a considerable margin. Upon the found
AutoFAE and existing backbones, a supernet is further built and trained, which automatically
obtains a family of detectors under the different complexity constraints. Extensive
experiments conducted on popular benchmarks, i.e. WIDER Face and FDDB, demonstrate
the state-of-the-art performance-efficiency trade-off for the proposed automatic and
scalable face detector (ASFD) family. In particular, our strong ASFD-D6 outperforms
the best competitor with AP 96.7/96.2/92.1 on WIDER Face test, and the lightweight
ASFD-D0 costs about 3.1 ms, i.e. more than 320 FPS, on the V100 GPU with VGA-resolution

BridgeNet: A Joint Learning Network of Depth Map Super-Resolution and Monocular Depth Estimation

  • Qi Tang
  • Runmin Cong
  • Ronghui Sheng
  • Lingzhi He
  • Dan Zhang
  • Yao Zhao
  • Sam Kwong

Depth map super-resolution is a task with high practical application requirements
in the industry. Existing color-guided depth map super-resolution methods usually
necessitate an extra branch to extract high-frequency detail information from the RGB
image to guide the low-resolution depth map reconstruction. However, because there
are still some differences between the two modalities, direct information transmission
in the feature dimension or edge map dimension cannot achieve satisfactory results,
and may even trigger texture copying in areas where the structures of the RGB-D pair
are inconsistent. Inspired by multi-task learning, we propose a joint learning
network of depth map super-resolution (DSR) and monocular depth estimation (MDE) without
introducing additional supervision labels. For the interaction of two subnetworks,
we adopt a differentiated guidance strategy and design two bridges correspondingly.
One is the high-frequency attention bridge (HABdg) designed for the feature encoding
process, which learns the high-frequency information of the MDE task to guide the
DSR task. The other is the content guidance bridge (CGBdg) designed for the depth
map reconstruction process, which provides the content guidance learned from the DSR task
for the MDE task. The entire network architecture is highly portable and can provide a
paradigm for associating the DSR and MDE tasks. Extensive experiments on benchmark
datasets demonstrate that our method achieves competitive performance. Our code and
models are available at

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

  • Yuxi Li
  • Boshen Zhang
  • Jian Li
  • Yabiao Wang
  • Weiyao Lin
  • Chengjie Wang
  • Jilin Li
  • Feiyue Huang

In this paper, we place the atomic action detection problem into a Long-Short Term
Context (LSTC) to analyze how the temporal reliance among video signals affects the
action detection results. To do this, we decompose the action recognition pipeline
into short-term and long-term reliance, in terms of the hypothesis that the two kinds
of context are conditionally independent given the objective action instance. Within
our design, a local aggregation branch is utilized to gather dense and informative
short-term cues, while a high order long-term inference branch is designed to reason
the objective action class from high-order interaction between actor and other person
or person pairs. Both branches independently predict the context-specific actions and
the results are merged in the end. We demonstrate that both temporal grains are beneficial
to atomic action recognition. On the mainstream benchmarks of atomic action detection,
our design can bring significant performance gain from the existing state-of-the-art
UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation

  • Taehun Kim
  • Hyemin Lee
  • Daijin Kim

We propose Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation
which considers an uncertain area of the saliency map. We construct a modified
U-Net-shaped network with an additional encoder and decoder, compute a saliency
map in each bottom-up stream prediction module, and propagate it to the next prediction
module. In each prediction module, the previously predicted saliency map is used to
compute foreground, background and uncertain area maps, and we aggregate the feature
map with the three area maps for each representation. Then we compute the relation between
each representation and each pixel in the feature map. We conduct experiments on five
popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and
CVC-300, and our method achieves state-of-the-art performance. In particular, we achieve
76.6% mean Dice on the ETIS dataset, which is a 13.8% improvement over the previous
state-of-the-art method. Source code is publicly available at

Weight Evolution: Improving Deep Neural Networks Training through Evolving Inferior
Weight Values

  • Zhenquan Lin
  • Kailing Guo
  • Xiaofen Xing
  • Xiangmin Xu

To obtain good performance, convolutional neural networks are usually over-parameterized.
This phenomenon has stimulated two interesting topics: pruning the unimportant weights
for compression and reactivating the unimportant weights to make full use of network
capability. However, current weight reactivation methods usually reactivate entire
filters, which may not be precise enough. Historically, the popularity
of filter pruning is mainly due to its friendliness to hardware implementation, but
pruning at a finer structure level, i.e., weight elements, usually leads to better
network performance. We study the problem of weight element reactivation in this paper.
Motivated by evolution, we select the unimportant filters and update their unimportant
elements by combining them with the important elements of important filters, just
like gene crossover to produce better offspring, and the proposed method is called
weight evolution (WE). WE is mainly composed of four strategies. We propose a global
selection strategy and a local selection strategy and combine them to locate the unimportant
filters. A forward matching strategy is proposed to find the matched important filters
and a crossover strategy is proposed to utilize the important elements of the important
filters for updating unimportant filters. WE can be plugged into existing network architectures.
Comprehensive experiments show that WE outperforms the other reactivation methods
and plug-in training methods with typical convolutional neural networks, especially
lightweight networks. Our code is available at

Coarse to Fine: Domain Adaptive Crowd Counting via Adversarial Scoring Network

  • Zhikang Zou
  • Xiaoye Qu
  • Pan Zhou
  • Shuangjie Xu
  • Xiaoqing Ye
  • Wenhao Wu
  • Jin Ye

Recent deep networks have convincingly demonstrated high capability in crowd counting,
which is a critical task attracting widespread attention due to its various industrial
applications. Despite such progress, trained data-dependent models usually cannot
generalize well to unseen scenarios because of the inherent domain shift. To alleviate
this issue, this paper proposes a novel adversarial scoring network (ASNet) to gradually
bridge the gap across domains from coarse to fine granularity. Specifically, at the
coarse-grained stage, we design a dual-discriminator strategy to adapt source domain
to be close to the targets from the perspectives of both global and local feature
space via adversarial learning. The distributions of the two domains can thus be
aligned roughly. At the fine-grained stage, we explore the transferability of source
characteristics by scoring how similar the source samples are to target ones at
multiple levels, based on the generative probability derived from the coarse stage. Guided
by these hierarchical scores, the transferable source features are properly selected
to enhance the knowledge transfer during the adaptation process. With the coarse-to-fine
design, the generalization bottleneck induced by the domain discrepancy can be effectively
alleviated. Three sets of migration experiments show that the proposed methods achieve
state-of-the-art counting performance compared with major unsupervised methods.

Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting

  • Qiming Wu
  • Zhikang Zou
  • Pan Zhou
  • Xiaoqing Ye
  • Binghui Wang
  • Ang Li

Crowd counting has drawn much attention due to its importance in safety-critical surveillance
systems. In particular, deep neural network (DNN) methods have significantly reduced
estimation errors for crowd counting tasks. Recent studies have demonstrated that
DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible
perturbations could mislead DNNs to make false predictions. In this work, we propose
a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically
evaluate the robustness of crowd counting models, where the attacker's goal is to
create an adversarial perturbation that severely degrades their performance, thus
potentially leading to public safety accidents (e.g., stampedes). Specifically, the proposed
attack leverages the extreme-density background information of input images to generate
robust adversarial patches via a series of transformations (e.g., interpolation, rotation,
etc.). We observe that by perturbing less than 6% of image pixels, our attacks severely
degrade the performance of crowd counting systems, both digitally and physically.
To better enhance the adversarial robustness of crowd counting models, we propose
the first regression model-based Randomized Ablation (RA), which is more effective
than Adversarial Training (ADT) (the Mean Absolute Error of RA is 5 lower than that of ADT on
clean samples and 30 lower on adversarial examples). Extensive experiments
on five crowd counting models demonstrate the effectiveness and generality of the
proposed method.
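
As a rough illustration of how randomized ablation can be applied to a regression output such as a crowd count, the sketch below keeps a few randomly chosen pixels, ablates the rest, and aggregates the predictions over many such draws. It is a toy under stated assumptions, not the method evaluated in the paper: aggregation by the median is an assumption, `toy_counter` is a hypothetical stand-in for a counting network, and images are flattened to plain lists.

```python
import random
import statistics

def ablate(image, keep, rng, fill=0.0):
    """Retain `keep` randomly chosen pixels; overwrite all others with `fill`."""
    kept = set(rng.sample(range(len(image)), keep))
    return [pixel if i in kept else fill for i, pixel in enumerate(image)]

def smoothed_count(model, image, keep, samples, seed=0):
    """Median prediction over many random ablations (the median is an assumption)."""
    rng = random.Random(seed)
    predictions = [model(ablate(image, keep, rng)) for _ in range(samples)]
    return statistics.median(predictions)

def toy_counter(image):
    """Hypothetical regressor: the 'count' is the sum of pixel intensities."""
    return sum(image)

clean = [1.0] * 100
attacked = list(clean)
for i in range(3):  # a 3-pixel "patch", i.e. 3% of the image
    attacked[i] = 100.0

# Shift caused by the patch without and with randomized ablation.
raw_shift = abs(toy_counter(attacked) - toy_counter(clean))
smoothed_shift = abs(
    smoothed_count(toy_counter, attacked, keep=10, samples=501)
    - smoothed_count(toy_counter, clean, keep=10, samples=501)
)
```

The intuition is that a small patch is absent from most ablated copies, so an aggregate such as the median is dominated by copies the patch never touched. In practice the base model would typically be trained on ablated inputs as well.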

Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for
Image-Text Matching

  • Pengpeng Zeng
  • Lianli Gao
  • Xinyu Lyu
  • Shuaiqi Jing
  • Jingkuan Song

Image-Text Matching (ITM) is a fundamental and emerging task, which plays a key role
in cross-modal understanding. It remains a challenge because prior works mainly focus
on learning fine-grained (i.e. coarse and/or phrase) correspondence, without considering
the syntactical correspondence. In theory, a sentence is not only a set of words or
phrases but also a syntactic structure, consisting of a set of basic syntactic tuples
(i.e. (attribute) object - predicate - (attribute) subject). Inspired by this, we propose
a Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency (CSCC)
for Image-text Matching by simultaneously exploring the multiple-level cross-modal
alignments across the conceptual and syntactic levels with a consistency constraint. Specifically,
a conceptual-level cross-modal alignment is introduced for exploring the fine-grained
correspondence, while a syntactical-level cross-modal alignment is proposed to explicitly
learn a high-level syntactic similarity function. Moreover, an empirical cross-level
consistent attention loss is introduced to maintain the consistency between cross-modal
attentions obtained from the above two cross-modal alignments. To justify our method,
comprehensive experiments are conducted on two public benchmark datasets, i.e. MS-COCO
(1K and 5K) and Flickr30K, which show that our CSCC outperforms state-of-the-art methods
with fairly competitive improvements.

SSPU-Net: Self-Supervised Point Cloud Upsampling via Differentiable Rendering

  • Yifan Zhao
  • Le Hui
  • Jin Xie

Point clouds obtained from 3D sensors are usually sparse. Existing methods mainly
focus on upsampling sparse point clouds in a supervised manner by using dense ground
truth point clouds. In this paper, we propose a self-supervised point cloud upsampling
network (SSPU-Net) to generate dense point clouds without using ground truth. To achieve
this, we exploit the consistency between the input sparse point cloud and generated
dense point cloud for the shapes and rendered images. Specifically, we first propose
a neighbor expansion unit (NEU) to upsample the sparse point clouds, where the local
geometric structures of the sparse point clouds are exploited to learn weights for
point interpolation. Then, we develop a differentiable point cloud rendering unit
(DRU) as an end-to-end module in our network to render the point cloud into multi-view
images. Finally, we formulate a shape-consistent loss and an image-consistent loss
to train the network so that the shapes of the sparse and dense point clouds are as
consistent as possible. Extensive results on the CAD and scanned datasets demonstrate
that our method can achieve impressive results in a self-supervised manner.
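
To give a feel for the idea of upsampling by point interpolation and then checking shape consistency between the sparse and dense clouds, here is a small sketch. It rests on assumptions the abstract does not confirm: a fixed interpolation weight replaces the learned weights of the NEU, and the Chamfer distance is used as a hypothetical shape-consistency measure. The names `upsample_by_interpolation` and `chamfer_distance` are illustrative only.

```python
import math

def upsample_by_interpolation(points, weight=0.5):
    """Add one point per input point, interpolated toward its nearest neighbor.

    Assumption: a fixed `weight` replaces the weights a network would learn.
    """
    dense = list(points)
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        nearest = min(others, key=lambda q: math.dist(p, q))
        dense.append(tuple(weight * a + (1 - weight) * b for a, b in zip(p, nearest)))
    return dense

def chamfer_distance(a, b):
    """Symmetric mean squared nearest-neighbor distance between two point sets."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

sparse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dense = upsample_by_interpolation(sparse)
consistency = chamfer_distance(sparse, dense)
```

A lower value of `consistency` means the dense cloud stays closer to the shape sampled by the sparse input; a training loss built on such a measure needs no dense ground truth.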

VmAP: A Fair Metric for Video Object Detection

  • Anupam Sobti
  • Vaibhav Mavi
  • M Balakrishnan
  • Chetan Arora

Video object detection is the task of detecting objects in a sequence of frames, typically,
with a significant overlap in content among consecutive frames. Mean Average Precision
(mAP) was originally proposed for evaluating object detection techniques in independent
frames, but has been used for evaluating video based object detectors as well. This
is undesirable since the average precision over all frames masks the biases that a
certain object detector might have against certain types of objects depending on the
number of frames for which the object is present in a video sequence. In this paper
we show several disadvantages of mAP as a metric for evaluating video based object
detection. Specifically, we show that: (a) some object detectors could be severely
biased against some specific kind of objects, such as small, blurred, or low contrast
objects, and such differences may not be reflected in mAP based evaluation; (b) operating
a video based object detector at the best frame based precision/recall value (high
F1 score) may lead to many false positives without a significant increase in the number
of objects detected; (c) mAP does not take into account that tracking can potentially be
used to recover missed detections in the temporal neighborhood, although this can be accounted
for when evaluating detectors. As an alternative, we suggest a novel evaluation metric
(VmAP) which takes the focus away from evaluating detections on every frame. Unlike
mAP, VmAP rewards a high recall of different object views throughout the video. We
form sets of bounding boxes having similar views of an object in a temporal neighborhood
and use a set-level recall for evaluation. We show that VmAP is able to address all
the challenges with mAP listed above. Our experiments demonstrate hidden biases
in object detectors, show up to a 99% reduction in false positives while maintaining
similar object recall, and show a 9% improvement in correlation with post-tracking
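
The difference between frame-level and set-level recall can be sketched as follows. This is a simplified illustration rather than the VmAP definition: sets are formed here from fixed-length temporal windows (an assumption, whereas the abstract groups boxes with similar views), matching of detections to ground truth is assumed to have been done upstream, and the function names are hypothetical.

```python
def form_sets(track_frames, window):
    """Group the frame indices of one object track into temporal sets.

    Assumption: a new set starts once a frame is `window` or more frames
    after the first frame of the current set.
    """
    sets = []
    for frame in sorted(track_frames):
        if sets and frame - sets[-1][0] < window:
            sets[-1].append(frame)
        else:
            sets.append([frame])
    return sets

def set_level_recall(track_frames, detected_frames, window):
    """Fraction of sets in which the object is detected at least once."""
    sets = form_sets(track_frames, window)
    hits = sum(1 for s in sets if any(f in detected_frames for f in s))
    return hits / len(sets)

def frame_level_recall(track_frames, detected_frames):
    """Fraction of individual frames in which the object is detected."""
    hits = sum(1 for f in track_frames if f in detected_frames)
    return hits / len(track_frames)

track = list(range(12))   # the object is visible in frames 0..11
detected = {1, 5, 9}      # frames where a detection matched the object
per_frame = frame_level_recall(track, detected)
per_set = set_level_recall(track, detected, window=4)
```

In this toy track the detector fires in only 3 of 12 frames, yet it hits every 4-frame set, so a set-level measure credits it fully while a frame-level one does not.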

Source Data-free Unsupervised Domain Adaptation for Semantic Segmentation

  • Mucong Ye
  • Jing Zhang
  • Jinpeng Ouyang
  • Ding Yuan

Deep learning-based semantic segmentation methods require a huge amount of
training images with pixel-level annotations. Unsupervised domain adaptation (UDA)
for semantic segmentation enables transferring knowledge learned from the synthetic
data (source domain) with low-cost annotations to the real images (target domain).
However, current UDA methods mostly require full access to the source domain data
for feasible adaptation, which limits their applications in real-world scenarios with
privacy, storage, or transmission issues. To this end, this paper identifies and addresses
a more practical but challenging problem of UDA for semantic segmentation, where access
to the original source domain data is forbidden. In other words, only the pre-trained
source model and unlabelled target domain data are available for adaptation. To tackle
the problem, we propose to construct a set of source domain virtual data to mimic
the source domain distribution by identifying the target domain high-confidence samples
predicted by the pre-trained source model. Then by analyzing the data properties in
the cross-domain semantic segmentation tasks, we propose an uncertainty and prior
distribution-aware domain adaptation method to align the virtual source domain and
the target domain with both adversarial learning and self-training strategies. Extensive
experiments on three cross-domain semantic segmentation datasets with in-depth analyses
verify the effectiveness of the proposed method.
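
The step of picking high-confidence target samples to serve as virtual source data can be sketched as a simple confidence filter. The sketch is an assumption-laden toy: it works on one logit vector per sample rather than per pixel, the threshold of 0.9 is arbitrary, and `select_virtual_source` is a hypothetical name, not the paper's procedure.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_virtual_source(target_logits, threshold=0.9):
    """Keep target samples whose top predicted probability reaches `threshold`.

    Returns (sample index, pseudo-label, confidence) triples.
    """
    selected = []
    for index, logits in enumerate(target_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence), confidence))
    return selected

# Toy outputs of a pre-trained source model on three target samples.
target_logits = [
    [5.0, 0.0, 0.0],   # confident: class 0
    [1.0, 0.9, 1.1],   # ambiguous
    [0.0, 6.0, 1.0],   # confident: class 1
]
virtual_source = select_virtual_source(target_logits)
```

Only the two confident samples are retained, each paired with the pseudo-label predicted by the source model; the ambiguous sample is left to the target side of the adaptation.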

Yes, "Attention Is All You Need", for Exemplar based Colorization

  • Wang Yin
  • Peng Lu
  • Zhaoran Zhao
  • Xujun Peng

Conventional exemplar based image colorization tends to transfer colors only from the reference
image to the grayscale image, based on the semantic correspondence between them. However,
its practical capability is limited when semantic correspondence can hardly be
found. To overcome this issue, additional information, such as colors from a database,
is normally introduced. However, it is a great challenge to consider color information
from the reference image and the database simultaneously, because a unified framework
to model different color information is lacking and the multi-modal ambiguity in the database cannot
be removed easily. Also, it is difficult to fuse different color information effectively.
Thus, a general attention based colorization framework is proposed in this work, where
the color histogram of the reference image is adopted as a prior to eliminate the ambiguity
in the database. Moreover, a sparse loss is designed to guarantee the success of information
fusion. Both qualitative and quantitative experimental results show that the proposed
approach achieves better colorization performance compared with the state-of-the-art
methods on public databases with different quality metrics.

Heuristic Depth Estimation with Progressive Depth Reconstruction and Confidence-Aware Loss

  • Jiehua Zhang
  • Liang Li
  • Chenggang Yan
  • Yaoqi Sun
  • Tao Shen
  • Jiyong Zhang
  • Zhan Wang

Recently, deep learning-based depth estimation has shown promising results, especially
with the help of sparse depth reference samples. Existing works focus on directly
inferring the depth information from sparse samples with high confidence. In this
paper, we propose a Heuristic Depth Estimation Network (HDEN) with progressive depth
reconstruction and confidence-aware loss. The HDEN leverages the reference samples
with low confidence to distill the spatial geometric and local semantic information
for dense depth prediction. Specifically, we first train a U-Net to generate
a coarse-level dense reference map. Second, the progressive depth reconstruction module
successively reconstructs the fine-level dense depth map from different scales, where
a multi-level upsampling block is designed to recover the local structure of objects.
Finally, the confidence-aware loss is proposed to trigger the reference samples with
low confidence, which forces the model to focus on estimating the depth of tiny
structures. Extensive experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show
the effectiveness of our method. Visualization results demonstrate that the dense
depth maps generated by HDEN have better consistency at the entity edge with the RGB image.

Unsupervised Cross-Modal Distillation for Thermal Infrared Tracking

  • Jingxian Sun
  • Lichao Zhang
  • Yufei Zha
  • Abel Gonzalez-Garcia
  • Peng Zhang
  • Wei Huang
  • Yanning Zhang

The target representation learned by convolutional neural networks plays an important
role in Thermal Infrared (TIR) tracking. Currently, most of the top-performing TIR
trackers are still employing representations learned by the model trained on the RGB
data. However, this representation does not take into account the information in the
TIR modality itself, limiting the performance of TIR tracking.

To solve this problem, we propose to distill representations of the TIR modality from
the RGB modality with Cross-Modal Distillation (CMD) on a large amount of unlabeled
paired RGB-TIR data. We take advantage of the two-branch architecture of the baseline
tracker, i.e. DiMP, for cross-modal distillation working on two components of the
tracker. Specifically, we use one branch as a teacher module to distill the representation
learned by the model into the other branch. Benefiting from the powerful model in
the RGB modality, the cross-modal distillation can learn the TIR-specific representation
for promoting TIR tracking. The proposed approach can be incorporated into different
baseline trackers conveniently as a generic and independent component. Furthermore,
the semantic coherence of paired RGB and TIR images is utilized as a supervised signal
in the distillation loss for cross-modal knowledge transfer. In practice, three different
approaches are explored to generate paired RGB-TIR patches with the same semantics
for training in an unsupervised way. It is easy to extend to an even larger scale
of unlabeled training data. Extensive experiments on the LSOTB-TIR dataset and PTB-TIR
dataset demonstrate that our proposed cross-modal distillation method effectively
learns TIR-specific target representations transferred from the RGB modality. Our
tracker outperforms the baseline tracker by achieving absolute gains of 2.3% Success,
2.7% Precision, and 2.5% Normalized Precision respectively. Code and models are available

ABPNet: Adaptive Background Modeling for Generalized Few Shot Segmentation

  • Kaiqi Dong
  • Wei Yang
  • Zhenbo Xu
  • Liusheng Huang
  • Zhidong Yu

Existing Few Shot Segmentation (FS-Seg) methods mostly study a restricted setting
where only foreground and background need to be discriminated, and fall short
in discriminating multiple classes. In this paper, we focus on a challenging but more
practical variant: Generalized Few Shot Segmentation (GFS-Seg), where all SEEN and
UNSEEN classes are segmented simultaneously. Previous methods treat the background
as a regular class, leading to difficulty in differentiating UNSEEN classes from it
at the test stage. To address this issue, we propose Adaptive Background Modeling
and Prototype Query Network (ABPNet), in which the background is formulated as the
complement of the set of classes of interest. With the help of the attention mechanism
and a novel meta-training strategy, it learns an effective set difference function
that predicts task-specific background adaptively. Furthermore, we design a Prototype
Querying (PQ) module that effectively transfers the learned knowledge to UNSEEN classes
with a neural dictionary. Experimental results demonstrate that ABPNet significantly
outperforms the state-of-the-art method CAPL on PASCAL-5i and COCO-20i, especially
on UNSEEN classes. Also, without retraining, ABPNet can generalize well to FS-Seg.

Towards Reasoning Ability in Scene Text Visual Question Answering

  • Qingqing Wang
  • Liqiang Xiao
  • Yue Lu
  • Yaohui Jin
  • Hao He

Works on scene text visual question answering (TextVQA) always emphasize the importance
of reasoning over questions and image contents. However, we find that current TextVQA models
lack reasoning ability and tend to answer questions by exploiting dataset bias and
language priors. Moreover, our observations indicate that recent accuracy improvement
in TextVQA is mainly contributed by stronger OCR engines, better pre-training strategies
and more Transformer layers, instead of newly proposed networks. In this work, towards
the reasoning ability, we 1) conduct module-wise contribution analysis to quantitatively
investigate how existing works improve accuracies in TextVQA; 2) design a gradient-based
explainability method to explore why TextVQA models answer what they answer and find
evidence for their predictions; 3) perform qualitative experiments to visually analyze
models' reasoning ability and explore potential reasons behind such a poor ability.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm

  • Jianxin Sun
  • Qi Li
  • Weining Wang
  • Jian Zhao
  • Zhenan Sun

Text-to-Face synthesis with multiple captions is still an important yet less addressed
problem because of the lack of effective algorithms and large-scale datasets. We accordingly
propose a Semantic Embedding and Attention (SEA-T2F) network that allows multiple
captions as input to generate highly semantically related face images. With a novel
Sentence Features Injection Module, SEA-T2F can integrate any number of captions into
the network. In addition, an attention mechanism named Attention for Multiple Captions
is proposed to fuse multiple word features and synthesize fine-grained details. Considering
text-to-face generation is an ill-posed problem, we also introduce an attribute loss
to guide the network to generate sentence-related attributes. Existing datasets for
text-to-face are either too small or roughly generated according to attribute labels,
which is not enough to train deep learning based methods to synthesize natural face
images. Therefore, we build a large-scale dataset named CelebAText-HQ, in which each
image is manually annotated with 10 captions. Extensive experiments demonstrate the
effectiveness of our algorithm.

Multimodal Compatibility Modeling via Exploring the Consistent and Complementary Correlations

  • Weili Guan
  • Haokun Wen
  • Xuemeng Song
  • Chung-Hsing Yeh
  • Xiaojun Chang
  • Liqiang Nie

Existing methods towards outfit compatibility modeling seldom explicitly consider
multimodal correlations. In this work, we explore the consistent and complementary
correlations for better compatibility modeling. This is, however, non-trivial due
to the following challenges: 1) how to separate and model these two kinds of correlations;
2) how to leverage the derived complementary cues to strengthen the text and vision-oriented
representations of the given item; and 3) how to reinforce the compatibility modeling
with text and vision-oriented representations. To address these challenges, we present
a comprehensive multimodal outfit compatibility modeling scheme. It first nonlinearly
projects each modality into separable consistent and complementary spaces via multi-layer
perceptron, and then models the consistent and complementary correlations between
two modalities by parallel and orthogonal regularization. Thereafter, we strengthen
the visual and textual representation of items with complementary information, and
further conduct both text-oriented and vision-oriented outfit compatibility modeling.
We ultimately employ the mutual learning strategy to reinforce the final performance
of compatibility modeling. Extensive experiments demonstrate the superiority of our

CDD: Multi-view Subspace Clustering via Cross-view Diversity Detection

  • Shudong Huang
  • Ivor W. Tsang
  • Zenglin Xu
  • Jiancheng Lv
  • Quanhui Liu

The goal of multi-view subspace clustering is to explore a common latent space in which
the multi-view data points lie. Myriad subspace learning algorithms have
been investigated to boost the performance of multi-view clustering, but they seldom exploit
multi-view consistency and multi-view diversity, let alone take both into
consideration simultaneously. To do so, we propose a novel multi-view subspace clustering
method via cross-view diversity detection (CDD). CDD integrates these two complementary
criteria seamlessly into a holistic clustering design. With the consistent
part and diverse part being detected, a pure graph can be derived for each view. The
consistent pure parts of different views are further fused into a consensus structured
graph with exactly k connected components where k is the number of clusters. Thus
we can directly obtain the final clustering result without any postprocessing as each
connected component precisely corresponds to an individual cluster. We model the above
concerns into a unified optimization framework. Our empirical studies validate that
the proposed model outperforms several other state-of-the-art methods.
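
As an illustration of why a consensus graph with exactly k connected components needs no postprocessing, the following minimal sketch reads cluster labels directly off the components of a toy graph. The function name `connected_components` and the toy weights are assumptions made for this example; it does not reproduce the CDD optimization itself.

```python
from collections import deque

def connected_components(adjacency):
    """Label each node with the index of its connected component (BFS)."""
    n = len(adjacency)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = current
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adjacency[u][v] > 0 and labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels, current

# Hypothetical consensus graph with two blocks: nodes {0, 1, 2} and {3, 4}.
consensus = [
    [0.0, 0.8, 0.5, 0.0, 0.0],
    [0.8, 0.0, 0.6, 0.0, 0.0],
    [0.5, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
labels, k = connected_components(consensus)
```

Each component index serves as a cluster label, so no k-means or spectral step is applied afterwards.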

Learning Spatio-temporal Representation by Channel Aliasing Video Perception

  • Yiqi Lin
  • Jinpeng Wang
  • Manlin Zhang
  • Andy J. Ma

In this paper, we propose a novel pretext task, namely Channel Aliasing Video Perception
(CAVP), for self-supervised video representation learning. The main idea of our approach
is to generate channel aliasing videos, which carry different motion cues simultaneously
by assembling distinct channels from different videos. With the generated channel
aliasing videos, we propose to recognize the number of different motion flows within
a channel aliasing video for perception of discriminative motion cues. As a plug-and-play
method, the proposed pretext task can be integrated into a co-training framework with
other self-supervised learning methods to further improve the performance. Experimental
results on publicly available action recognition benchmarks verify the effectiveness
of our method for spatio-temporal representation learning.
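
The data-generation step can be sketched as follows. This is a minimal illustration under assumptions: clips are nested lists indexed as `clip[t][y][x][c]`, each RGB channel is drawn from a randomly chosen source clip, and the pretext label is the number of distinct source clips in the mix. The helper names are hypothetical and the authors' sampling scheme may differ.

```python
import random

def constant_clip(value, frames=2, height=2, width=2):
    """Toy clip whose every pixel and channel equals `value`."""
    return [[[[float(value)] * 3 for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]

def make_channel_aliasing_video(videos, rng):
    """Assemble a clip whose three channels come from randomly chosen sources.

    Returns the mixed clip, the pretext label (number of distinct source
    clips, i.e. distinct motion flows in the mix), and the chosen sources.
    """
    sources = [rng.randrange(len(videos)) for _ in range(3)]
    frames = len(videos[0])
    height = len(videos[0][0])
    width = len(videos[0][0][0])
    mixed = [[[[videos[sources[c]][t][y][x][c] for c in range(3)]
               for x in range(width)]
              for y in range(height)]
             for t in range(frames)]
    label = len(set(sources))
    return mixed, label, sources

rng = random.Random(0)
videos = [constant_clip(i) for i in range(4)]
mixed, label, sources = make_channel_aliasing_video(videos, rng)
```

A classifier trained on `mixed` to predict `label` would have to separate the motion cues carried by each channel.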

Efficient Sparse Attacks on Videos using Reinforcement Learning

  • Huanqian Yan
  • Xingxing Wei

More and more deep neural network models have been deployed in real-time video systems.
However, it has been shown that deep models are susceptible to crafted adversarial
examples. These adversarial examples are imperceptible, yet they cause otherwise reliable
deep models to misclassify them. Although a few works target adversarial
examples for video recognition in the black-box attack mode, most of them need large
perturbations or hundreds of thousands of queries. There is still a lack of effective
adversarial methods that produce adversarial videos with small perturbations and a limited
number of queries at the same time.

In this paper, an efficient and powerful method is proposed for adversarial video
attacks in the black-box attack mode. Like previous work, the proposed method is based on
Reinforcement Learning (RL), i.e., it uses the RL agent to adaptively find
the sparse key frames to perturb. The key difference is that we design
new reward functions based on the loss reduction and the perturbation increment, and
thus propose an efficient update mechanism that guides the agent to finish the attacks
with smaller perturbations and fewer queries. The proposed algorithm has a new
working mechanism. It is simple, efficient, and effective. Extensive experiments show
our method achieves a good trade-off between the perturbation amplitude and the number of queries.
Compared with the state-of-the-art algorithms, it reduces the number of queries by 65.75%
without image quality loss in un-targeted attacks, and simultaneously reduces perturbations
by 22.47% and the number of queries by 54.77% in targeted attacks.

AdvHash: Set-to-set Targeted Attack on Deep Hashing with One Single Adversarial Patch

  • Shengshan Hu
  • Yechao Zhang
  • Xiaogeng Liu
  • Leo Yu Zhang
  • Minghui Li
  • Hai Jin

In this paper, we propose AdvHash, the first targeted mismatch attack on deep hashing
through an adversarial patch. After being superimposed with the same adversarial patch, any
query image with a chosen label will retrieve a set of irrelevant images with the
target label. Concretely, we first formulate a set-to-set problem, where a set of
samples is pushed into a predefined clustered area in the Hamming space. Then we
obtain a target anchor hash code and transform the attack into a set-to-point optimization.
In order to generate an image-agnostic, stable adversarial patch for a chosen label
more efficiently, we propose a product-based weighted gradient aggregation strategy
to dynamically adjust the gradient directions of the patch, by exploiting the Hamming
distances between training samples and the target anchor hash code and assigning different
weights to discriminatively aggregate gradients. Extensive experiments on benchmark
datasets verify that AdvHash is highly effective at attacking two state-of-the-art
deep hashing schemes. Our code is available at:
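
To make the role of Hamming distances in gradient aggregation concrete, here is a rough sketch. It assumes hash codes are lists of +1/-1 values and, purely for illustration, weights samples by a softmax over their normalized Hamming distance to the anchor so that farther samples count more. The abstract does not specify the actual product-based weighting, so `distance_weights` and `aggregate_gradients` are hypothetical stand-ins, not the AdvHash formulation.

```python
import math

def hamming_distance(code_a, code_b):
    """Number of positions where two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def distance_weights(sample_codes, anchor_code):
    """Softmax over normalized Hamming distances to the anchor.

    Assumption for illustration: samples farther from the anchor get
    larger weights. This is not the weighting defined in the paper.
    """
    bits = len(anchor_code)
    scores = [hamming_distance(code, anchor_code) / bits for code in sample_codes]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_gradients(gradients, weights):
    """Weighted sum of per-sample patch gradients (flat lists of floats)."""
    size = len(gradients[0])
    return [sum(w * g[i] for w, g in zip(weights, gradients)) for i in range(size)]

anchor = [1, -1, 1, 1]                      # hypothetical target anchor code
codes = [[1, -1, 1, 1], [1, 1, 1, -1], [-1, 1, -1, -1]]
weights = distance_weights(codes, anchor)
gradients = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]  # made-up per-sample gradients
patch_gradient = aggregate_gradients(gradients, weights)
```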

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

  • Dailan He
  • Yusheng Zhao
  • Junyu Luo
  • Tianrui Hui
  • Shaofei Huang
  • Aixi Zhang
  • Si Liu

The recently proposed fine-grained 3D visual grounding is an essential and challenging
task, whose goal is to identify the 3D object referred to by a natural language sentence
from other distractive objects of the same category. Existing works usually adopt
dynamic graph networks to indirectly model the intra/inter-modal interactions, making
it difficult for the model to distinguish the referred object from distractors due to the
monolithic representations of visual and linguistic contents. In this work, we exploit
the Transformer for its natural suitability for permutation-invariant 3D point cloud data
and propose a TransRefer3D network to extract entity-and-relation aware multimodal
context among objects for more discriminative feature learning. Concretely, we devise
an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to
conduct fine-grained cross-modal feature matching. Facilitated by co-attention operation,
our EA module matches visual entity features with linguistic entity features while
RA module matches pair-wise visual relation features with linguistic relation features,
respectively. We further integrate EA and RA modules into an Entity-and-Relation aware
Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical
multimodal context modeling. Extensive experiments on both Nr3D and Sr3D datasets
demonstrate that our proposed model significantly outperforms existing approaches
by up to 10.6% and claims the new state-of-the-art performance. To the best of our
knowledge, this is the first work investigating the Transformer architecture for the fine-grained
3D visual grounding task.

Single Image 3D Object Estimation with Primitive Graph Networks

  • Qian He
  • Desen Zhou
  • Bo Wan
  • Xuming He

Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem
in visual scene understanding and yet remains challenging due to its ill-posed nature
and complexity in real-world scenes. To address those challenges, we adopt a primitive-based
representation for 3D objects, and propose a two-stage graph network for primitive-based
3D object estimation, which consists of a sequential proposal module and a graph reasoning
module. Given a 2D image, our proposal module first generates a sequence of 3D primitives
from the input image with local feature attention. Then the graph reasoning module performs
joint reasoning on a primitive graph to capture the global shape context for each
primitive. Such a framework is capable of taking into account rich geometry and semantic
constraints during 3D structure recovery, producing 3D objects with more coherent
structure even under challenging viewing conditions. We train the entire graph neural
network with a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet
and NYU Depth V2. Extensive experiments show that our approach outperforms the previous
state of the art by a considerable margin.

Boosting Mobile CNN Inference through Semantic Memory

  • Yun Li
  • Chen Zhang
  • Shihao Han
  • Li Lyna Zhang
  • Baoqun Yin
  • Yunxin Liu
  • Mengwei Xu

Human brains are known to be capable of speeding up visual recognition of repeatedly
presented objects through faster memory encoding and accessing procedures on activated
neurons. For the first time, we borrow and distill such a capability into a semantic
memory design, namely SMTM, to improve on-device CNN inference. SMTM employs a hierarchical
memory architecture to leverage the long-tail distribution of objects of interest,
and further incorporates several novel techniques to put it into effect: (1) it encodes
high-dimensional feature maps into low-dimensional, semantic vectors for low-cost
yet accurate cache and lookup; (2) it uses a novel metric in determining the exit
timing considering different layers' inherent characteristics; (3) it adaptively adjusts
the cache size and semantic vectors to fit the scene dynamics. SMTM is prototyped
on a commodity CNN engine and runs on both mobile CPU and GPU. Extensive experiments
on large-scale datasets and models show that SMTM can significantly speed up model
inference over the standard approach (up to 2×) and prior cache designs (up to 1.5×),
with acceptable accuracy loss.

Knowing When to Quit: Selective Cascaded Regression with Patch Attention for Real-Time
Face Alignment

  • Gil Shapira
  • Noga Levy
  • Ishay Goldin
  • Roy J. Jevnisek

Facial landmark (FLM) estimation is a critical component in many face-related applications.
In this work, we aim to optimize for both accuracy and speed and explore the trade-off
between them. Our key observation is that not all faces are created equal. Frontal
faces with neutral expressions converge faster than faces with extreme poses or expressions.
To differentiate among samples, we train our model to predict the regression error
after each iteration. If the current iteration is accurate enough, we stop iterating,
saving redundant iterations while keeping the accuracy in check. We also observe that
as neighboring patches overlap, we can infer all facial landmarks (FLMs) with only
a small number of patches without a major accuracy sacrifice. Architecturally, we
offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained
local patch attention module, which computes a patch weighting according to the information
in the patch itself and enhances the expressive power of the patch features. We analyze
the patch attention data to infer where the model is attending when regressing facial
landmarks and compare it to face attention in humans. Our model runs in real-time
on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming
all state-of-the-art methods under 1000 MMA, with a normalized mean error of 8.16
on the 300W challenging dataset. The code is available at
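
The early-stopping idea can be sketched as a generic loop. This is a toy illustration under stated assumptions: `refine` stands in for one cascaded regression step and `predict_error` for the learned error predictor (here an oracle on a one-dimensional "landmark"), and the tolerance is arbitrary. None of these names come from the paper.

```python
def cascaded_regression(initial, refine, predict_error, tolerance, max_iters):
    """Iteratively refine an estimate, stopping once the predicted error is small."""
    estimate = initial
    iterations_used = 0
    for _ in range(max_iters):
        estimate = refine(estimate)
        iterations_used += 1
        if predict_error(estimate) <= tolerance:
            break  # accurate enough: skip the remaining iterations
    return estimate, iterations_used

TARGET = 10.0  # hypothetical ground-truth coordinate for the toy example

def refine(estimate):
    """Toy regression step that halves the remaining residual."""
    return estimate + 0.5 * (TARGET - estimate)

def predict_error(estimate):
    """Oracle stand-in for a learned error predictor."""
    return abs(TARGET - estimate)

# An "easy" sample starts near the target, a "hard" one far from it.
_, easy_iters = cascaded_regression(9.0, refine, predict_error, 0.3, 10)
_, hard_iters = cascaded_regression(0.0, refine, predict_error, 0.3, 10)
```

In this toy setting the easy sample exits after 2 iterations and the hard one after 6, which mirrors the observation that frontal, neutral faces converge faster than extreme poses.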

End-to-end Boundary Exploration for Weakly-supervised Semantic Segmentation

  • Jianjun Chen
  • Shancheng Fang
  • Hongtao Xie
  • Zheng-Jun Zha
  • Yue Hu
  • Jianlong Tan

Acquiring pixel-level object locations with only image-level annotations is highly
challenging for weakly supervised semantic segmentation (WSSS). In particular,
single-stage methods learn image- and pixel-level labels simultaneously to avoid complicated
multi-stage computations and sophisticated training procedures. In this paper, we
argue that using a single model to accomplish both image- and pixel-level classification
forces a trade-off between the two targets and consequently weakens the recognition
capability, because the image-level task tends to learn position-independent features
whereas the pixel-level task tends to be position-sensitive. Hence, we propose an effective
encoder-decoder framework to explore object boundaries and solve the above dilemma.
The encoder and decoder learn position-independent and position-sensitive features
independently during the end-to-end training. In addition, a global soft pooling is
proposed to suppress background pixels' activation for the encoder training and further
improve the class activation map (CAM) performance. The edge annotations for the decoder
training are synthesized from the high-confidence CAMs, which does not require extra
supervision. Extensive experiments on the Pascal VOC12 dataset demonstrate that
our method achieves state-of-the-art performance among end-to-end approaches. It obtains
63.6% and 65.7% mIoU scores on the val and test sets, respectively.

SFE-Net: EEG-based Emotion Recognition with Symmetrical Spatial Feature Extraction

  • Xiangwen Deng
  • Junlin Zhu
  • Shangming Yang

Emotion recognition based on EEG (electroencephalography) has been widely used in
human-computer interaction, distance education and health care. However, the conventional
methods ignore the adjacent and symmetrical characteristics of EEG signals, which
also contain salient information related to emotion. In this paper, a spatial folding
ensemble network (SFE-Net) is presented for EEG feature extraction and emotion recognition.
Firstly, for the undetected area between EEG electrodes, an improved Bicubic-EEG interpolation
algorithm is developed to complete the EEG channel information, which allows us to
extract a wider range of adjacent space features. Then, motivated by the spatial symmetric
mechanism of the human brain, we fold the input EEG channel data with five different
symmetrical strategies, which enables the proposed network to extract the
spatial features of EEG signals more effectively. Finally, 3DCNN-based spatial and
temporal feature extraction and a multi-voting ensemble learning strategy are integrated
into a new neural network. With this network, the spatial features of different
symmetric folding signals can be extracted simultaneously, which greatly improves
the robustness and accuracy of emotion recognition. The experimental results on DEAP
and SEED datasets show that the proposed algorithm has comparable performance in terms
of recognition accuracy.

Bridging the Gap between Low-Light Scenes: Bilevel Learning for Fast Adaptation

  • Dian Jin
  • Long Ma
  • Risheng Liu
  • Xin Fan

Brightening low-light images of diverse scenes is a challenging but widely studied
task in the multimedia community. Convolutional Neural Network (CNN) based approaches
mostly acquire the enhancement model by learning the data distribution of specific
scenes. However, these works adapt poorly (or even fail) when meeting real-world
scenarios that were never encountered before. To address this, we develop a novel bilevel
learning scheme for fast adaptation to bridge the gap between low-light scenes. Concretely,
we construct a Retinex-induced encoder-decoder with an adaptive denoising mechanism,
aiming at covering more practical cases. Different from existing works that directly
learn model parameters from massive data, we provide a new hyperparameter
optimization perspective to formulate a bilevel learning scheme towards general low-light
scenarios. This scheme depicts the latent correspondence (i.e., scene-irrelevant encoder)
and the respective characteristic (i.e., scene-specific decoder) among different data
distributions. Because the inner optimization is expensive, estimating the hyperparameter
gradient exactly can be prohibitive, so we develop an approximate hyperparameter gradient
method by introducing the one-step forward approximation and finite difference approximation
to ensure highly efficient inference. Extensive experiments are conducted to demonstrate
our superiority over other state-of-the-art methods. A series of analytical experiments
are also executed to verify our effectiveness.

Handling Difficult Labels for Multi-label Image Classification via Uncertainty Distillation

  • Liangchen Song
  • Jialian Wu
  • Ming Yang
  • Qian Zhang
  • Yuan Li
  • Junsong Yuan

Multi-label image classification aims to predict multiple labels for a single image.
However, the difficulties of predicting different labels may vary dramatically due
to semantic variations of the label as well as the image context. Direct learning
of multi-label classification models risks being biased toward and overfitting
those difficult labels; e.g., deep network based classifiers are over-trained on the
difficult labels and therefore produce false-positive errors on those labels
during testing. To handle difficult labels in multi-label image classification, we
propose to calibrate the model, which not only predicts the labels but also estimates
the uncertainty of the prediction. With the new calibration branch of the network,
the classification model is trained with the pick-all-labels normalized loss and optimized
pertaining to the number of positive labels. Moreover, to improve performance on difficult
labels, instead of annotating them, we leverage the calibrated model as the teacher
network and teach the student network about handling difficult labels via uncertainty
distillation. Our proposed uncertainty distillation teaches the student network which
labels are highly uncertain through prediction distribution distillation, and locates
the image regions that cause such uncertain predictions through uncertainty attention
distillation. Through extensive evaluations on benchmark datasets, we demonstrate
that our proposed uncertainty distillation is valuable for handling difficult labels
in multi-label image classification.

Perception-Oriented Stereo Image Super-Resolution

  • Chenxi Ma
  • Bo Yan
  • Weimin Tan
  • Xuhao Jiang

Recent studies of deep learning based stereo image super-resolution (StereoSR) have
promoted the development of StereoSR. However, existing StereoSR models mainly concentrate
on improving quantitative evaluation metrics and neglect the visual quality of super-resolved
stereo images. To improve the perceptual performance, this paper proposes the first
perception-oriented stereo image super-resolution approach by exploiting feedback
from the evaluation of the perceptual quality of StereoSR results. To provide
accurate guidance for the StereoSR model, we develop the first dedicated stereo image
super-resolution quality assessment (StereoSRQA) model, and further construct a StereoSRQA
database. Extensive experiments demonstrate that our StereoSR approach significantly
improves the perceptual quality and enhances the reliability of stereo images for
disparity estimation.

ReLLIE: Deep Reinforcement Learning for Customized Low-Light Image Enhancement

  • Rongkai Zhang
  • Lanqing Guo
  • Siyu Huang
  • Bihan Wen

Low-light image enhancement (LLIE) is a pervasive yet challenging problem, since:
1) low-light measurements may vary due to different imaging conditions in practice;
2) images can be enlightened subjectively according to the diverse preferences of each
individual. To tackle these two challenges, this paper presents a novel deep reinforcement
learning based method, dubbed ReLLIE, for customized low-light enhancement. ReLLIE
models LLIE as a Markov decision process, i.e., estimating the pixel-wise image-specific
curves sequentially and recurrently. Given the reward computed from a set of carefully
crafted non-reference loss functions, a lightweight network is proposed to estimate
the curves for enlightening of a low-light image input. As ReLLIE learns a policy
instead of a one-to-one image translation, it can handle various low-light measurements
and provide customized enhanced outputs by flexibly applying the policy a different number
of times. Furthermore, ReLLIE can enhance real-world images with hybrid corruptions,
e.g., noise, by easily using a plug-and-play denoiser. Extensive experiments on various
benchmarks demonstrate the advantages of ReLLIE compared to the state-of-the-art
methods. (Code is available:
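
The effect of applying a curve-estimation policy a different number of times can be sketched as below. Two assumptions are made for illustration: the pixel-wise curve takes the quadratic form x + alpha * x * (1 - x) known from earlier curve-estimation work (the abstract does not state the curve), and `fixed_policy` is a hypothetical stand-in for the learned network that simply returns the same alpha for every pixel.

```python
def apply_curve(pixel, alpha):
    """Quadratic enhancement curve; keeps values in [0, 1] for alpha in [-1, 1]."""
    return pixel + alpha * pixel * (1.0 - pixel)

def enhance(image, policy, steps):
    """Apply the policy's pixel-wise curves sequentially for `steps` iterations."""
    for _ in range(steps):
        alphas = policy(image)  # one curve parameter per pixel
        image = [[apply_curve(p, a) for p, a in zip(row, alpha_row)]
                 for row, alpha_row in zip(image, alphas)]
    return image

def fixed_policy(image):
    """Hypothetical policy: the same brightening parameter everywhere."""
    return [[0.5 for _ in row] for row in image]

def mean(image):
    values = [p for row in image for p in row]
    return sum(values) / len(values)

low_light = [[0.1, 0.2], [0.4, 0.0]]
mild = enhance(low_light, fixed_policy, steps=1)
strong = enhance(low_light, fixed_policy, steps=4)
```

Running more steps yields a brighter result from the same policy, which is one way a user preference could be expressed without retraining.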

Intrinsic Temporal Regularization for High-resolution Human Video Synthesis

  • Lingbo Yang
  • Zhanning Gao
  • Siwei Ma
  • Wen Gao

Fashion video synthesis has attracted increasing attention due to its huge potential
in immersive media, virtual reality and online retail applications, yet traditional
3D graphic pipelines often require extensive manual labor on data capture and model
rigging. In this paper, we investigate an image-based approach to this problem that
generates a fashion video clip from a still source image of the desired outfit, which
is then rigged in a framewise fashion under the guidance of a driving video. A key
challenge for this task lies in the modeling of feature transformation across source
and driving frames, where fine-grained transform helps promote visual details at garment
regions, but often at the expense of intensified temporal flickering. To resolve this
dilemma, we propose a novel framework with 1) a multi-scale transform estimation and
feature fusion module to preserve fine-grained garment details, and 2) an intrinsic
regularization loss to enforce temporal consistency of learned transform between adjacent
frames. Our solution is capable of generating 512×512 fashion videos with rich
garment details and smooth fabric movements beyond existing results. Extensive experiments
over the FashionVideo benchmark dataset have demonstrated the superiority of the proposed
framework over several competitive baselines.

A2W: Context-Aware Recommendation System for Mobile Augmented Reality Web Browser

  • Kit Yung Lam
  • Lik Hang Lee
  • Pan Hui

Augmented Reality (AR) offers new capabilities for blurring the boundaries between
physical reality and digital media. However, the integration of web content
and AR remains underexplored. This paper presents an AR web browser with an integrated
context-aware AR-to-Web content recommendation service, named the A2W browser, to provide
continuous user-centric web browsing experiences driven by AR headsets. We implement
the A2W browser on an AR headset as our demonstration application, showing the
features and performance of the A2W framework. The A2W browser visualizes AR-driven
web content to the user, as suggested by the content-based filtering model
in our recommendation system. In our experiments, 20 participants using the adaptive
UIs and recommendation system in the A2W browser achieve up to 30.69% time savings compared
to smartphone conditions. Accordingly, A2W-supported web browsing on workstations
surfaces recommended information that leads users to the target
information 41.67% faster than typical web browsing.

Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets
Adversarial Training

  • Changchong Sheng
  • Matti Pietikäinen
  • Qi Tian
  • Li Liu

The goal of this work is to learn discriminative visual representations for lip reading
without access to manual text annotation. Recent advances in cross-modal self-supervised
learning have shown that the corresponding audio can serve as a supervisory signal
to learn effective visual representations for lip reading. However, existing methods
only exploit the natural synchronization of the video and the corresponding audio.
We find that both video and audio are actually composed of speech-related information,
identity-related information, and modality information. To make the visual representations
(i) more discriminative for lip reading and (ii) indiscriminate with respect to
identities and modalities, we propose a novel self-supervised learning framework called
Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous
methods by explicitly forcing the visual representations to be disentangled from speech-unrelated
information. Experimental results clearly show that the proposed method outperforms
state-of-the-art cross-modal self-supervised baselines by a large margin. Besides,
ADC-SSL can outperform its supervised counterpart without any fine-tuning.

OsGG-Net: One-step Graph Generation Network for Unbiased Head Pose Estimation

  • Shentong Mo
  • Xin Miao

Head pose estimation is a crucial problem that involves the prediction of the Euler
angles of a human head in an image. Previous approaches predict head poses through
landmark detection, which can be applied to multiple downstream tasks. However, previous
landmark-based methods cannot match the performance of current landmark-free
methods because they do not model the complex nonlinear relationships between the geometric
distribution of landmarks and head poses. Another reason for the performance bottleneck
is that the underlying distribution of the 3D pose angles in current
head pose benchmarks is biased. In this work, we propose OsGG-Net, a One-step Graph Generation
Network for estimating head poses from a single image by generating a landmark-connection
graph to model the 3D angle associated with the landmark distribution robustly. To
further alleviate the angle bias caused by the biased data distribution in learning
the graph structure, we propose the UnBiased Head Pose Dataset, called UBHPD, and
a new unbiased metric, namely UBMAE, for unbiased head pose estimation. We conduct
extensive experiments on various benchmarks and UBHPD, where our method achieves
state-of-the-art results in terms of the commonly-used MAE metric and our proposed
UBMAE. Comprehensive ablation studies also demonstrate the effectiveness of each part
in our approach.

Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

  • Xirong Li
  • Yang Zhou
  • Jie Wang
  • Hailan Lin
  • Jianchun Zhao
  • Dayong Ding
  • Weihong Yu
  • Youxin Chen

This paper attacks an emerging challenge of multi-modal retinal disease recognition.
Given a multi-modal case consisting of a color fundus photo (CFP) and an array of
OCT B-scan images acquired during an eye examination, we aim to build a deep neural
network that recognizes multiple vision-threatening diseases for the given case. As
the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability
to be both selective and interpretable is important. Moreover, as both data acquisition
and manual labeling are extremely expensive in the medical domain, the network has
to be relatively lightweight for learning from a limited set of labeled multi-modal
samples. Prior art on retinal disease recognition focuses either on a single disease
or on a single modality, leaving multi-modal fusion largely underexplored. We propose
in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing
CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head
attention modules) makes it suited for learning from relatively small-sized datasets.
For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by
oversampling a given CFP. The benefits of this tactic include balancing instances
across modalities, increasing the resolution of the CFP input, and identifying the regions
of the CFP most relevant to the final diagnosis. Extensive experiments
on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836
subjects demonstrate the viability of the proposed model.

Locally Adaptive Structure and Texture Similarity for Image Quality Assessment

  • Keyan Ding
  • Yi Liu
  • Xueyi Zou
  • Shiqi Wang
  • Kede Ma

The latest advances in full-reference image quality assessment (IQA) involve unifying
structure and texture similarity based on deep representations. The resulting Deep
Image Structure and Texture Similarity (DISTS) metric, however, makes rather global
quality measurements, ignoring the fact that natural photographic images are locally
structured and textured across space and scale. In this paper, we describe a locally
adaptive structure and texture similarity index for full-reference IQA, which we term
A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion
index, to localize texture regions at different scales. The estimated probability
(of one patch being texture) is in turn used to adaptively pool local structure and
texture measurements. The resulting A-DISTS is adapted to local image content, and
is free of expensive human perceptual scores for supervised training. We demonstrate
the advantages of A-DISTS in terms of correlation with human data on ten IQA databases
and optimization of single image super-resolution methods.
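
For reference, the dispersion index is the variance-to-mean ratio of a set of non-negative values. The sketch below computes it per non-overlapping patch of a toy 2-D array. It is only an illustration of the statistic: the helper names are hypothetical, and the abstract does not say how the index is mapped to a texture probability or which feature maps it is computed on.

```python
def dispersion_index(values):
    """Variance-to-mean ratio of a list of non-negative values."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0  # convention chosen here for an all-zero patch
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / mean

def patch_dispersion(image, patch):
    """Dispersion index of each non-overlapping patch x patch block."""
    rows, cols = len(image), len(image[0])
    result = []
    for top in range(0, rows - patch + 1, patch):
        row_result = []
        for left in range(0, cols - patch + 1, patch):
            block = [image[y][x]
                     for y in range(top, top + patch)
                     for x in range(left, left + patch)]
            row_result.append(dispersion_index(block))
        result.append(row_result)
    return result

# Toy array: a flat 2x2 block on the left, a checkerboard block on the right.
image = [
    [4, 4, 0, 8],
    [4, 4, 8, 0],
]
dispersion_map = patch_dispersion(image, 2)
```

The flat block yields 0.0 and the checkerboard block yields 4.0, so the statistic separates uniform regions from strongly varying ones at the chosen patch scale.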

CALLip: Lipreading using Contrastive and Attribute Learning

  • Yiyang Huang
  • Xuefeng Liang
  • Chaowei Fang

Lipreading, aiming at interpreting speech by watching the lip movements of the speaker,
has great significance in human communication and speech understanding. Despite having
reached a feasible performance, lipreading still faces two crucial challenges: 1)
the considerable lip movement variations across different speakers when they utter the
same words; 2) the similar lip movements of people when they utter easily confused phonemes.
To tackle these two problems, we propose a novel lipreading framework, CALLip, which
employs attribute learning and contrastive learning. The attribute learning extracts
the speaker identity-aware features through a speaker recognition branch, which are
able to normalize the lip shapes to eliminate cross-speaker variations. Considering
that audio signals are intrinsically more distinguishable than visual signals, the
contrastive learning is devised between visual and audio signals to enhance the discrimination
of visual features and alleviate the viseme confusion problem. Experimental results
show that CALLip does learn better features of lip movements. The comparisons on both
English and Chinese benchmark datasets, GRID and CMLR, demonstrate that CALLip outperforms
six state-of-the-art lipreading methods without using any additional data.
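
A contrastive objective between visual and audio features can be sketched with an InfoNCE-style loss. The abstract does not name the exact loss, so this form, the temperature value, and the function names are assumptions for illustration. Matching visual/audio pairs share an index and all other audio features in the batch act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(visual, audio, temperature=0.1):
    """Mean InfoNCE loss with (visual[i], audio[i]) as the positive pairs."""
    total = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, a) / temperature for a in audio]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(visual)

visual = [[1.0, 0.0], [0.0, 1.0]]       # toy visual features
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(visual, audio_aligned)
loss_shuffled = info_nce(visual, audio_shuffled)
```

The loss is small when each visual feature is closest to its own audio feature and large when pairs are mismatched, which is the pressure that pulls visually similar but acoustically distinct units apart.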

Cross-Modal Recipe Embeddings by Disentangling Recipe Contents and Dish Styles

  • Yu Sugiyama
  • Keiji Yanai

Nowadays, cooking recipe sharing sites on the Web are widely used, and play a major
role in everyday home cooking. Since cooking recipes consist of dish photos and recipe
texts, cross-modal recipe search is being actively explored. To enable cross-modal
search, both food image features and cooking text recipe features are generally embedded into
a shared space. However, in most of the existing studies, a one-to-one
correspondence between a recipe text and a dish image in the embedding space is assumed,
although an unlimited number of photos with different serving styles and different
plates can be associated with the same recipe. In this paper, we propose RDE-GAN
(Recipe Disentangled Embedding GAN), which separates food image information into a
recipe image feature and a non-recipe shape feature. In addition, we generate a food
image by integrating both the recipe embedding and a shape feature. Since the proposed
embedding is free from serving and plate styles which are unrelated to cooking recipes,
the experimental results showed that it outperformed the existing methods on cross-modal
recipe search. We also confirmed that the shape and recipe elements can each be
changed independently at the time of food image generation.

TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting

  • Yu Zhou
  • Hongtao Xie
  • Shancheng Fang
  • Jing Wang
  • Zhengjun Zha
  • Yongdong Zhang

Recent scene text spotters that integrate text detection and recognition modules
have made significant progress. However, existing methods encounter two problems:
1) the data imbalance between the text detection module and the text recognition module
limits the performance of text spotters; 2) the default left-to-right reading direction
leads to errors in unconventional text spotting. In this paper, we propose a novel
scene text spotter TDI to solve these problems. Firstly, in order to solve the data
imbalance problem, a sample generation algorithm is proposed to generate plenty of
samples online for training the text recognition module by using character features
and character labels. Secondly, a weakly supervised character generation algorithm
is designed to generate character-level labels from word-level labels for the sample
generation algorithm and the training of the text detection module. Finally, in order
to spot arbitrarily arranged text correctly, a direction perception module is proposed
to perceive the reading direction of each text instance. Experiments on several benchmarks
show that these designs can significantly improve the performance of text spotters.
Specifically, our method outperforms state-of-the-art methods on three public datasets
in both text detection and end-to-end text recognition, which demonstrates the effectiveness
and robustness of our method.

Position-Augmented Transformers with Entity-Aligned Mesh for TextVQA

  • Xuanyu Zhang
  • Qing Yang

In addition to visual components, many images usually contain valuable text information,
which is essential for understanding the scene. Thus, we study the TextVQA task that
requires reading texts in images to answer corresponding questions. However, most
previous works use sophisticated graph structures and manually crafted features
to model the positional relationship between visual entities and texts in images. Moreover,
traditional multimodal transformers cannot effectively capture relative position information
and original image features. To address these issues in an intuitive but effective
way, we propose a novel model, position-augmented transformers with entity-aligned
mesh, for the TextVQA task. Unlike the traditional attention mechanism in transformers,
we explicitly introduce continuous relative position information of objects and OCR
tokens without complex rules. Furthermore, we replace the complicated graph structure
with an intuitive entity-aligned mesh according to perspective mapping. In this mesh,
the information of discrete entities and image patches at different positions can
interact with each other. Extensive experiments on two benchmark datasets (TextVQA
and ST-VQA) show that our proposed model is superior to several state-of-the-art methods.

Learning Contextual Transformer Network for Image Inpainting

  • Ye Deng
  • Siqi Hui
  • Sanping Zhou
  • Deyu Meng
  • Jinjun Wang

Fully Convolutional Networks with attention modules have been proven effective for
learning-based image inpainting. While many existing approaches could produce visually
reasonable results, the generated images often show blurry textures or distorted structures
around corrupted areas. The main reason is that convolutional neural
networks have limited capacity for modeling contextual information with long range
dependencies. Although the attention mechanism can alleviate this problem to some
extent, existing attention modules tend to emphasize similarities between the corrupted
and the uncorrupted regions while ignoring the dependencies from within each of them.
Hence, this paper proposes the Contextual Transformer Network (CTN) which not only
learns relationships between the corrupted and the uncorrupted regions but also exploits
their respective internal closeness. Besides, instead of a fully convolutional network,
in our CTN, we stack several transformer blocks to replace convolution layers to better
model the long range dependencies. Finally, by dividing the image into patches of
different sizes, we propose a multi-scale multi-head attention module to better model
the affinity among various image regions. Experiments on several benchmark datasets
demonstrate superior performance by our proposed approach.

Milliseconds Color Stippling

  • Lei Ma
  • Jian Shi
  • Yanyun Chen

Stippling is a popular and fascinating sketching art in stylized illustrations. Various
digital stippling techniques have been proposed to reduce tedious manual work. In
this paper, we present a novel method to create high-quality color stippling from
an input image in milliseconds. The key idea is to obtain stipples from predetermined
incremental 2D sample sequences, generated by algorithms that provide sequential incrementality
and distributional uniformity. Two typical sequences are employed in our
work: one is constructed from incremental Voronoi sets, and the other is from Poisson
disk distributions. A threshold-based algorithm is then applied to determine stipple
appearance and guarantee result quality. We extend color stippling with multitone
level and radius adjustment to achieve improved visual quality. Detailed comparisons
of the two sequences are conducted to explore further the strengths and weaknesses
of the proposed method. For more information, please visit
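
As a rough illustration of thresholding over a predetermined incremental sequence, the sketch below keeps the i-th sample only when the local tone exceeds the sample's normalized rank. This rule is an assumption about how such a threshold could work, not the paper's algorithm, and a Halton sequence is swapped in for the incremental Voronoi and Poisson disk sequences named in the abstract. The names `halton`, `stipple`, and `ramp` are hypothetical.

```python
def halton(index, base):
    """Radical inverse of a 1-based index in the given base (a value in [0, 1))."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += fraction * (index % base)
        index //= base
        fraction /= base
    return result


def stipple(darkness, num_samples):
    """Keep sample i when the local darkness exceeds its normalized rank i / num_samples.

    darkness(x, y) is assumed to return a tone in [0, 1], where 1 is fully dark.
    The sample positions are fixed in advance, so only the comparison depends
    on the input image.
    """
    kept = []
    for i in range(num_samples):
        x, y = halton(i + 1, 2), halton(i + 1, 3)
        threshold = i / num_samples
        if darkness(x, y) > threshold:
            kept.append((x, y))
    return kept


def ramp(x, y):
    """Hypothetical test tone: white at the left edge, dark at the right edge."""
    return x


points = stipple(ramp, 2000)
left = sum(1 for x, _ in points if x < 0.5)
right = len(points) - left
```

Because the sample positions are precomputed, the per-image work is one tone lookup and one comparison per sample. On the ramp above, the darker right half retains more stipples than the left half.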

AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection

  • Longyao Liu
  • Bo Ma
  • Yulin Zhang
  • Xin Yi
  • Haozhi Li

Few-shot object detection (FSOD) aims at learning a detector that can quickly adapt to
previously unseen objects with scarce annotated examples. Existing methods solve this
problem by performing the subtasks of classification and localization with a shared
component in the detector, yet few of them take the two subtasks' distinct preferences
for feature embedding into consideration. In this paper, we carefully analyze
the characteristics of FSOD and argue that a few-shot detector should explicitly
decompose the two subtasks while leveraging information from
both of them to enhance feature representations. To this end, we propose a simple yet
effective Adaptive Fully-Dual Network (AFD-Net). Specifically, we extend Faster R-CNN
by introducing Dual Query Encoder and Dual Attention Generator for separate feature
extraction, and Dual Aggregator for separate model reweighting. In this way, separate
state estimation is achieved by the R-CNN detector. Furthermore, we introduce Adaptive
Fusion Mechanism to guide the design of encoders for efficient feature fusion in the
specific subtask. Extensive experiments on PASCAL VOC and MS COCO show that our approach
achieves state-of-the-art performance by a large margin, demonstrating its effectiveness
and generalization ability.

Missing Data Imputation for Solar Yield Prediction using Temporal Multi-Modal Variational Auto-Encoder

  • Meng Shen
  • Huaizheng Zhang
  • Yixin Cao
  • Fan Yang
  • Yonggang Wen

The accurate and robust prediction of short-term solar power generation is significant
for the management of modern smart grids, where solar power has become a major energy
source due to its green and economical nature. However, the solar yield prediction
can be difficult to conduct in the real world where hardware and network issues can
make the sensors unreachable. This missing-data problem is so prevalent that it degrades
the performance of deployed prediction models and can even cause model execution to fail.
In this paper, we propose a novel temporal multi-modal variational auto-encoder (TMMVAE)
model, to enhance the robustness of short-term solar power yield prediction with missing
data. It can impute the missing values in time-series sensor data, and reconstruct
them by consolidating multi-modality data, which then facilitates more accurate solar
power yield prediction. TMMVAE can be deployed efficiently with an end-to-end framework.
The framework is verified at our real-world testbed on campus. The results of extensive
experiments show that our proposed framework can significantly improve the imputation
accuracy when the inference data is severely corrupted, and can hence dramatically
improve the robustness of short-term solar energy yield forecasting.

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

  • Chenyi Lei
  • Shixian Luo
  • Yong Liu
  • Wanggui He
  • Jiamang Wang
  • Guoxin Wang
  • Haihong Tang
  • Chunyan Miao
  • Houqiang Li

Pre-trained neural models have recently achieved impressive performance in understanding
multimodal content. However, it is still very challenging to pre-train neural models
for video and language understanding, especially for Chinese video-language data,
due to the following reasons. Firstly, existing video-language pre-training algorithms
mainly focus on the co-occurrence of words and video frames, but ignore other valuable
semantic and structural information of video-language content, e.g., sequential order
and spatiotemporal relationships. Secondly, there exist conflicts between video-sentence
alignment and other proxy tasks. Thirdly, there is a lack of large-scale and high-quality
Chinese video-language datasets (e.g., including 10 million unique videos), which are
the fundamental success conditions for pre-training techniques. In this work, we propose
a novel video-language understanding framework named Victor, which stands for VIdeo-language
understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks
such as masked language modeling, Victor constructs several novel proxy tasks under
the contrastive learning paradigm, making the model more robust and able to capture
more complex multimodal semantic and structural relationships from different perspectives.
Victor is trained on a large-scale Chinese video-language dataset, including over
10 million complete videos with corresponding high-quality textual descriptions. We
apply the pre-trained Victor model to a series of downstream applications and demonstrate
its superior performance compared with state-of-the-art pre-training methods
such as VideoBERT and UniVL.

DehazeFlow: Multi-scale Conditional Flow Network for Single Image Dehazing

  • Hongyu Li
  • Jia Li
  • Dong Zhao
  • Long Xu

Single image dehazing is a crucial preliminary task for many computer vision applications
and has made progress with deep learning. The dehazing task is an ill-posed problem since
the haze in the image leads to the loss of information. Thus, there are multiple feasible
solutions for image restoration of a hazy image. Most existing methods learn a deterministic
one-to-one mapping between a hazy image and its ground-truth, which ignores the ill-posedness
of the dehazing task. To solve this problem, we propose DehazeFlow, a novel single
image dehazing framework based on conditional normalizing flow. Our method learns
the conditional distribution of haze-free images given a hazy image, enabling the
model to sample multiple dehazed results. Furthermore, we propose an attention-based
coupling layer to enhance the expression ability of a single flow step, which converts
natural images into latent space and fuses features of paired data. These designs
enable our model to achieve state-of-the-art performance while considering the ill-posedness
of the task. We carry out experiments on both synthetic datasets and real-world
hazy images to illustrate the effectiveness of our method. The extensive experiments
indicate that DehazeFlow surpasses the state-of-the-art methods in terms of PSNR,
SSIM, LPIPS, and subjective visual effects.

GCM-Net: Towards Effective Global Context Modeling for Image Inpainting

  • Huan Zheng
  • Zhao Zhang
  • Yang Wang
  • Zheng Zhang
  • Mingliang Xu
  • Yi Yang
  • Meng Wang

Deep learning based inpainting methods have obtained promising performance for image
restoration; however, current image inpainting methods still tend to produce unreasonable
structures and blurry textures when processing damaged images with heavy corruption.
In this paper, we propose a new image inpainting method termed Global Context Modeling
Network (GCM-Net). By capturing the global contextual information, GCM-Net can potentially
improve the performance of recovering the missing region in the damaged images with
irregular masks. To be specific, we first use four convolution layers to extract the
shallow features. Then, we design a progressive multi-scale fusion block termed PMSFB
to extract and fuse the multi-scale features for obtaining local features. Besides,
a dense context extraction (DCE) module is also designed to aggregate the local features
extracted by PMSFBs. To improve the information flow, a channel attention guided residual
learning module is deployed in both the DCE and PMSFB, which can reweight the learned
residual features and refine the extracted information. To capture more global contextual
information and enhance the representation ability, a coordinate context attention
(CCA) based module is also presented. Finally, the extracted features with rich information
are decoded as the image inpainting result. Extensive results on the Paris Street
View, Places2 and CelebA-HQ datasets demonstrate that our method can better recover
the structures and textures, and deliver significant improvements, compared with some
related inpainting methods.

Embracing the Dark Knowledge: Domain Generalization Using Regularized Knowledge Distillation

  • Yufei Wang
  • Haoliang Li
  • Lap-pui Chau
  • Alex C. Kot

Though convolutional neural networks are widely used in different tasks, their lack of generalization
capability in the absence of sufficient and representative data is one of the challenges
that hinder their practical application. In this paper, we propose a simple, effective,
and plug-and-play training strategy named Knowledge Distillation for Domain Generalization
(KDDG) which is built upon a knowledge distillation framework with the gradient filter
as a novel regularization term. We find that both the "richer dark knowledge" from
the teacher network, as well as the gradient filter we proposed, can reduce the difficulty
of learning the mapping which further improves the generalization ability of the model.
We also conduct extensive experiments to show that our framework can significantly
improve the generalization capability of deep neural networks in different tasks, including
image classification, segmentation, and reinforcement learning, by comparing our method
with existing state-of-the-art domain generalization techniques. Last but not
least, we propose to adopt two metrics to analyze our proposed method in order to
better understand how our proposed method benefits the generalization capability of
deep neural networks.

Cluster and Scatter: A Multi-grained Active Semi-supervised Learning Framework for
Scalable Person Re-identification

  • Bingyu Hu
  • Zheng-Jun Zha
  • Jiawei Liu
  • Xierong Zhu
  • Hongtao Xie

Active learning has recently attracted increasing attention in the task of person
re-identification, due to its unique scalability that not only maximally reduces the
annotation cost but also retains satisfactory performance. Although some preliminary
active learning methods have been explored for the scalable person re-identification task,
they have the following two problems: 1) the inefficiency in the selection process
of image pairs due to the huge search space, and 2) the ineffectiveness caused by
ignoring the impact of unlabeled data in model training. Considering that, we propose
a Multi-grained Active Semi-Supervised learning framework, named MASS, to address
the scalable person re-identification problem in practical scenarios.
Specifically, we first design a cluster-scatter procedure to alleviate the inefficiency
problem, which consists of two components: a cluster step and a scatter step. The cluster
step shrinks the search space into individual small clusters by a coarse-grained clustering
method, and the subsequent scatter step further mines the hard-to-distinguish image
pairs from the unlabeled set to purify the learned clusters by a novel centrality-based
adaptive purification strategy. Afterward, we introduce a customized purification
loss for the purified clustering, which utilizes the complementary information in
both labeled and unlabeled data to optimize the model for solving the ineffectiveness
problem. The cluster-scatter procedure and the model optimization are performed in
an iterative fashion to achieve promising performance while greatly reducing the
annotation cost. Extensive experimental results demonstrate that MASS can even
achieve performance competitive with fully supervised methods while requiring far
fewer annotations.

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

  • Xinzhi Dong
  • Chengjiang Long
  • Wenju Xu
  • Chunxia Xiao

Existing image captioning methods focus only on understanding the relationship between
objects or instances in a single image, without exploring the contextual correlation
that exists among contextual images. In this paper, we propose Dual Graph Convolutional
Networks (Dual-GCN) with transformer and curriculum learning for image captioning.
In particular, we not only use an object-level GCN to capture the object-to-object
spatial relation within a single image, but also adopt an image-level GCN to capture
the feature information provided by similar images. With the well-designed Dual-GCN,
we can make the linguistic transformer better understand the relationship between
different objects in a single image and make full use of similar images as auxiliary
information to generate a reasonable caption description for a single image. Meanwhile,
with a cross-review strategy introduced to determine difficulty levels, we adopt curriculum
learning as the training strategy to increase the robustness and generalization of
our proposed model. We conduct extensive experiments on the large-scale MS COCO dataset,
and the experimental results powerfully demonstrate that our proposed method outperforms
recent state-of-the-art approaches. It achieves a BLEU-1 score of 82.2 and a BLEU-2
score of 67.6. Our source code is available at

Build Your Own Bundle - A Neural Combinatorial Optimization Method

  • Qilin Deng
  • Kai Wang
  • Minghao Zhao
  • Runze Wu
  • Yu Ding
  • Zhene Zou
  • Yue Shang
  • Jianrong Tao
  • Changjie Fan

In the business domain, bundling is one of the most important marketing strategies
to conduct product promotions, which is commonly used in online e-commerce and offline
retailers. Existing recommender systems mostly focus on recommending individual items
that users may be interested in, such as the considerable research work on collaborative
filtering that directly models the interaction between users and items. In this paper,
we target a practical but less explored recommendation problem named personalized
bundle composition, which aims to offer an optimal bundle (i.e., a combination of
items) to the target user. To tackle this specific recommendation problem, we formalize
it as a combinatorial optimization problem on a set of candidate items and solve it
within a neural combinatorial optimization framework. Extensive experiments on public
datasets are conducted to demonstrate the superiority of the proposed method.

Unsupervised Image Deraining: Optimization Model Driven Deep CNN

  • Changfeng Yu
  • Yi Chang
  • Yi Li
  • Xile Zhao
  • Luxin Yan

Deep convolutional neural networks have achieved significant progress in single
image rain streak removal. However, most data-driven learning methods are fully supervised
or semi-supervised and suffer from a significant performance drop when
dealing with real rain. These data-driven learning methods have strong representation ability
yet generalize poorly to real rain. The opposite holds true for the model-driven unsupervised
optimization methods. To overcome these problems, we propose a unified unsupervised
learning framework which inherits the generalization and representation merits for
real rain removal. Specifically, we first discover a simple yet important piece of domain knowledge, namely
that directional rain streaks are anisotropic while natural clean images are isotropic,
and formulate the structural discrepancy into the energy function of the optimization
model. Consequently, we design an optimization model driven deep CNN in which the
unsupervised loss function of the optimization model is enforced on the proposed network
for better generalization. In addition, the architecture of the network mimics the
main role of the optimization models with better feature representation. On the one hand,
we take advantage of the deep network to improve the representation. On the other
hand, we utilize the unsupervised loss of the optimization model for better generalization.
Overall, the unsupervised learning framework achieves good generalization and representation:
unsupervised training (loss) with only a few real rainy images (input) and a physically
meaningful network (architecture). Extensive experiments on synthetic and real-world
rain datasets show the superiority of the proposed method.

SESSION: Keynote Talks III&IV

Do you see what I see?: Large-scale Learning from Multimodal Videos

  • Cordelia Schmid

In this talk we present recent progress on large-scale learning of multimodal video
representations. We start by presenting VideoBERT, a joint model for video and language,
repurposing the BERT model for multimodal data. This model achieves state-of-the-art
results on zero-shot prediction and video captioning. Next we show how to extend learning
from instruction videos to general movies based on cross-modal supervision. We use
movie screenplays to learn speech-to-action classifiers and use these classifiers
to mine video clips from thousands of hours of movies. We demonstrate performance
comparable to or better than fully supervised approaches for action classification. Next
we present an approach for video question answering which relies on training from
instruction videos and cross-modal supervision with a textual question answer module.
We show state-of-the-art results for video question answering without any supervision
(zero-shot VQA) and demonstrate that our approach obtains competitive results for
pre-training and then fine-tuning on video question answering datasets. We conclude
our talk by presenting a recent video feature which is fully transformer based. Our
Video Vision Transformer (ViViT) is shown to outperform the state-of-the-art on video
classification. Furthermore, it is flexible and allows for performance/accuracy
trade-offs based on several different architectures.

Large-scale Multi-Modality Pretrained Models: Applications and Experiences

  • Jingren Zhou

In this talk, we present our experiences and applications of large-scale multi-modality
pretrained models, developed at Alibaba and Ant Group. We first present a cross-modal
pretraining method called M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer)
[1], for unified pretraining on the data of multiple modalities. We scale the model
size up to 1 trillion parameters [2], and build the largest pretrained model in Chinese.
We apply the model to a series of downstream applications, and demonstrate its outstanding
performance in comparison with strong baselines. Furthermore, we specifically design
a downstream task of text-guided image generation [3], and show that the finetuned
M6 can create high-quality images with high resolution and fidelity.

We also present research and applications of image editing with pretrained Generative
Adversarial Networks (GANs). A general principle between the underlying manifold and
the generator is discovered. Based on our discovery, we propose an algorithm for GANs
with low-rank factorization [4], which can be harnessed for image editing with pretrained
GAN models.

SESSION: Session 17: Multimodal Fusion and Embedding-I

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

  • Xiaoqi Zhao
  • Youwei Pang
  • Jiaxing Yang
  • Lihe Zhang
  • Huchuan Lu

Location and appearance are the key cues for video object segmentation. Many sources
such as RGB, depth, optical flow and static saliency can provide useful information
about the objects. However, existing approaches use only RGB, or RGB and optical
flow. In this paper, we propose a novel multi-source fusion network for zero-shot
video object segmentation. With the help of an interoceptive spatial attention module
(ISAM), the spatial importance of each source is highlighted. Furthermore, we design a
feature purification module (FPM) to filter the inter-source incompatible features.
With the ISAM and FPM, the multi-source features are effectively fused. In addition,
we put forward an automatic predictor selection network (APS) to select the better
prediction of either the static saliency predictor or the moving object predictor
in order to prevent over-reliance on the failed results caused by low-quality optical
flow maps. Extensive experiments on three challenging public benchmarks (i.e., DAVIS$_{16}$,
Youtube-Objects and FBMS) show that the proposed model achieves compelling performance
against state-of-the-art methods. The source code will be publicly available at

Self-supervised Consensus Representation Learning for Attributed Graph

  • Changshu Liu
  • Liangjian Wen
  • Zhao Kang
  • Guangchun Luo
  • Ling Tian

Attempting to fully exploit the rich information of topological structure and node
features in attributed graphs, we introduce a self-supervised learning mechanism to
graph representation learning and propose a novel Self-supervised Consensus Representation
Learning (SCRL) framework. In contrast to most existing works that only explore one
graph, our proposed SCRL method treats a graph from two perspectives: the topology graph
and the feature graph. We argue that their embeddings should share some common information,
which could serve as a supervisory signal. Specifically, we construct the feature
graph of node features via the k-nearest neighbour algorithm. Then graph convolutional
network (GCN) encoders extract features from the two graphs, respectively. A self-supervised
loss is designed to maximize the agreement of the embeddings of the same node in the
topology graph and the feature graph. Extensive experiments on real citation networks
and social networks demonstrate the superiority of our proposed SCRL over the state-of-the-art
methods on the semi-supervised node classification task. Meanwhile, compared with its
main competitors, SCRL is rather efficient.
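
The feature-graph construction mentioned in the SCRL abstract can be sketched as follows. The abstract only says a k-nearest neighbour algorithm is used, so the Euclidean distance, the symmetrization step, and the names `knn_graph` and `features` are assumptions for illustration.

```python
import math


def knn_graph(features, k):
    """Return an undirected adjacency dict linking each node to its k nearest neighbours.

    features is a list of equal-length numeric vectors, one per node.
    Distance is Euclidean here; another similarity measure could be used instead.
    """
    n = len(features)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        distances = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        for _, j in distances[:k]:
            adjacency[i].add(j)
            adjacency[j].add(i)  # symmetrize so the graph is undirected
    return adjacency


# Hypothetical toy node features forming two well-separated pairs.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
graph = knn_graph(features, k=1)
```

The resulting adjacency structure plays the role of the feature graph, which is then encoded alongside the original topology graph so that the two embeddings of each node can be compared.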

Efficient Multi-Modal Fusion with Diversity Analysis

  • Shuhui Qu
  • Yan Kang
  • Janghwan Lee

Multi-modal machine learning has been a prominent multi-disciplinary research area
since its success in complex real-world problems. Empirically, multi-branch fusion
models tend to generate better results when there is high diversity among the branches
of the model. However, such experience alone neither guarantees the fusion model's
best performance nor has sufficient theoretical support. We present a theoretical
estimation of the fusion models' performance by measuring each branch model's performance
and the distance between branches based on the analysis of several of the most popular fusion
methods. The theorem is validated empirically by numerical experiments. We further
present a branch model selection framework to identify the candidate branches for
fusion models to achieve the optimal multi-modal performance by using the theorem.
The framework's effectiveness is demonstrated on various datasets by showing how effectively
it selects the combination of branch models that attains superior performance.

GCCN: Geometric Constraint Co-attention Network for 6D Object Pose Estimation

  • Yongming Wen
  • Yiquan Fang
  • Junhao Cai
  • Kimwa Tung
  • Hui Cheng

In the 6D object pose estimation task, object models are usually available and represented
as point cloud sets in the canonical object frame, which are important references for
estimating object poses with respect to the camera frame. However, directly introducing object
models as prior knowledge (i.e., object model point clouds) can cause potential
perturbations and even degrade pose estimation performance. To make the most of
object model priors and eliminate the problem, we present an end-to-end deep learning
approach called the Geometric Constraint Co-attention Network (GCCN) for 6D object
pose estimation. GCCN is designed to explicitly leverage the object model priors effectively
with the co-attention mechanism. We add explicit geometric constraints to a co-attention
module to inform the geometric correspondence relationships between points in the
scene and object model priors and develop a novel geometric constraint loss to guide
the training. In this manner, our method effectively eliminates the side effect of
directly introducing the object model priors into the network. Experiments on the
YCB-Video and LineMOD datasets demonstrate that our GCCN substantially improves the
performance of pose estimation and is robust against heavy occlusions. We also demonstrate
that GCCN is accurate and robust enough to be deployed in real-world robotic tasks.

Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

  • Paul Pu Liang
  • Peter Wu
  • Liu Ziyin
  • Louis-Philippe Morency
  • Ruslan Salakhutdinov

How can we generalize to a new prediction task at test time when it also uses a new
modality as input? More importantly, how can we do this with as little annotated data
as possible? This problem of cross-modal generalization is a new research milestone
with concrete impact on real-world applications. For example, can an AI system start
understanding spoken language from mostly written text? Or can it learn the visual
steps of a new recipe from only text descriptions? In this work, we formalize cross-modal
generalization as a learning paradigm to train a model that can (1) quickly perform
new tasks (from new domains) while (2) being originally trained on a different input
modality. Such a learning paradigm is crucial for generalization to low-resource modalities
such as spoken speech in rare languages while utilizing a different high-resource
modality such as text. One key technical challenge that makes it different from other
learning paradigms such as meta-learning and domain adaptation is the presence of
different source and target modalities which will require different encoders. We propose
an effective solution based on meta-alignment, a novel method to align representation
spaces using strongly and weakly paired cross-modal data while ensuring quick generalization
to new tasks across different modalities. This approach uses key ideas from cross-modal
learning and meta-learning, and presents strong results on the cross-modal generalization
problem. We benchmark several approaches on 3 real-world classification tasks: few-shot
recipe classification from text to images of recipes, object classification from images
to audio of objects, and language classification from text to spoken speech across
100 languages spanning many rare languages. Our results demonstrate strong performance
even when the new target modality has only a few (1-10) labeled samples and in the
presence of noisy labels, a scenario particularly prevalent in low-resource modalities.

Elastic Tactile Simulation Towards Tactile-Visual Perception

  • Yikai Wang
  • Wenbing Huang
  • Bin Fang
  • Fuchun Sun
  • Chang Li

Tactile sensing plays an important role in robotic perception and manipulation tasks.
To overcome the real-world limitations of data collection, simulating tactile response
in a virtual environment comes as a desirable direction of robotic research. In this
paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation, which
is capable of reflecting the elastic property of the tactile sensor as well as characterizing
the fine-grained physical interaction during contact. Specifically, EIP models the
tactile sensor as a group of coordinated particles, and the elastic property is applied
to regulate the deformation of particles during contact. With the tactile simulation
by EIP, we further propose a tactile-visual perception network that enables information
fusion between tactile data and visual images. The perception network is based on
a global-to-local fusion mechanism where multi-scale tactile features are aggregated
to the corresponding local region of the visual modality with the guidance of tactile
positions and directions. The fusion method exhibits superiority regarding the 3D
geometric reconstruction task. Our code for EIP is available at

SESSION: Session 18: Multimodal Fusion and Embedding-II

A Novel Patch Convolutional Neural Network for View-based 3D Model Retrieval

  • Zan Gao
  • Yuxiang Shao
  • Weili Guan
  • Meng Liu
  • Zhiyong Cheng
  • Shengyong Chen

In industrial enterprises, effective retrieval of three-dimensional (3D) computer-aided
design (CAD) models can save considerable time and cost in new product development and
manufacturing, so the problem has attracted many researchers. Recently, many view-based
3D model retrieval methods have been proposed and have achieved state-of-the-art performance.
However, most of these methods focus on extracting more discriminative view-level
features and effectively aggregating the multi-view images of a 3D model, and the
latent relationship among these multi-view images is not fully explored. Thus, we
tackle this problem from the perspective of exploiting the relationships between patch
features to capture long-range associations among multi-view images. To this end,
we propose a novel patch convolutional neural network (PCNN)
for view-based 3D model retrieval. Specifically, we first employ a CNN to extract
patch features of each view image separately. Second, a novel neural network module
named PatchConv is designed to exploit intrinsic relationships between neighboring
patches in the feature space to capture long-range associations among multi-view images.
Then, an adaptive weighted view layer is further embedded into PCNN to automatically
assign a weight to each view according to the similarity between each view feature
and the view-pooling feature. Finally, a discrimination loss function is employed
to extract the discriminative 3D model feature, which consists of softmax loss values
generated by the fusion classifier and the specific classifier. Extensive experimental
results on two public 3D model retrieval benchmarks, namely ModelNet40 and ModelNet10,
demonstrate that our proposed PCNN can outperform state-of-the-art approaches, with
mAP values of 93.67% and 96.23%, respectively.
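
The adaptive weighted view layer described in this abstract can be illustrated with a
minimal sketch. The abstract does not say how similarity is measured, how views are
pooled, or how weights are normalised, so cosine similarity, element-wise max pooling
and a softmax are assumptions made here, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def adaptive_view_weights(view_features):
    """Weight each view by its similarity to the view-pooling feature."""
    # Assumption: view pooling is an element-wise max over all views.
    pooled = [max(column) for column in zip(*view_features)]
    sims = [cosine(feature, pooled) for feature in view_features]
    # Assumption: similarities are turned into weights with a softmax.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_views(view_features):
    """Weighted sum of view features using the adaptive weights."""
    weights = adaptive_view_weights(view_features)
    dim = len(view_features[0])
    return [sum(w * f[d] for w, f in zip(weights, view_features)) for d in range(dim)]

views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = adaptive_view_weights(views)
fused = fuse_views(views)
```

In this toy input the second view is closest to the pooled feature, so it receives the
largest weight; the weights always sum to one.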

Semi-Autoregressive Image Captioning

  • Xu Yan
  • Zhengcong Fei
  • Zekang Li
  • Shuhui Wang
  • Qingming Huang
  • Qi Tian

Current state-of-the-art approaches for image captioning typically adopt an autoregressive
manner, i.e., generating descriptions word by word, which suffers from slow decoding
and becomes a bottleneck in real-time applications. Non-autoregressive image
captioning with continuous iterative refinement, which eliminates the sequential dependence
in sentence generation, can achieve comparable performance to the autoregressive
counterparts with a considerable acceleration. Nevertheless, based on a well-designed
experiment, we empirically proved that iteration times can be effectively reduced
when providing sufficient prior knowledge for the language decoder. Towards that end,
we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning
(SAIC), to make a better trade-off between performance and speed. The proposed SAIC
model keeps the autoregressive property globally but relaxes it locally. Specifically,
the SAIC model first generates an intermittent sequence in an autoregressive manner,
that is, it predicts the first word of every word group in order. Then, with the help
of this partially deterministic prior information and the image features, the SAIC model
non-autoregressively fills in all the skipped words in one iteration. Experimental results on the MS COCO
benchmark demonstrate that our SAIC model outperforms the preceding non-autoregressive
image captioning models while obtaining a competitive inference speedup.
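
The two-stage control flow described in this abstract can be sketched as follows. This
is only an illustration of the decoding order, not the SAIC model itself: the two
decoder callables are hypothetical stand-ins that read from a fixed sentence, and image
features are omitted.

```python
def semi_autoregressive_decode(predict_leader, fill_group, num_groups, group_size):
    """Two-stage decoding: sequential group leaders, then one parallel fill pass."""
    # Stage 1 (autoregressive): each leader is conditioned on the leaders so far.
    leaders = []
    for _ in range(num_groups):
        leaders.append(predict_leader(tuple(leaders)))

    # Stage 2 (non-autoregressive): every group is filled from the same fixed
    # set of leaders, so the calls are independent and could run in parallel.
    caption = []
    for group_index, leader in enumerate(leaders):
        caption.append(leader)
        caption.extend(fill_group(tuple(leaders), group_index, group_size - 1))
    return leaders, caption


# Toy stand-ins for the two decoders: they simply read from a fixed sentence.
REFERENCE = "a man rides a brown horse".split()
GROUP_SIZE = 2

def toy_predict_leader(previous_leaders):
    return REFERENCE[len(previous_leaders) * GROUP_SIZE]

def toy_fill_group(leaders, group_index, count):
    start = group_index * GROUP_SIZE + 1
    return REFERENCE[start:start + count]

leaders, caption = semi_autoregressive_decode(
    toy_predict_leader, toy_fill_group, num_groups=3, group_size=GROUP_SIZE)
print(leaders)   # ['a', 'rides', 'brown']
print(caption)   # ['a', 'man', 'rides', 'a', 'brown', 'horse']
```

With a group size of 2, only half of the words are produced sequentially; the rest are
filled in a single pass.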

One-Stage Incomplete Multi-view Clustering via Late Fusion

  • Yi Zhang
  • Xinwang Liu
  • Siwei Wang
  • Jiyuan Liu
  • Sisi Dai
  • En Zhu

As a representative of multi-view clustering (MVC), the late fusion MVC (LF-MVC) algorithm
has attracted intensive attention due to its superior clustering accuracy and high
computational efficiency. One common assumption adopted by existing LF-MVC algorithms
is that all views of each sample are available. However, it is widely observed that
some samples have incomplete views in practice. In this paper, we propose
One-Stage Late Fusion Incomplete Multi-view Clustering (OS-LF-IMVC) to address this
issue. Specifically, we propose to unify the imputation of incomplete views and the
clustering task into a single optimization procedure, so that the learning of the
consensus partition matrix can directly assist the final clustering task. To solve
the resultant optimization problem, we develop a five-step alternating strategy with
theoretically proved convergence. Comprehensive experiments on multiple benchmark
datasets are conducted to demonstrate the efficiency and effectiveness of the proposed
OS-LF-IMVC algorithm.

Self-Representation Subspace Clustering for Incomplete Multi-view Data

  • Jiyuan Liu
  • Xinwang Liu
  • Yi Zhang
  • Pei Zhang
  • Wenxuan Tu
  • Siwei Wang
  • Sihang Zhou
  • Weixuan Liang
  • Siqi Wang
  • Yuexiang Yang

Incomplete multi-view clustering is an important research topic in multimedia where
partial data entries of one or more views are missing. Current subspace clustering
approaches mostly employ matrix factorization on the observed feature matrices to
address this issue. Meanwhile, the self-representation technique is left unexplored, since
it explicitly relies on full data entries to construct the coefficient matrix, which
contradicts the incomplete data setting. However, it is widely observed that
the self-representation subspace method achieves better clustering performance than the
factorization-based one. Therefore, we adapt it to incomplete data by jointly performing
data imputation and self-representation learning. To the best of our knowledge, this
is the first such attempt in the incomplete multi-view clustering literature. In addition, the
proposed method is compared experimentally with recent approaches under
different missing ratios, verifying its effectiveness.

Is Visual Context Really Helpful for Knowledge Graph? A Representation Learning Perspective

  • Meng Wang
  • Sen Wang
  • Han Yang
  • Zheng Zhang
  • Xi Chen
  • Guilin Qi

The visual modality has recently attracted extensive attention in the fields of knowledge
graph and multimedia because a lot of real-world knowledge is multi-modal in nature.
However, it is currently unclear to what extent the visual modality can improve the
performance of knowledge graph tasks over unimodal models, and equally treating structural
and visual features may encode too much irrelevant information from images. In this
paper, we probe the utility of the auxiliary visual context from a knowledge graph representation
learning perspective by designing a Relation Sensitive Multi-modal Embedding model,
RSME for short. RSME can automatically encourage or filter the influence of visual
context during the representation learning. We also examine the effect of different
visual feature encoders. Experimental results validate the superiority of our approach
compared to the state-of-the-art methods. On the basis of in-depth analysis, we conclude
that under appropriate circumstances models are capable of leveraging the visual input
to generate better knowledge graph embeddings and vice versa.

Knowledge Perceived Multi-modal Pretraining in E-commerce

  • Yushan Zhu
  • Huaixiao Zhao
  • Wen Zhang
  • Ganqiang Ye
  • Hui Chen
  • Ningyu Zhang
  • Huajun Chen

In this paper, we address multi-modal pretraining of product data in the field of
E-commerce. Current multi-modal pretraining methods proposed for image and text modalities
lack robustness in the face of modality-missing and modality-noise, which are two
pervasive problems of multi-modal product data in real E-commerce scenarios. To this
end, we propose a novel method, K3M, which introduces a knowledge modality in multi-modal
pretraining to correct noise in, and compensate for missing information from, the image and text modalities.
The modal-encoding layer extracts the features of each modality. The modal-interaction
layer is capable of effectively modeling the interaction of multiple modalities, where
an initial-interactive feature fusion model is designed to maintain the independence
of image modality and text modality, and a structure aggregation module is designed
to fuse the information of image, text, and knowledge modalities. We pretrain K3M
with three pretraining tasks, including masked object modeling (MOM), masked language
modeling (MLM), and link prediction modeling (LPM). Experimental results on a real-world
E-commerce dataset and a series of product-based downstream tasks demonstrate that
K3M significantly outperforms the baseline and state-of-the-art
methods when modality-noise or modality-missing exists.

SESSION: Session 19: Video Program and Demo Session

Text2Video: Automatic Video Generation Based on Text Scripts

  • Yipeng Yu
  • Zirui Tu
  • Longyu Lu
  • Xiao Chen
  • Hui Zhan
  • Zixun Sun

To make video creation simpler, in this paper we present Text2Video, a novel system
to automatically produce videos using only text-editing for novice users. Given an
input text script, the director-like system can generate game-related engaging videos
which illustrate the given narrative, provide diverse multi-modal content, and follow
video editing guidelines. The system involves five modules: (1) A material manager
extracts highlights from raw live game videos, and tags each video highlight, image
and audio with labels. (2) A natural language processor extracts entities and semantics
from the input text scripts. (3) A refined cross-modal retrieval searches for matching
candidate shots from the material manager. (4) A text to speech speaker reads the
processed text scripts with synthesized human voice. (5) The selected material shots
and synthesized speech are assembled artistically through appropriate video editing

A System for Interactive and Intelligent AD Auxiliary Screening

  • Sen Yang
  • Qike Zhao
  • Lanxin Miao
  • Min Chen
  • Lianli Gao
  • Jingkuan Song
  • Weidong Le

The Montreal Cognitive Assessment (MoCA) test is an auxiliary medical screening method
for Alzheimer's disease (AD). In the traditional process, a testee is required
to complete several test items on a paper questionnaire under the guidance of
a member of the medical staff. This process is inefficient and depends largely on the doctor's subjective
judgment and experience level. Therefore, we propose an Interactive and Intelligent
AD Auxiliary Screening (IAS) system consisting of a speech-based Interactive Unit Testing
Module (IUTM) and a truth-based Intelligent Analysis Module (IAM), both of which are
developed with deep learning techniques. Following the guidance of voice commands, the
testee can complete the MoCA test independently in IUTM using only a mobile device,
and the testing data is then analyzed accurately and objectively by IAM. Moreover,
compared with the traditional method, the electronic system makes it easier to collect
and analyze clinical data for further research. The system was deployed in the Department
of Neurology, Sichuan Provincial People's Hospital in June 2021 and has been used
in the clinical screening of Alzheimer's disease.

Move As You Like: Image Animation in E-Commerce Scenario

  • Borun Xu
  • Biao Wang
  • Jiale Tao
  • Tiezheng Ge
  • Yuning Jiang
  • Wen Li
  • Lixin Duan

Creative image animations are attractive in e-commerce applications, where motion
transfer is one of the important ways to generate animations from static images. However,
existing methods rarely transfer motion to objects other than the human body or
face, and even fewer apply motion transfer in practical scenarios. In this work, we
apply motion transfer to Taobao product images in a real e-commerce scenario to
generate creative animations, which are more attractive than static images and
will bring more benefits. For demonstration, we animate Taobao products including dolls, copper running
horses and toy dinosaurs using a motion transfer method.

MDMS: Music Data Matching System for Query Variant Retrieval

  • Rinita Roy
  • Ruben Mayer
  • Hans-Arno Jacobsen

The distribution of royalty fees to music right holders is slow and inefficient due
to the lack of automation in music recognition and music licensing processes. The
challenge for an improved system is to recognise different versions of a music work, such
as remixes or cover versions, leading to clear assessment and unique identification
of each music work. Through our music data matching system called MDMS, we query many
indexed and stored music pieces with a small part of a music piece. The system retrieves
the closest stored variant of the input query by using music fingerprints of the underlying
melody together with signal processing techniques. Tailored indices based on fingerprint
hashes accelerate processing across a large corpus of stored music. Results are found
even if the stored versions vary from the query song in terms of one or more music
features --- tempo, key/mode, presence of instruments/vocals, and singer --- and the
differences are highlighted in the output.
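
The idea of accelerating lookup with indices built on fingerprint hashes can be sketched
with a small inverted index. This is a minimal illustration, not the MDMS implementation:
the class and method names are hypothetical, fingerprint hashes are assumed to be
precomputed integers, and ranking by the number of shared hashes is an assumption.

```python
from collections import Counter, defaultdict

class FingerprintIndex:
    """Inverted index from fingerprint hash to the tracks containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, track_id, hashes):
        """Register a stored track under each of its fingerprint hashes."""
        for h in hashes:
            self._postings[h].add(track_id)

    def query(self, hashes):
        """Return (track_id, shared_hash_count) pairs, best match first."""
        votes = Counter()
        for h in set(hashes):
            # .get avoids creating empty postings for unseen hashes.
            for track_id in self._postings.get(h, ()):
                votes[track_id] += 1
        return votes.most_common()

index = FingerprintIndex()
index.add("original", [1, 2, 3, 4])
index.add("cover", [2, 3, 4, 9])
index.add("other", [7, 8])
matches = index.query([3, 4, 9])
print(matches)  # [('cover', 3), ('original', 2)]
```

Only tracks that share at least one hash with the query are touched, which is what makes
the lookup cheaper than comparing the query against every stored piece.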

Community Generated VR Painting using Eye Gaze

  • Mu Mu
  • Murtada Dohan

The social experience is an important part of art exhibitions. This demo introduces
an eye-gaze based generative art prototype for virtual reality (VR) art exhibitions.
Our work extends the visitors' role from individual art explorers to
content co-creators. The design generates live community artworks based on all visitors'
visual interactions with VR paintings. During our VR exhibition at a public gallery,
over 100 visitors participated in the new creative process for community-generated

Sync Glass: Virtual Pouring and Toasting Experience with Multimodal Presentation

  • Yuki Tajima
  • Toshiharu Horiuchi
  • Gen Hattori

One of the challenges of non-face-to-face communication is the absence of the haptic
dimension. To solve this, a haptic communication system via the Internet has been
proposed. The system has to be designed in such a way that it does not create discomfort
during general use. The "Sync Glass" that we have developed transmits and presents
the feeling of pouring a drink and making a toast accompanied by haptic, sound and
visual effects. The device is designed to resemble a glass cup and, moreover, each
action, including drinking and making a toast is performed in the customary way, making
its use more acceptable to users. In the internal user demonstrations we performed,
participants described the experience with comments such as "the feeling of pouring
is so realistic", "so enjoyable!", and similar affirmative statements.

VideoDiscovery: An Automatic Short-Video Generation System for E-commerce Live-streaming

  • Yanhao Zhang
  • Qiang Wang
  • Yun Zheng
  • Pan Pan
  • Yinghui Xu

We demonstrate an end-to-end intelligent system of short-video generation for live-streaming,
namely "VideoDiscovery", which aims to automatically produce batches of high-value
short videos by discovering and organizing highlight content for commodity delivery.
Traditionally, producing high-value short videos for live-streaming is costly
and time-consuming, and also demands experienced editing skills. To this end, we
construct this system with three modules: 1) Semantic segment structuring first decodes
the live stream into a series of semantic candidates including commodity, Q&A, action,
multi-modal, etc. 2) A hierarchical search engine automatically searches for
semantically matching candidate shots from scripts. 3) Script-aware shot assembly is
formulated as a combination problem over a graph of shots, considering temporal constraints
and candidate idioms. Specifically, given an input live stream, the recommended
video results illustrate diverse visual-semantic content and follow script guidelines.
Currently, our system has been launched online for Taobao stores, enabling them to
generate appealing videos in minutes for advertising and recommendation. The entry
of our system is available at

SmartSales: An AI-Powered Telemarketing Coaching System in FinTech

  • Yuanfeng Song
  • Xuefang Zhao
  • Di Jiang
  • Xiaoling Huang
  • Weiwei Zhao
  • Qian Xu
  • Raymond Chi-Wing Wong
  • Qiang Yang

Telemarketing is a primary and mature method for enterprises to solicit prospective
customers to buy products or services. However, training telesales representatives
is always a pain point for enterprises since it is usually conducted manually and
costs great effort and time. In this demonstration, we propose a telemarketing coaching
system named SmartSales to help enterprises develop better salespeople. Powered by
artificial intelligence (AI), SmartSales aims to accumulate the experienced sales
pitch from customer-sales dialogues and use it to coach junior salespersons. To the
best of our knowledge, this is the first practice of an AI telemarketing coaching
system in the domain of Chinese FinTech in the literature. SmartSales has been successfully
deployed in WeBank's telemarketing team. We expect that SmartSales will inspire
more research on AI assistant systems.

SmartMeeting: Automatic Meeting Transcription and Summarization for In-Person Conversations

  • Yuanfeng Song
  • Di Jiang
  • Xuefang Zhao
  • Xiaoling Huang
  • Qian Xu
  • Raymond Chi-Wing Wong
  • Qiang Yang

Meetings are a necessary part of the operations of any institution, whether they are
held online or in-person. However, meeting transcription and summarization are always
painful requirements since they involve tedious human effort. This drives the need
for automatic meeting transcription and summarization (AMTS) systems. A successful
AMTS system relies on systematic integration of multiple natural language processing
(NLP) techniques, such as automatic speech recognition, speaker identification, and
meeting summarization, which are traditionally developed separately and validated
offline with standard datasets. In this demonstration, we provide a novel productive
meeting tool named SmartMeeting, which enables users to automatically record, transcribe,
summarize, and manage the information in an in-person meeting. SmartMeeting transcribes
every word on the fly, enriches the transcript with speaker identification and voice
separation, and extracts essential decisions and crucial insights automatically. In
our demonstration, the audience can experience the great potential of the state-of-the-art
NLP techniques in this real-life application.

Aesthetic Evaluation and Guidance for Mobile Photography

  • Hao Lou
  • Heng Huang
  • Chaoen Xiao
  • Xin Jin

Nowadays, almost everyone can shoot photos using smart phones. However, not everyone
can take good photos. We propose to use computational aesthetics to automatically
teach people without photography training to take excellent photos. We present Aesthetic
Dashboard: a system of rich aesthetic evaluation and guidance for mobile photography.
We consider the two most common types of photos: landscapes and portraits.
When people take photos in the preview mode, for landscapes, we show the overall aesthetic
score and the scores of three basic attributes: light, composition and color usage. Meanwhile,
the matching scores between these attributes of the current preview and typical templates
are shown, which helps users adjust the three basic attributes accordingly. For portraits,
besides the above basic attributes, the facial appearance and guidance on face light,
body pose and garment color are also shown to the users. This is the first system
that can teach mobile users to shoot good photos in the form of an aesthetic dashboard,
through which users can adjust several aesthetic attributes to take good photos easily.

A Question Answering System for Unstructured Table Images

  • Wenyuan Xue
  • Siqi Cai
  • Wen Wang
  • Qingyong Li
  • Baosheng Yu
  • Yibing Zhan
  • Dacheng Tao

Question answering over tables is a very popular semantic parsing task in natural
language processing (NLP). However, few existing methods focus on table images, even
though there are usually large-scale unstructured tables in practice (e.g., table
images). Table parsing from images is nontrivial since it is closely related to not
only NLP but also computer vision (CV) to parse the tabular structure from an image.
In this demo, we present a question answering system for unstructured table images.
The proposed system mainly consists of 1) a table recognizer to recognize the tabular
structure from an image and 2) a table parser to generate the answer to a natural
language question over the table. In addition, to train the model, we further provide
table images and structure annotations for two widely used semantic parsing datasets.
Specifically, the test set is used for this demo, from which users can either
choose from default questions or enter a new custom question.

Post2Story: Automatically Generating Storylines from Microblogging Platforms

  • Xujian Zhao
  • Chongwei Wang
  • Peiquan Jin
  • Hui Zhang
  • Chunming Yang
  • Bo Li

In this paper, we demonstrate Post2Story, which aims to detect events and generate
storylines on microblog posts. Post2Story has several new features: (1) It proposes
to employ social influence to extract events from microblogs. (2) It presents a new
Event Graph Convolutional Network (E-GCN) model to learn the latent relationships
among events, which can help predict the story branch of an event and link events.
(3) It offers a user-friendly interface to extract and visualize the development of
events. After an introduction to the system architecture and key technologies of Post2Story,
we demonstrate the functionalities of Post2Story on a real dataset.

ViDA-MAN: Visual Dialog with Digital Humans

  • Tong Shen
  • Jiawei Zuo
  • Fan Shi
  • Jin Zhang
  • Liqin Jiang
  • Meng Chen
  • Zhengchen Zhang
  • Wei Zhang
  • Xiaodong He
  • Tao Mei

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which
offers real-time audio-visual responses to instant speech inquiries. Compared to traditional
text- or voice-based systems, ViDA-MAN offers human-like interactions (e.g., vivid voice,
natural facial expressions and body gestures). Given a speech request, the demonstration
is able to respond with high-quality videos at sub-second latency. To deliver an immersive
user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Automatic
Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), and talking-head video
generation. Backed by a large knowledge base, ViDA-MAN is able to chat with users
on a number of topics including chit-chat, weather, device control, news recommendations,
booking hotels, as well as answering questions via structured knowledge.

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

  • Yupan Huang
  • Bei Liu
  • Jianlong Fu
  • Yutong Lu

A creative image-and-text generative AI system mimics humans' extraordinary abilities
to provide users with diverse and comprehensive caption suggestions, as well as rich
image creations. In this work, we demonstrate such an AI creation system to produce
both diverse captions and rich images. When users imagine an image and associate it
with multiple captions, our system paints a rich image to reflect all captions faithfully.
Likewise, when users upload an image, our system depicts it with multiple diverse
captions. We propose a unified multi-modal framework to achieve this goal. Specifically,
our framework jointly models image-and-text representations with a Transformer network,
which supports rich image creation by accepting multiple captions as input. We consider
the relations among input captions to encourage diversity in training and adopt a
non-autoregressive decoding strategy to enable real-time inference. Based on these,
our system supports the generation of both diverse captions and rich images. Our code is
available online.

Softly: Simulated Empathic Touch between an Agent and a Human

  • Maxime Grandidier
  • Fabien Boucaud
  • Indira Thouvenin
  • Catherine Pelachaud

RecipeLog: Recipe Authoring App for Accurate Food Recording

  • Akihisa Ishino
  • Yoko Yamakata
  • Hiroaki Karasawa
  • Kiyoharu Aizawa

Diet management is usually conducted by recording the names of foods eaten, but in
fact, the nutritional value of foods with the same name varies greatly from recipe to
recipe. To know the accurate nutritional values of foods, recording personal recipes
is effective but time-consuming. Therefore, we are developing a mobile application,
"RecipeLog", that assists users in writing their own recipes by modifying prepared ones.
In our experiments, we show that with RecipeLog users create personal recipes with
45% less edit distance compared to writing from scratch.
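
To make the edit-distance comparison concrete, the sketch below computes a standard
Levenshtein distance over recipe steps. The abstract does not state which edit distance
or which unit (characters, words, steps) was used, so treating each step as one token is
an assumption, and the recipes are invented toy data unrelated to the reported 45%.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            current.append(min(previous[j] + 1,          # delete s
                               current[j - 1] + 1,       # insert t
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]

# Toy data: a prepared recipe and the personal recipe the user wants to record.
prepared = ["boil water", "add pasta", "add salt", "drain"]
personal = ["boil water", "add pasta", "add olive oil", "drain", "serve"]

from_prepared = edit_distance(prepared, personal)  # edits when modifying
from_scratch = edit_distance([], personal)         # edits when starting empty
reduction = 1 - from_prepared / from_scratch
```

Here modifying the prepared recipe needs 2 edits instead of 5, because most steps are
already in place.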

iART: A Search Engine for Art-Historical Images to Support Research in the Humanities

  • Matthias Springstein
  • Stefanie Schneider
  • Javad Rahnama
  • Eyke Hüllermeier
  • Hubertus Kohle
  • Ralph Ewerth

In this paper, we introduce iART: an open Web platform for art-historical research
that facilitates the process of comparative vision. The system integrates various
machine learning techniques for keyword- and content-based image retrieval as well
as category formation via clustering. An intuitive GUI helps users define queries
and explore results. By using a state-of-the-art cross-modal deep learning approach,
it is possible to search for concepts that were not previously detected by trained
classification models. Art-historical objects from large, openly licensed collections
such as Amsterdam Rijksmuseum and Wikidata are made available to users.

ArtiVisual: A Platform to Generate and Compare Art

  • Jardenna Mohazzab
  • Abe Vos
  • Jonathan van Westendorp
  • Lucas Lageweg
  • Dylan Prins
  • Aritra Bhowmik

ArtiVisual is a platform for generating new art pieces based on an existing art style
and comparing commonalities between paintings from different eras. We combine an image
generative network with established state-of-the-art visualisation techniques to deepen
the users' understanding of art in general. With ArtiVisual we can generate images
based on art styles via an interactive timeline. Common features between art styles
are reflected in the generated art piece produced by the network after learning the
subspace of each artist's specific features. Visualisations are presented to provide
insight into commonalities between existing and generated images. The combination
of a trained network and our visualisation techniques provides a rigid framework for
thorough exploration and understanding of art datasets.

GCNIllustrator: Illustrating the Effect of Hyperparameters on Graph Convolutional Networks

  • Ivona Najdenkoska
  • Jeroen den Boef
  • Thomas Schneider
  • Justo van der Werf
  • Reinier de Ridder
  • Fajar Fathurrahman
  • Marcel Worring

An increasing number of real-world applications are using graph-structured datasets,
imposing challenges to existing machine learning algorithms. Graph Convolutional Networks
(GCNs) are deep learning models, specifically designed to operate on graphs. One of
the most tedious steps in training GCNs is the choice of the hyperparameters, especially
since they exhibit unique properties compared to other neural models. Not only machine
learning beginners but also experienced practitioners often have difficulty
properly tuning their models. We hypothesize that a tool that visualizes the
effect of hyperparameter choices on performance can accelerate model development
and improve the understanding of these black-box models. Additionally, observing clusters
of certain nodes helps to empirically understand how a given prediction was made due
to the feature propagation step of GCNs. Therefore, this demo introduces GCNIllustrator,
a web-based visual analytics tool for illustrating the effect of hyperparameters
on the predictions in a citation graph.
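
For readers unfamiliar with the feature propagation step mentioned in this abstract, the
sketch below shows one layer of the commonly used GCN rule, ReLU(D^-1/2 (A + I) D^-1/2 H W).
Whether GCNIllustrator uses exactly this variant is not stated, so this is an assumption;
the function names are hypothetical and the adjacency matrix is assumed symmetric with a
zero diagonal.

```python
import math

def normalized_adjacency(adj):
    """Add self-loops, then scale symmetrically by node degree."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weights):
    """One propagation step: mix neighbour features, transform, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]

# Two connected nodes with one feature each and an identity weight.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
layer_weights = [[1.0]]
output = gcn_layer(adjacency, node_features, layer_weights)
```

After one step both nodes hold the average of their own and their neighbour's feature,
which is why connected nodes tend to end up with similar predictions.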

On-demand Action Detection System using Pose Information

  • Noboru Yoshida
  • Jianquan Liu

Human action detection is a very important yet difficult task for various multimedia
applications such as safety surveillance, sports video analysis and video editing
in the media industry. Most existing methods proposed for action detection are machine
learning based approaches, for which preparing annotated training
data is highly time-consuming and costly. Thus, it is still very difficult to apply these methods to
industrial applications where the actions of interest, such as criminal or suspicious
behaviors, occur rarely in real scenarios, because it is impossible to collect
a large amount of training data for such target actions. In this paper, we
abandon these conventional methods and instead adopt an on-demand retrieval
approach using pose information to handle the action detection task. We introduce
a demo system that can detect similar actions immediately from a sample video of a few
seconds, without any training process. The system demonstrates the usability and
efficacy of our on-demand approach for human action detection. The experimental results
show that our approach outperforms the state-of-the-art method in
precision and recall, with improvements of up to 11% and 6.1%, respectively.
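
A training-free, retrieval-style detector of the kind described in this abstract can be
sketched as a sliding-window search over pose sequences. The abstract does not give the
pose representation or the similarity measure, so flat keypoint vectors, mean Euclidean
distance and a fixed threshold are assumptions, and all names are hypothetical.

```python
import math

def pose_distance(pose_a, pose_b):
    """Euclidean distance between two poses given as flat keypoint vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))

def sequence_distance(query, window):
    """Mean per-frame pose distance between two equally long sequences."""
    return sum(pose_distance(p, q) for p, q in zip(query, window)) / len(query)

def find_similar_actions(query, video, threshold):
    """Return (start_frame, distance) for every window close to the query."""
    hits = []
    length = len(query)
    for start in range(len(video) - length + 1):
        distance = sequence_distance(query, video[start:start + length])
        if distance <= threshold:
            hits.append((start, distance))
    return hits

# Toy data: a two-frame query and a four-frame video of 2-D "poses".
query_clip = [[0.0, 0.0], [1.0, 1.0]]
video_poses = [[5.0, 5.0], [0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
hits = find_similar_actions(query_clip, video_poses, threshold=0.1)
print(hits)  # [(1, 0.0)]
```

Because the query clip is compared directly against the video, a new target action only
requires a new sample clip rather than annotated training data.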

APF: An Adversarial Privacy-preserving Filter to Protect Portrait Information

  • Xian Zhao
  • Jiaming Zhang
  • Xiaowen Huang

While widely adopted in practical applications, face recognition has been disputed
over the malicious use of face images and potential privacy issues. Online photo sharing
services unintentionally serve as the main channel through which malicious crawlers exploit
face recognition to compromise portrait privacy. In this demo, we propose an adversarial
privacy-preserving filter, which can protect face images from malicious face recognition
algorithms. This filter is generated by an end-cloud collaborated adversarial attack
framework consisting of three modules: (1) Image-specific gradient generation module,
to extract image-specific gradient in the user end; (2) Adversarial gradient transfer
module, to fine-tune the image-specific gradient in the server; and (3) Universal
adversarial perturbation enhancement module, to append image-independent perturbation
to derive the final adversarial perturbation. A short video about our system is available

Text-driven 3D Avatar Animation with Emotional and Expressive Behaviors

  • Li Hu
  • Jinwei Qi
  • Bang Zhang
  • Pan Pan
  • Yinghui Xu

Text-driven 3D avatar animation has been an essential part of virtual human techniques,
with a wide range of applications in movies, digital games and video streaming.
In this work, we introduce a practical system which drives both the facial and body movements
of a 3D avatar from text input. Our proposed system first converts the text input to a speech
signal and simultaneously conducts text analysis to extract semantic tags. Then we
generate the lip movements from the synthetic speech, while facial expressions
and body movements are generated by the joint modeling of speech and textual information,
which can drive our virtual 3D avatar to talk and act like a real human.

Text to Scene: A System of Configurable 3D Indoor Scene Synthesis

  • Xinyan Yang
  • Fei Hu
  • Long Ye

In this work, we show the Text to Scene (T2S) system, which can configure a 3D indoor scene
from natural language. Given a text, the system organizes the semantic information it contains
into a graph template, completes the graph with a novel graph-based contextual completion
method, Contextual ConvE (CConvE), and visualizes the graph by arranging 3D models under
an object location protocol. In the experiments, qualitative results obtained by the
T2S system and a quantitative evaluation of CConvE compared with other
state-of-the-art approaches are reported.

MovieREP: A New Movie Reproduction Framework for Film Soundtrack

  • Ruiqi Wang
  • Long Ye
  • Qin Zhang

Film sound reproduction is the process of converting the image-form film soundtrack
to wave-form movie sound. In this paper, a novel optical-imaging-based reproduction
framework is proposed, with the basic idea of restoring film audio damage in the
image domain. In the traditional reproduction method, the scanning light emitted by the film
projector causes irreversible physical damage to the flammable film soundtrack (made
of nitrate compounds). By using an optical imaging method to capture the film soundtrack,
our framework can avoid the damage and the self-ignition problem. Experimental results
show that our framework can double the reproduction speed while maintaining
equal sound quality. Also, the sound sampling rate can be enhanced to 162.08%.

SESSION: Session 20: Multimodal Fusion and Embedding-III

DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation

  • Li Gao
  • Jing Zhang
  • Lefei Zhang
  • Dacheng Tao

Unsupervised domain adaptation (UDA) for semantic segmentation aims to adapt a segmentation
model trained on the labeled source domain to the unlabeled target domain. Existing
methods try to learn domain invariant features while suffering from large domain gaps
that make it difficult to correctly align discrepant features, especially in the initial
training phase. To address this issue, we propose a novel Dual Soft-Paste (DSP) method
in this paper. Specifically, DSP selects some classes from a source domain image using
a long-tail class first sampling strategy and softly pastes the corresponding image
patch on both the source and target training images with a fusion weight. Technically,
we adopt the mean teacher framework for domain adaptation, where the pasted source
and target images go through the student network while the original target image goes
through the teacher network. Output-level alignment is carried out by aligning the
probability maps of the target fused image from both networks using a weighted cross-entropy
loss. In addition, feature-level alignment is carried out by aligning the feature
maps of the source and target images from student network using a weighted maximum
mean discrepancy loss. DSP helps the model learn domain-invariant features
from the intermediate domains, leading to faster convergence and better performance.
Experiments on two challenging benchmarks demonstrate the superiority of DSP over
state-of-the-art methods. Code is available at
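
The soft-paste step described in this abstract can be sketched as follows. This is a
hedged illustration only: the linear blend, the default fusion weight of 0.8, the
grayscale nested-list image format, and the name soft_paste are assumptions, and the
long-tail class first sampling that chooses the classes is not modeled here.

```python
def soft_paste(src_img, src_label, dst_img, classes, weight=0.8):
    """Blend pixels of the selected source classes onto a destination image.

    src_img, dst_img: 2D lists of pixel intensities with the same shape.
    src_label: 2D list of per-pixel class ids for the source image.
    classes: set of class ids to paste (assumed to be chosen elsewhere).
    weight: fusion weight of the source pixel; weight=1.0 is a hard paste.
    """
    out = []
    for src_row, lab_row, dst_row in zip(src_img, src_label, dst_img):
        out.append([
            weight * s + (1.0 - weight) * d if lab in classes else d
            for s, lab, d in zip(src_row, lab_row, dst_row)
        ])
    return out
```

Calling soft_paste once with a source image and once with a target image as dst_img
gives the two pasted images that the abstract feeds to the student network.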

Generating Point Cloud from Single Image in The Few Shot Scenario

  • Yu Lin
  • Jinghui Guo
  • Yang Gao
  • Yi-fan Li
  • Zhuoyi Wang
  • Latifur Khan

Reconstructing point clouds from images would greatly benefit many practical CV
applications, such as robotics, automated vehicles, and Augmented Reality. Fueled
by advances in deep neural networks, many deep learning frameworks have recently been
proposed to address this problem. However, these frameworks generally rely on a large
amount of labeled training data (e.g., image and point cloud pairs). Although we usually
have numerous 2D images, corresponding 3D shapes are insufficient in practice. In
addition, most available 3D data covers only a limited number of classes, which further
restricts the models' generalization ability to novel classes. To mitigate these issues,
we propose a novel few-shot single-view point cloud generation framework by considering
both class-specific and class-agnostic 3D shape priors. Specifically, we abstract
each class by a prototype vector that embeds class-specific shape priors. Class-agnostic
shape priors are modeled by a set of learnable shape primitives that encode universal
3D shape information shared across classes. Later, we combine the input image with
class-specific prototypes and class-agnostic shape primitives to guide the point cloud
generation process. Experiments on the popular ModelNet and ShapeNet datasets demonstrate
that our method outperforms state-of-the-art methods in the few-shot setting.

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

  • Yuqing Song
  • Shizhe Chen
  • Qin Jin
  • Wei Luo
  • Jun Xie
  • Fei Huang

Translating e-commerce product descriptions, a.k.a. product-oriented machine translation
(PMT), is essential to serve e-shoppers all over the world. However, due to the domain
specialty, the PMT task is more challenging than traditional machine translation problems.
Firstly, product descriptions contain much specialized jargon, which is
ambiguous to translate without the product image. Secondly, product descriptions are
related to the image in more complicated ways than standard image descriptions, involving
various visual aspects such as objects, shapes, colors or even subjective styles.
Moreover, existing PMT datasets are too small in scale to support the research. In this
paper, we first construct a large-scale bilingual product description dataset called
Fashion-MMT, which contains over 114k noisy and 40k manually cleaned description translations
with multiple product images. To effectively learn semantic alignments among product
images and bilingual texts in translation, we design a unified product-oriented cross-modal
cross-lingual model for pre-training and fine-tuning. Experiments on the Fashion-MMT
and Multi30k datasets show that our model significantly outperforms the state-of-the-art
models, even when they are pre-trained on the same dataset. It is also shown to benefit more from
large-scale noisy data to improve the translation quality. We will release the dataset
and codes at

Pre-training Graph Transformer with Multimodal Side Information for Recommendation

  • Yong Liu
  • Susen Yang
  • Chenyi Lei
  • Guoxin Wang
  • Haihong Tang
  • Juyong Zhang
  • Aixin Sun
  • Chunyan Miao

Side information of items, e.g., images and text descriptions, has been shown to be effective
in contributing to accurate recommendations. Inspired by the recent success of pre-training
models on natural language and images, we propose a pre-training strategy to learn
item representations by considering both item side information and their relationships.
We relate items by common user activities, e.g., co-purchase, and construct a homogeneous
item graph. This graph provides a unified view of item relations and their associated
side information in multimodality. We develop a novel sampling algorithm named MCNSampling
to select contextual neighbors for each item. The proposed Pre-trained Multimodal
Graph Transformer (PMGT) learns item representations with two objectives: 1) graph
structure reconstruction, and 2) masked node feature reconstruction. Experimental
results on real datasets demonstrate that the proposed PMGT model effectively exploits
the multimodality side information to achieve better accuracies in downstream tasks
including item recommendation and click-through ratio prediction. In addition, we
also report a case study of testing PMGT in an online setting with 600 thousand users.
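
The item graph construction mentioned in this abstract (relating items by common user
activities such as co-purchase) can be sketched as below. This is an illustrative
assumption rather than the authors' pipeline: the basket input format, the min_count
threshold, and the name build_item_graph are hypothetical, and MCNSampling and the
transformer itself are not covered.

```python
from collections import defaultdict
from itertools import combinations


def build_item_graph(baskets, min_count=1):
    """Build an undirected, homogeneous item graph from co-purchase baskets.

    baskets: iterable of item-id lists, one list per user activity session.
    min_count: keep an edge only if the pair co-occurs at least this often.
    Returns a dict mapping each connected item to the set of its neighbors.
    """
    counts = defaultdict(int)
    for basket in baskets:
        # Count each unordered pair once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), count in counts.items():
        if count >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)
```

Each node in such a graph would then carry its multimodal side information (image and
text features) as node attributes.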

Learning Disentangled Factors from Paired Data in Cross-Modal Retrieval: An Implicit
Identifiable VAE Approach

  • Minyoung Kim
  • Ricardo Guerrero
  • Vladimir Pavlovic

We tackle the problem of learning the underlying disentangled latent factors that
are shared between the paired bi-modal data in cross-modal retrieval. Typically the
data in both modalities are complex, structured, and high dimensional (e.g., image
and text), for which the conventional deep auto-encoding latent variable models such
as the Variational Autoencoder (VAE) often suffer from difficulty of accurate decoder
training or realistic synthesis. In this paper we propose a novel idea of the implicit
decoder, which completely removes the ambient data decoding module from a latent variable
model, via implicit encoder inversion that is achieved by Jacobian regularization
of the low-dimensional embedding function. Motivated by the recent Identifiable-VAE
(IVAE) model, we modify it to incorporate the query modality data as conditioning
auxiliary input, which allows us to prove that the true parameters of the model are
identifiable under some regularity conditions. Tested on various datasets where
the true factors are fully/partially available, our model is shown to identify the
factors accurately, significantly outperforming conventional latent variable models.

Progressive Graph Attention Network for Video Question Answering

  • Liang Peng
  • Shuangji Yang
  • Yi Bin
  • Guoqing Wang

Video question answering (Video-QA) is the task of answering a natural language question
related to the content of a video. Existing methods generally explore only single interactions,
either between objects or between frames, which are insufficient to deal with the sophisticated
scenes in videos. To tackle this problem, we propose a novel model, termed Progressive
Graph Attention Network (PGAT), which can jointly explore the multiple visual relations
on object-level, frame-level and clip-level. Specifically, in the object-level relation
encoding, we design two kinds of complementary graphs, one for learning the spatial
and semantic relations between objects from the same frame, the other for modeling
the temporal relations between the same object from different frames. The frame-level
graph explores the interactions between diverse frames to record the fine-grained
appearance change, while the clip-level graph models the temporal and semantic relations
between various actions from clips. These different-level graphs are concatenated
in a progressive manner to learn the visual relations from low-level to high-level.
Furthermore, we identify for the first time that there are serious answer biases
in TGIF-QA, a very large Video-QA dataset, and construct a new dataset based
on it, called TGIF-QA-R, to overcome the biases. We evaluate the proposed model on
three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate
that our model significantly outperforms other state-of-the-art models. Our codes
and dataset are available at

SESSION: Session 21: Media Interpretation-I

Mix-order Attention Networks for Image Restoration

  • Tao Dai
  • Yalei Lv
  • Bin Chen
  • Zhi Wang
  • Zexuan Zhu
  • Shu-Tao Xia

Convolutional neural networks (CNNs) have obtained great success in image restoration
tasks, like single image denoising, demosaicing, and super-resolution. However, most
existing CNN-based methods neglect the diversity of image contents and degradations
in the corrupted images and treat channel-wise features equally, thus hindering the
representation ability of CNNs. To address this issue, we propose deep mix-order attention
networks (MAN) to extract features that capture rich feature statistics within networks.
Our MAN is mainly built on simple residual blocks and our mix-order channel attention
(MOCA) module, which further consists of feature gating and feature pooling blocks
to capture different types of semantic information. With our MOCA, our MAN can be
flexible to handle various types of image contents and degradations. Besides, our
MAN can be generalized to different image restoration tasks, like image denoising,
super-resolution, and demosaicing. Extensive experiments demonstrate that our method
performs favorably against state-of-the-art methods in terms of both quantitative and
qualitative results.
Vehicle Counting Network with Attention-based Mask Refinement and Spatial-awareness
Block Loss

  • Ji Zhang
  • Jian-Jun Qiao
  • Xiao Wu
  • Wei Li

Vehicle counting aims to calculate the number of vehicles in congested traffic scenes.
Although object detection and crowd counting have made tremendous progress with the
development of deep learning, vehicle counting remains a challenging task, due to
scale variations, viewpoint changes, inconsistent location distributions, diverse
visual appearances and severe occlusions. In this paper, we propose a Vehicle Counting
Network (VCNet) to alleviate the problems of scale variation and
inconsistent spatial distribution in congested traffic scenes. Specifically, VCNet
is composed of two major components: (i) To capture multi-scale vehicles across different
types and camera viewpoints, an effective multi-scale density map estimation structure
is designed by building an attention-based mask refinement module. The multi-branch
structure with hybrid dilated convolution blocks is proposed to assign receptive fields
to generate multi-scale density maps. To efficiently aggregate multi-scale density
maps, the attention-based mask refinement is designed to highlight the vehicle
regions, which enables each branch to suppress the scale interference from other branches.
(ii) In order to capture the inconsistent spatial distributions, a spatial-awareness
block loss (SBL) based on the region-weighted reward strategy is proposed to calculate
the loss of different spatial regions including sparse, congested and occluded regions
independently by dividing the density map into different regions. Extensive experiments
conducted on three benchmark datasets, TRANCOS, VisDrone2019 Vehicle and CVCSet demonstrate
that the proposed VCNet outperforms the state-of-the-art approaches in vehicle counting.
Moreover, the proposed idea is also applicable to crowd counting, where it produces competitive
results on the ShanghaiTech crowd counting dataset.
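
The motivation for computing the loss per spatial region can be shown with a much
simplified sketch. It is not the paper's SBL: the region-weighted reward strategy is
not described in the abstract and is omitted, and the fixed square grid, the mean
absolute error, and the names block_sums and block_count_loss are assumptions. The
only fact relied on is that the count in a region is the sum of the density map there.

```python
def block_sums(density, block):
    """Sum a 2D density map over non-overlapping block x block cells."""
    height, width = len(density), len(density[0])
    sums = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            row.append(sum(
                density[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ))
        sums.append(row)
    return sums


def block_count_loss(pred, target, block=2):
    """Mean absolute count error over grid cells (uniform cell weights)."""
    pred_cells = block_sums(pred, block)
    target_cells = block_sums(target, block)
    errors = [
        abs(p - t)
        for pred_row, target_row in zip(pred_cells, target_cells)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(errors) / len(errors)
```

A prediction that places a vehicle in the wrong cell can match the ground truth in
total count and still be penalized by this loss, which a single global count error
would miss.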

DPT: Deformable Patch-based Transformer for Visual Recognition

  • Zhiyang Chen
  • Yousong Zhu
  • Chaoyang Zhao
  • Guosheng Hu
  • Wei Zeng
  • Jinqiao Wang
  • Ming Tang

Transformers have achieved great success in computer vision, but how to split an image
into patches remains a problem. Existing methods usually use a fixed-size patch embedding
which might destroy the semantics of objects. To address this problem, we propose
a new Deformable Patch (DePatch) module which learns to adaptively split the images
into patches with different positions and scales in a data-driven way rather than
using predefined fixed patches. In this way, our method can well preserve the semantics
in patches. The DePatch module can work as a plug-and-play module, which can easily
be incorporated into different transformers for end-to-end training. We
term this DePatch-embedded transformer the Deformable Patch-based Transformer (DPT)
and conduct extensive evaluations of DPT on image classification and object detection.
Results show DPT can achieve 81.8% top-1 accuracy on ImageNet classification, and
43.7% box AP with RetinaNet, 44.3% with Mask R-CNN on MSCOCO object detection. Code
has been made available at:

Scene Text Image Super-Resolution via Parallelly Contextual Attention Network

  • Cairong Zhao
  • Shuyang Feng
  • Brian Nlong Zhao
  • Zhijun Ding
  • Jun Wu
  • Fumin Shen
  • Heng Tao Shen

Optical degradation blurs text shapes and edges, so existing scene text recognition
methods have difficulties in achieving desirable results on low-resolution (LR) scene
text images acquired in real-world environments. The above problem can be solved by
efficiently extracting sequential information to reconstruct super-resolution (SR)
text images, which remains a challenging task. In this paper, we propose a Parallelly
Contextual Attention Network (PCAN), which effectively learns sequence-dependent features
and focuses more on high-frequency information of the reconstruction in text images.
Firstly, we explore the importance of sequence-dependent features in the horizontal and
vertical directions in parallel for text SR, and then design a parallelly contextual
attention block to adaptively select the key information in the text sequence that
contributes to image super-resolution. Secondly, we propose a hierarchically orthogonal
texture-aware attention module and an edge guidance loss function, which can help
to reconstruct high-frequency information in text images. Finally, we conduct extensive
experiments on TextZoom dataset, and the results can be easily incorporated into mainstream
text recognition algorithms to further improve their performance in LR image recognition.
Besides, our approach exhibits great robustness in defending against adversarial attacks
on seven mainstream scene text recognition datasets, which means it can also improve
the security of the text recognition pipeline. Compared with directly recognizing
LR images, our method can respectively improve the recognition accuracy of ASTER,
MORAN, and CRNN by 14.9%, 14.0%, and 20.1%. Our method outperforms eleven state-of-the-art
(SOTA) SR methods in terms of boosting text recognition performance. Most importantly,
it outperforms the current best text-oriented SR method, TSRN, by 3.2%, 3.7%, and 6.0%
on the recognition accuracy of ASTER, MORAN, and CRNN, respectively.

Improving Pedestrian Detection from a Long-tailed Domain Perspective

  • Mengyuan Ding
  • Shanshan Zhang
  • Jian Yang

Although pedestrian detection has developed a lot recently, there still exist some
challenging scenarios, such as small scale, occlusion and low light. Current works
usually focus on one of these scenarios independently and propose specific methods.
However, different challenges may occur simultaneously and change over
time, making a scenario-specific method infeasible in practice. Therefore we are motivated
to design a method which is able to handle various challenges and to obtain reasonable
performance across different scenarios. In this paper, we first propose Instance Domain
Compactness (IDC) to measure the difference of each instance in the feature space
and handle hard cases from a novel long-tailed domain perspective. Specifically, we
first propose a Feature Augmentation Module (FAM) to augment the tail instances in
the feature space, thereby increasing the number and diversity of tail samples. In addition,
an IDC-guided loss weighting module (IDCW) is formulated to adaptively re-weight the
loss of each sample so as to balance the optimization procedure. Extensive analysis
and experiments illustrate that our method improves the generalization of the model
without any extra parameters and achieves comparable results across different challenging
scenarios on both CityPersons and Caltech datasets.

Robust Shadow Detection by Exploring Effective Shadow Contexts

  • Xianyong Fang
  • Xiaohao He
  • Linbo Wang
  • Jianbing Shen

Effective contexts for separating shadows from non-shadow objects can appear in different
scales due to different object sizes. This paper introduces a new module, Effective-Context
Augmentation (ECA), to utilize these contexts for robust shadow detection with deep
structures. Taking regular deep features as global references, ECA enhances the discriminative
features from the parallelly computed fine-scale features and, therefore, obtains
robust features embedded with effective object contexts by boosting them. We further
propose a novel encoder-decoder style of shadow detection method where ECA acts as
the main building block of the encoder to extract strong feature representations and
the guidance to the classification process of the decoder. Moreover, the network
is optimized with only one loss, which makes it easy to train and avoids the instability
caused by the extra losses that existing popular studies superimpose on intermediate
features. Experimental results show that the proposed method can effectively eliminate
false detections. In particular, our method outperforms state-of-the-art methods and
improves by over 13.97% and 34.67% on the challenging SBU and UCF datasets, respectively,
in balance error rate.
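
For readers unfamiliar with the metric, the balance error rate (BER) can be computed
as in the sketch below. The abstract does not spell out the formula; this is the
definition commonly used in the shadow detection literature, and the function name and
the pixel-count arguments are chosen here for illustration. Lower values are better.

```python
def balance_error_rate(tp, tn, fp, fn):
    """Balance error rate in percent from pixel-level confusion counts.

    tp, fn: shadow pixels detected correctly / missed.
    tn, fp: non-shadow pixels classified correctly / wrongly marked as shadow.
    Shadow and non-shadow accuracy contribute equally, regardless of how many
    pixels each class covers.
    """
    shadow_accuracy = tp / (tp + fn)
    non_shadow_accuracy = tn / (tn + fp)
    return (1.0 - 0.5 * (shadow_accuracy + non_shadow_accuracy)) * 100.0
```

Because the two classes are averaged rather than pooled, a detector cannot score well
simply by labeling everything as the majority non-shadow class.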

SESSION: Session 22: Doctoral Symposium

End-to-end Quality of Experience Evaluation for HTTP Adaptive Streaming

  • Babak Taraghi

Exponential growth in multimedia streaming traffic over the Internet motivates the
research and further investigation of the user's perceived quality of such services.
Enhancing the quality experienced by users becomes more important when service
providers compete to establish superiority by gaining more subscribers or customers.
Quality of Experience (QoE) enhancement would not be possible without an authentic
and accurate assessment of the streaming sessions. HTTP Adaptive Streaming (HAS) is
today's prevailing technique to deliver the highest possible audio and video content
quality to the users. An end-to-end evaluation of QoE in HAS covers the precise measurement
of the metrics that affect the perceived quality, e.g., startup delay, stall events,
and delivered media quality. Improving these metrics could limit the service's
scalability, which is an important factor in real-world scenarios. In this study,
we will investigate the stated metrics, best practices and evaluation methods, and
available techniques with an aim to (i) design and develop practical and scalable
measurement tools and prototypes, (ii) provide a better understanding of current technologies
and techniques (e.g., Adaptive Bitrate algorithms), (iii) conduct in-depth research
on the significant metrics so that QoE improvements become feasible with scalability
in mind, and finally (iv) provide a comprehensive QoE model which outperforms
state-of-the-art models.
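
Two of the metrics named in this abstract, startup delay and stall events, can be
derived from a player event log as in the following sketch. The event names
("request", "play", "stall_start", "stall_end"), the (timestamp, name) log format, and
the function session_metrics are hypothetical and do not come from the abstract or
from any particular player API.

```python
def session_metrics(events):
    """Compute startup delay and stall statistics for one streaming session.

    events: list of (timestamp_seconds, event_name) tuples in time order.
    Returns a dict with the startup delay (None if playback never started),
    the number of stall events, and the total stalled time in seconds.
    """
    request_time = first_play_time = stall_begin = None
    stall_count = 0
    stall_time = 0.0
    for timestamp, name in events:
        if name == "request" and request_time is None:
            request_time = timestamp
        elif name == "play" and first_play_time is None:
            first_play_time = timestamp
        elif name == "stall_start":
            stall_begin = timestamp
            stall_count += 1
        elif name == "stall_end" and stall_begin is not None:
            stall_time += timestamp - stall_begin
            stall_begin = None

    startup_delay = None
    if request_time is not None and first_play_time is not None:
        startup_delay = first_play_time - request_time
    return {
        "startup_delay": startup_delay,
        "stall_count": stall_count,
        "stall_time": stall_time,
    }
```

Delivered media quality, the third metric listed, would need per-segment bitrate or
quality scores and is left out of this sketch.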

Generative Adversarial Network for Text-to-Face Synthesis and Manipulation

  • Yutong Zhou

Over the past few years, several studies have been conducted on text-to-image synthesis
techniques, which transfer input textual descriptions into realistic images. However,
facial image synthesis and manipulation from input sentences have not been widely
explored due to the lack of datasets. My research interests center around the development
of multi-modality technology and facial image generation with Generative Adversarial
Networks. Towards that end, we propose an approach for facial image generation and
manipulation from text descriptions. We also introduce the first Text-to-Face synthesis
dataset with large-scale facial attributes. In this extended abstract, we first present
the current state and future direction of the Ph.D. research that I have pursued
during the first year. Then, we introduce the proposed method (accepted by IEEE FG2021),
the novel annotated dataset and experimental results. Finally, the future outlook on other
challenges, the proposed dataset and the expected impact are discussed. Codes and paper lists
studied in text-to-image synthesis are summarized on

GAN-aided Serial Dependence Study in Medical Image Perception

  • Zhihang Ren

Medical imaging has been critically important for the health and well-being of millions
of patients. Although deep learning has been widely studied in medical imaging
and its performance has exceeded human performance in certain medical
diagnostic tasks, detecting and diagnosing lesions still depends on the visual system
of human observers (radiologists), who have completed years of training to scrutinize anomalies.
Routinely, radiologists sequentially read batches of medical images one after the
other. A basic underlying assumption of radiologists' precise diagnosis is that their
perceptions and decisions on a current medical image are completely independent from
the previous reading history of medical images. However, recent research proposed
that the human visual system has visual serial dependencies (VSDs) at many levels.
VSD means that what was seen in the past influences (and captures) what is seen and
reported at this moment. Our pilot data via naive artificial stimuli has shown that
VSD has a disruptive effect in radiologic searches that impairs accurate detection
and recognition of tumors or other structures. However, the naive artificial stimuli
have been noted by both untrained observers and expert radiologists to be less authentic.
In this project, we will generate authentic medical images via Generative Adversarial
Networks (GANs) in order to replace the simple stimuli in future experiments. The
rationale for the proposed research project is that once it is known how serial dependence
arises and how it impacts visual search, we can understand how to control for it.
Hence, the accuracy of diagnosis via medical imaging can significantly improve. The
specific goals of this project are to establish, identify and mitigate the impact
of VSD on visual search tasks in clinical settings.

Image Style Transfer with Generative Adversarial Networks

  • Ru Li

Image style transfer is a recently popular research field, which aims to learn the
mapping between different domains and involves different computer vision techniques.
Recently, Generative Adversarial Networks (GANs) have demonstrated their potential
for translating images from source domain X to target domain Y in the absence of paired
examples. However, such a translation is not guaranteed to generate results of high perceptual
quality. Although existing style transfer methods work well with relatively uniform
content, they often fail to capture geometric or structural patterns that reflect
the quality of generated images. The goal of this doctoral research is to investigate
image style transfer approaches and to design advanced and useful methods to solve
existing problems. Through preliminary experiments conducted so far, we demonstrate
our insights on image style translation approaches and present the directions
to be pursued in the future.

Annotation-Efficient Semantic Segmentation with Shape Prior Knowledge

  • Yuhang Lu

Deep learning methods have achieved great success on semantic segmentation in recent
years. But the training typically relies on large-scale fully-annotated ground truth
masks, which are difficult to obtain in practice. In this research, we study the problem
of reducing the annotation cost of segmentation network training with a focus on exploring
the shape prior knowledge of objects. In the context of three applications, we
study three types of shape priors. Specifically, we first exploit the implicit shape
prior of curve structures to propose a weakly supervised curve structure segmentation
method, and then explicitly formulate the shape prior of anatomical structures as
loss functions to propose a one-shot anatomical structures segmentation network. Last,
we try to generalize the shape constraint to arbitrary objects to propose a class-agnostic
few-shot segmentation framework. Experimental results show that our methods achieve
comparable or better performance than fully supervised segmentation methods at lower
annotation cost on the studied applications.

Neural-based Rendering and Application

  • Peng Dai

Rendering plays an important role in many fields such as virtual reality and film,
but its high dependence on computing resources and human experience hinders its application.
With the development of deep learning, neural rendering has attracted much attention
due to its impressive performance and its efficiency compared with traditional rendering. In this
paper, we mainly introduce two neural rendering works: one is rendering simulation
and the other is image-based novel view rendering. Moreover, we also discuss the potential
applications (e.g., data augmentation) based on the results of neural rendering, which
have received little attention.

Towards Bridging Video and Language by Caption Generation and Sentence Localization

  • Shaoxiang Chen

Various video understanding tasks (classification, tracking, action detection, etc.)
have been extensively studied in the multimedia and computer vision communities over
the recent years. While these tasks are important, we think that bridging video and
language is a more natural and intuitive way to interact with videos. Caption generation
and sentence localization are two representative tasks for connecting video and language,
and my research is focused on these two tasks. In this extended abstract, I present
approaches for tackling each of these tasks by exploiting fine-grained information
in videos, together with ideas about how these two tasks can be connected. So far,
my work has demonstrated that these two tasks share a common foundation, and by connecting
them to form a cycle, video and language can be more closely bridged. Finally, several
challenges and future directions will be discussed.

Situational Anomaly Detection in Multimedia Data under Concept Drift

  • Pratibha Kumari

Anomaly detection has been a very challenging and active area of research for decades,
particularly for video surveillance. However, most of the works detect predefined
anomaly classes using static models. These frameworks have limited applicability for
real-life surveillance where the data have concept drift. Under concept drift, the
distribution of both normal and anomaly classes changes over time. An event may change
its class from anomaly to normal or vice-versa. The non-adaptive frameworks do not
handle this drift. Additionally, the focus has been on detecting local anomalies,
such as a region of an image. In contrast, in CCTV-based monitoring, flagging unseen
anomalous situations can be of greater interest. Utilizing multiple sensory information
for anomaly detection has also received less attention. This extended abstract discusses
these gaps and possible solutions.

Dynamic Knowledge Distillation with Cross-Modality Knowledge Transfer

  • Guangzhi Wang

Supervised learning for vision tasks has achieved great success because of the advances
of deep learning research in many areas, such as high-quality datasets, network architectures
and regularization methods. In the vanilla deep learning paradigm, training a model
for visual tasks is mainly based on the provided training images and annotations.
Inspired by human learning with knowledge transfer, where information from multiple
modalities is considered, we propose to improve the performance of visual tasks by introducing
explicit knowledge extracted from other modalities. As the first step, we propose
to improve image classification performance by introducing linguistic knowledge as
additional constraints in model learning. This knowledge is represented as a set of
constraints to be jointly utilized with visual knowledge. To coordinate the training
dynamics, we propose to imbue our model with the ability to dynamically distill from multiple
knowledge sources. This is done via a model-agnostic knowledge weighting module which
guides the learning process and is updated via meta-steps during training. Preliminary
experiments on various benchmark datasets validate the efficacy of our method. Our
code will be made publicly available to ensure reproducibility.

SESSION: Session 23: Media Interpretation-II

WeClick: Weakly-Supervised Video Semantic Segmentation with Click Annotations

  • Peidong Liu
  • Zibin He
  • Xiyu Yan
  • Yong Jiang
  • Shu-Tao Xia
  • Feng Zheng
  • Hu Maowei

Compared with tedious per-pixel mask annotating, it is much easier to annotate data
by clicks, which costs only several seconds for an image. However, applying clicks
to learn a video semantic segmentation model has not been explored before. In this work,
we propose an effective weakly-supervised video semantic segmentation pipeline with
click annotations, called WeClick, for saving laborious annotating effort by segmenting
an instance of the semantic class with only a single click. Since detailed semantic
information is not captured by clicks, directly training with click labels leads to
poor segmentation predictions. To mitigate this problem, we design a novel memory
flow knowledge distillation strategy to exploit temporal information (named memory
flow) in abundant unlabeled video frames, by distilling the neighboring predictions
to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation
for model compression. In this case, WeClick learns compact video semantic segmentation
models with the low-cost click annotations during the training phase yet achieves
real-time and accurate models during the inference period. Experimental results on
Cityscapes and CamVid show that WeClick outperforms the state-of-the-art methods,
improves over the baseline by 10.24% mIoU, and achieves real-time execution.
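The vanilla knowledge distillation mentioned in the WeClick abstract can be illustrated with a minimal sketch. It assumes the common formulation in which a student matches temperature-softened teacher probabilities through a KL divergence scaled by the squared temperature. The function names, the temperature value, and the example logits are hypothetical, and the abstract does not state that WeClick uses exactly this loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the commonly used distillation loss, not
    necessarily the exact objective of the paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Hypothetical per-pixel logits over three classes.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
print(kd_loss(student, teacher))  # positive: the student disagrees with the teacher
print(kd_loss(teacher, teacher))  # zero: identical predictions
```

In a segmentation setting the same loss would be averaged over pixels; that aggregation is omitted here for brevity.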

Towards Cross-Granularity Few-Shot Learning: Coarse-to-Fine Pseudo-Labeling with Visual-Semantic

  • Jinhai Yang
  • Hua Yang
  • Lin Chen

Few-shot learning aims at rapidly adapting to novel categories with only a handful
of samples at test time, which has been predominantly tackled with the idea of meta-learning.
However, meta-learning approaches essentially learn across a variety of few-shot tasks
and thus still require large-scale training data with fine-grained supervision to
derive a generalized model, thereby involving prohibitive annotation cost. In this
paper, we advance the few-shot classification paradigm towards a more challenging
scenario, i.e., cross-granularity few-shot classification, where the model observes
only coarse labels during training while being expected to perform fine-grained classification
during testing. This task largely relieves the annotation cost since fine-grained
labeling usually requires strong domain-specific expertise. To bridge the cross-granularity
gap, we approximate the fine-grained data distribution by greedy clustering of each
coarse-class into pseudo-fine-classes according to the similarity of image embeddings.
We then propose a meta-embedder that jointly optimizes the visual- and semantic-discrimination,
both instance-wise and coarse class-wise, to obtain a good feature space for this
coarse-to-fine pseudo-labeling process. Extensive experiments and ablation studies
are conducted to demonstrate the effectiveness and robustness of our approach on three
representative datasets.
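As a rough illustration of splitting one coarse class into pseudo-fine classes by embedding similarity, the sketch below uses a simple greedy rule: each embedding joins the most similar existing cluster if its cosine similarity to that cluster's centroid reaches a threshold, and otherwise starts a new cluster. The rule, the threshold, and all names are assumptions made for illustration; the abstract does not specify the exact greedy procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_pseudo_labels(embeddings, threshold=0.9):
    """Assign a pseudo-fine label to every embedding of one coarse class.

    Assumption: a threshold-based greedy assignment with running-mean
    centroids stands in for the clustering described in the abstract.
    """
    centroids, counts, labels = [], [], []
    for e in embeddings:
        best, best_sim = -1, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(e, c)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No cluster is similar enough: open a new pseudo-fine class.
            centroids.append(list(e))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen centroid.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], e)]
            counts[best] = n + 1
            labels.append(best)
    return labels

# Hypothetical 2-D embeddings belonging to a single coarse class.
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.98)]
pseudo_labels = greedy_pseudo_labels(embeddings)
print(pseudo_labels)
```

The result depends on the order of the embeddings, which is typical of greedy schemes of this kind.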

Disentangled Representation Learning and Enhancement Network for Single Image De-Raining

  • Guoqing Wang
  • Changming Sun
  • Xing Xu
  • Jingjing Li
  • Zheng Wang
  • Zeyu Ma

In this paper, we present a disentangled representation learning and enhancement network
(DRLE-Net) to address the challenging single image de-raining problems, i.e., raindrop
and rain streak removal. Specifically, the DRLE-Net is formulated as a multi-task
learning framework, and an elegant knowledge transfer strategy is designed to train
the encoder of DRLE-Net to embed a rainy image into two separated latent spaces representing
the task (clean image reconstruction in this paper) relevant and irrelevant variations
respectively, such that only the essential task-relevant factors will be used by the
decoder of DRLE-Net to generate high-quality de-raining results. Furthermore, visual
attention information is modeled and fed into the disentangled representation learning
network to enhance the task-relevant factor learning. To facilitate the optimization
of the hierarchical network, a new adversarial loss formulation is proposed and used
together with the reconstruction loss to train the proposed DRLE-Net. Extensive experiments
are carried out for removing raindrops or rain streaks from both synthetic and real
rainy images, and DRLE-Net is demonstrated to produce significantly better results
than state-of-the-art models.

Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal

  • Lei Zhu
  • Zhaojing Luo
  • Wei Wang
  • Meihui Zhang
  • Gang Chen
  • Kaiping Zheng

Deep learning has made a tremendous impact on various applications in multimedia,
such as media interpretation and multimodal retrieval. However, deep learning models
usually require a large amount of labeled data to achieve satisfactory performance.
In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge
transfer from a label-rich source domain to a label-scarce target domain, thus potentially
alleviating the annotation requirement for deep learning models. However, we find that
contemporary domain adaptation methods for cross-domain image understanding perform
poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies
the domain adaptation problem under the scenario where source data can be noisy. Prior
methods on WSDA remove noisy source data and align the marginal distribution across
domains without considering the fine-grained semantic structure in the embedding space,
which leads to class misalignment, e.g., features of cats in the target
domain might be mapped near features of dogs in the source domain. In this paper,
we propose a novel method, termed Noise Tolerant Domain Adaptation (NTDA), for WSDA.
Specifically, we adopt the cluster assumption and learn clusters discriminatively with
class prototypes (centroids) in the embedding space. We propose to leverage the location
information of the data points in the embedding space and model the location information
with a Gaussian mixture model to identify noisy source data. We then design a network
which incorporates the Gaussian mixture noise model as a sub-module for unsupervised
noise removal and propose a novel cluster-level adversarial adaptation method based
on the Generative Adversarial Network (GAN) framework which aligns unlabeled target
data with the less noisy class prototypes for mapping the semantic structure across
domains. Finally, we devise a simple and effective algorithm to train the network
from end to end. We conduct extensive experiments to evaluate the effectiveness of
our method on both general images and medical images from COVID-19 and e-commerce
datasets. The results show that our method significantly outperforms state-of-the-art
WSDA methods.

Exploiting BERT for Multimodal Target Sentiment Classification through Input Space

  • Zaid Khan
  • Yun Fu

Multimodal target/aspect sentiment classification combines multimodal sentiment analysis
and aspect/target sentiment classification. The goal of the task is to combine vision
and language to understand the sentiment towards a target entity in a sentence. Twitter
is an ideal setting for the task because it is inherently multimodal, highly emotional,
and affects real world events. However, multimodal tweets are short and accompanied
by complex, possibly irrelevant images. We introduce a two-stream model that translates
images in input space using an object-aware transformer followed by a single-pass
non-autoregressive text generation approach. We then leverage the translation to construct
an auxiliary sentence that provides multimodal information to a language model. Our
approach increases the amount of text available to the language model and distills
the object-level information in complex images. We achieve state-of-the-art performance
on two multimodal Twitter datasets without modifying the internals of the language
model to accept multimodal data, demonstrating the effectiveness of our translation.
In addition, we explain a failure mode of a popular approach for aspect sentiment
analysis when applied to tweets. Our code is available at

Video Representation Learning with Graph Contrastive Augmentation

  • Jingran Zhang
  • Xing Xu
  • Fumin Shen
  • Yazhou Yao
  • Jie Shao
  • Xiaofeng Zhu

Contrastive-based self-supervised learning for image representations has significantly
closed the gap with supervised learning. A natural extension of image-based contrastive
learning methods to the video domain is to fully exploit the temporal structure presented
in videos. We propose a novel contrastive self-supervised video representation learning
framework, termed Graph Contrastive Augmentation (GCA), by constructing a video temporal
graph and devising a graph augmentation that is designed to enhance the correlation
across frames of videos and developing a new view for exploring temporal structure
in videos. Specifically, we construct the temporal graph in the video by leveraging
the relational knowledge behind the correlated sequence video features. Afterwards,
we apply the proposed graph augmentation to generate another graph view by incorporating
random corruption of the original graph to enhance the diversity of the intrinsic
structure of the temporal graph. To this end, we provide two different kinds of contrastive
learning methods to train our framework using temporal relationships concealed in
videos as self-supervised signals. We perform empirical experiments on downstream
tasks, action recognition and video retrieval, using the learned video representation,
and the results demonstrate that with the graph view of temporal structure, our proposed
GCA performs better than or on par with recent methods.

SESSION: Poster Session 4

An EM Framework for Online Incremental Learning of Semantic Segmentation

  • Shipeng Yan
  • Jiale Zhou
  • Jiangwei Xie
  • Songyang Zhang
  • Xuming He

Incremental learning of semantic segmentation has emerged as a promising strategy
for visual scene interpretation in the open-world setting. However, it remains challenging
to acquire novel classes in an online fashion for the segmentation task, mainly due
to its continuously-evolving semantic label space, partial pixelwise ground-truth
annotations, and constrained data availability. To address this, we propose an incremental
learning strategy that can quickly adapt deep segmentation models without catastrophic
forgetting, using streaming input data with pixel annotations on the novel classes
only. To this end, we develop a unified learning strategy based on the Expectation-Maximization
(EM) framework, which integrates an iterative relabeling strategy that fills in the
missing labels and a rehearsal-based incremental learning step that balances the stability-plasticity
of the model. Moreover, our EM algorithm adopts an adaptive sampling method to select
informative training data and a class-balancing training strategy in the incremental
model updates, both improving the efficacy of model learning. We validate our approach
on the PASCAL VOC 2012 and ADE20K datasets, and the results demonstrate its superior
performance over the existing incremental methods.

I2V-GAN: Unpaired Infrared-to-Visible Video Translation

  • Shuang Li
  • Bingfeng Han
  • Zhenjie Yu
  • Chi Harold Liu
  • Kai Chen
  • Shuigen Wang

Human vision is often adversely affected by complex environmental factors, especially
in night vision scenarios. Thus, infrared cameras are often leveraged to help enhance
the visual effects via detecting infrared radiation in the surrounding environment,
but the infrared videos are undesirable due to the lack of detailed semantic information.
In such a case, an effective video-to-video translation method from the infrared domain
to the visible light counterpart is strongly needed to overcome the intrinsic huge
gap between infrared and visible fields. To address this challenging problem, we propose
an infrared-to-visible (I2V) video translation method, I2V-GAN, to generate fine-grained
and spatio-temporally consistent visible light videos given unpaired infrared videos.
Technically, our model capitalizes on three types of constraints: 1) adversarial constraint
to generate synthetic frames that are similar to the real ones, 2) cyclic consistency
with the introduced perceptual loss for effective content conversion as well as style
preservation, and 3) similarity constraints across and within domains to enhance the
content and motion consistency in both spatial and temporal spaces at a fine-grained
level. Furthermore, the currently publicly available infrared and visible light datasets
are mainly used for object detection or tracking, and some are composed of discontinuous
images which are not suitable for video tasks. Thus, we provide a new dataset for
infrared-to-visible video translation, which is named IRVI. Specifically, it has 12
consecutive video clips of vehicle and monitoring scenes, and both infrared and visible
light videos can be split into 24352 frames. Comprehensive experiments on IRVI validate
that I2V-GAN is superior to the compared state-of-the-art methods in the translation
of infrared-to-visible videos with higher fluency and finer semantic details. Moreover,
additional experimental results on the flower-to-flower dataset indicate I2V-GAN is
also applicable to other video translation tasks. The code and IRVI dataset are available

Implicit Feedbacks are Not Always Favorable: Iterative Relabeled One-Class Collaborative
Filtering against Noisy Interactions

  • Zitai Wang
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

Due to privacy concerns, there is growing interest in the recommender system community in
the One-class Collaborative Filtering (OCCF) framework, which predicts user preferences
only based on binary implicit feedback (e.g., click or not-click, rated or unrated).
The major challenge in the OCCF problem stems from the inherent noise in implicit interaction.
Previous approaches have taken into account the noise in unobserved interactions (i.e.,
not-click only means a missing value, rather than negative feedback). However, they
generally ignore the noise in observed interactions (i.e., click does not necessarily
represent positive feedback), which might induce performance degradation. To address
this issue, we propose a novel iterative relabeling framework to jointly mitigate
the noise in both observed and unobserved interactions. As the core of the framework,
the iterative relabeling module exploits the self-training principle to dynamically
generate pseudo labels for user preferences. The downstream module for a recommendation
task is then trained with the refreshed labels where the noisy patterns are largely
alleviated. Finally, extensive experiments on three real-world datasets demonstrate
the effectiveness of our proposed methods.

InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation

  • Dahu Shi
  • Xing Wei
  • Xiaodong Yu
  • Wenming Tan
  • Ye Ren
  • Shiliang Pu

Multi-person pose estimation is an attractive and challenging task. Existing methods
are mostly based on two-stage frameworks, which include top-down and bottom-up methods.
Two-stage methods either suffer from high computational redundancy for additional
person detectors or they need to group keypoints heuristically after predicting all
the instance-agnostic keypoints. The single-stage paradigm aims to simplify the multi-person
pose estimation pipeline and receives a lot of attention. However, recent single-stage
methods have the limitation of low performance due to the difficulty of regressing
various full-body poses from a single feature vector. Different from previous solutions
that involve complex heuristic designs, we present a simple yet effective solution
by employing instance-aware dynamic networks. Specifically, we propose an instance-aware
module to adaptively adjust (part of) the network parameters for each instance. Our
solution can significantly increase the capacity and adaptive-ability of the network
for recognizing various poses, while maintaining a compact end-to-end trainable pipeline.
Extensive experiments on the MS-COCO dataset demonstrate that our method achieves
significant improvement over existing single-stage methods, and strikes a better balance
between accuracy and efficiency than the state-of-the-art two-stage approaches.

Implicit Feature Refinement for Instance Segmentation

  • Lufan Ma
  • Tiancai Wang
  • Bin Dong
  • Jiangpeng Yan
  • Xiu Li
  • Xiangyu Zhang

We propose a novel implicit feature refinement module for high-quality instance segmentation.
Existing image/video instance segmentation methods rely on explicitly stacked convolutions
to refine instance features before the final prediction. In this paper, we first give
an empirical comparison of different refinement strategies, which reveals that the
widely-used four consecutive convolutions are not necessary. As an alternative, weight-sharing
convolution blocks provide competitive performance. When such a block is iterated
infinitely many times, the block output will eventually converge to an equilibrium state.
Based on this observation, the implicit feature refinement (IFR) is developed by constructing
an implicit function. The equilibrium state of instance features can be obtained by
fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several
advantages: 1) it simulates an infinite-depth refinement network while only requiring
the parameters of a single residual block; 2) it produces high-level equilibrium instance features
with a global receptive field; 3) it serves as a plug-and-play general module easily extended
to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks
show that our IFR achieves improved performance on state-of-the-art image/video instance
segmentation frameworks, while reducing the parameter burden (e.g. 1% AP improvement
on Mask R-CNN with only 30.0% parameters in mask head). Code will be made available.
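The idea of iterating one weight-shared block until its output stops changing can be sketched with a plain fixed-point iteration. The block below is a hypothetical scalar stand-in (a tanh of a scaled state plus the input), chosen only because it is a contraction and therefore guaranteed to converge. The names `refine_block` and `fixed_point`, the weight, and the tolerance are all assumptions, and the sketch does not reproduce the IFR module or its solver.

```python
import math

def refine_block(z, x, weight=0.5):
    """Hypothetical weight-shared block: maps state z to a new state given input x.

    With |weight| < 1 the map is a contraction in z, so repeated
    application converges to a unique equilibrium.
    """
    return [math.tanh(weight * zi + xi) for zi, xi in zip(z, x)]

def fixed_point(x, tol=1e-10, max_iter=1000):
    """Iterate z <- refine_block(z, x) until successive states differ by less than tol."""
    z = [0.0] * len(x)
    for step in range(1, max_iter + 1):
        z_next = refine_block(z, x)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

# Hypothetical input features for a single instance.
x = [0.3, -1.2, 0.8]
z_star, steps = fixed_point(x)

# At equilibrium, applying the block once more leaves the state unchanged.
residual = max(abs(a - b) for a, b in zip(refine_block(z_star, x), z_star))
print(steps, residual)
```

The same parameters are reused at every step, which is why the depth of the simulated network does not increase the parameter count.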

Question-controlled Text-aware Image Captioning

  • Anwen Hu
  • Shizhe Chen
  • Qin Jin

For an image with multiple scene texts, different people may be interested in different
text information. Current text-aware image captioning models are not able to generate
distinctive captions according to various information needs. To explore how to generate
personalized text-aware captions, we define a new challenging task, namely Question-controlled
Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this
task requires models to understand questions, find related scene texts and describe
them together with objects fluently in human language. Based on two existing text-aware
captioning datasets, we automatically construct two datasets, ControlTextCaps and
ControlVizWiz to support the task. We propose a novel Geometry and Question Aware
Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level
object features and region-level scene text features while considering spatial relationships.
Then, we design a Question-guided Encoder to select the most relevant visual features
for each question. Finally, GQAM generates a personalized text-aware caption with
a Multimodal Decoder. Our model achieves better captioning performance and question
answering ability than carefully designed baselines on both datasets. With questions
as control signals, our model generates more informative and diverse captions than
the state-of-the-art text-aware captioning model. Our code and datasets are publicly
available at

Style-Aware Image Recommendation for Social Media Marketing

  • Yiwei Zhang
  • Toshihiko Yamasaki

Social media have become a popular platform for brands to allocate marketing budget
and build their relationship with customers. Posting images with a consistent concept
on social media helps customers recognize, remember, and consider brands. This strategy
is known as brand concept consistency in marketing literature. Consequently, brands
spend immense manpower and financial resources in choosing which images to post or
repost. Therefore, automatically recommending images with a consistent brand concept
is a necessary task for social media marketing. In this paper, we propose a content-based
recommendation system that learns the concept of brands and recommends images that
are coherent with the brand. Specifically, brand representation is performed from
the brand posts on social media. Existing methods rely on visual features extracted
by pre-trained neural networks, which can represent objects in the image but not the
style of the image. To bridge this gap, a framework using both object and style vectors
as input is proposed to learn the brand representation. In addition, we show that
the proposed method can not only be applied to brands but also be applied to influencers.
We collected a new Instagram influencer dataset, consisting of 616 influencers and
about 1 million images, which can greatly benefit future research in this area. The
experimental results on two large-scale Instagram datasets show the superiority of
the proposed method over state-of-the-art methods.

WePerson: Learning a Generalized Re-identification Model from All-weather Virtual

  • He Li
  • Mang Ye
  • Bo Du

The aim of person re-identification (Re-ID) is retrieving a person of interest across
multiple non-overlapping cameras. Re-ID has advanced significantly
in recent years. However, real data annotation is costly and model generalization
ability is hindered by the lack of large-scale and diverse data. To address this problem,
we propose a Weather Person pipeline that can generate a synthesized Re-ID dataset
with different weather, scenes, and natural lighting conditions automatically. The
pipeline is built on top of a game engine which contains a digital city, weather
and lighting simulation system, and various character models with manifold dressing.
To train a generalizable Re-ID model from the large-scale virtual WePerson dataset,
we design an adaptive sample selection strategy to close the domain gap and avoid
redundancy. We also design an informative sampling method for a mini-batch sampler
to accelerate the learning process. In addition, an efficient training method is introduced
by adopting instance normalization to capture identity invariant components from various
appearances. We evaluate our pipeline using direct transfer on 3 widely-used real-world
benchmarks, achieving competitive performance without any real-world image training.
This dataset starts the attempt to evaluate diverse environmental factors in a controllable
virtual engine, which provides important guidance for future generalizable Re-ID model
design. Notably, we improve the current state-of-the-art accuracy from 38.5% to 46.4%
on the challenging MSMT17 dataset. Dataset and code are available at

Polar Ray: A Single-stage Angle-free Detector for Oriented Object Detection in Aerial

  • Shuai Liu
  • Lu Zhang
  • Shuai Hao
  • Huchuan Lu
  • You He

Oriented bounding boxes are widely used for object detection in aerial images. Existing
oriented object detection methods typically follow the general object detection paradigm
by adding an extra rotation angle on the horizontal bounding boxes. However, the angular
periodicity causes difficulty in angle regression and rotation sensitivity on
bounding boxes. In this paper, we propose a new anchor-free oriented object detector,
Polar Ray Network (PRNet), where object keypoints are represented by polar coordinates
without angle regression. Our PRNet learns a set of polar rays from the object center
to the boundary with predefined equally distributed angles. We introduce a dynamic PointConv
module to optimize the regression of polar rays by incorporating object corner features.
Furthermore, a classification feature guidance module is presented to improve the
classification accuracy by incorporating more spatial contents from polar rays. Experimental
results on two public datasets, i.e., DOTA and HRSC2016, demonstrate that the proposed
PRNet significantly outperforms existing anchor-free detectors, and is highly competitive
with the state-of-the-art two-stage anchor-based methods.
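To make the polar-ray representation concrete, the sketch below decodes a set of ray lengths, taken at equally spaced predefined angles around an object center, into boundary points. The number of rays, the starting angle of zero, the counter-clockwise ordering, and the function name are assumptions for illustration; the abstract does not give these details.

```python
import math

def decode_polar_rays(center, ray_lengths):
    """Convert ray lengths at equally spaced angles into (x, y) boundary points.

    Assumption: ray k points at angle 2*pi*k/n measured from the positive
    x-axis, where n is the number of rays.
    """
    cx, cy = center
    n = len(ray_lengths)
    points = []
    for k, r in enumerate(ray_lengths):
        theta = 2.0 * math.pi * k / n  # predefined angle, never regressed
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

# Hypothetical prediction: a center and four ray lengths.
pts = decode_polar_rays((10.0, 20.0), [2.0, 1.0, 2.0, 1.0])
for x, y in pts:
    print(round(x, 6), round(y, 6))
```

Because the angles are fixed in advance, only the ray lengths need to be predicted, which avoids regressing a periodic angle.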

Self-Contrastive Learning with Hard Negative Sampling for Self-supervised Point Cloud

  • Bi'an Du
  • Xiang Gao
  • Wei Hu
  • Xin Li

Point clouds have attracted increasing attention. Significant progress has been made
in methods for point cloud analysis, which often requires costly human annotation
as supervision. To address this issue, we propose a novel self-contrastive learning
method for self-supervised point cloud representation learning, aiming to capture both local
geometric patterns and nonlocal semantic primitives based on the nonlocal self-similarity
of point clouds. The contributions are two-fold: on the one hand, instead of contrasting
among different point clouds as commonly employed in contrastive learning, we exploit
self-similar point cloud patches within a single point cloud as positive samples and
otherwise negative ones to facilitate the task of contrastive learning. On the other
hand, we actively learn hard negative samples that are close to positive samples for
discriminative feature learning, which are sampled conditional on each anchor patch
by leveraging the degree of self-similarity. Experimental results show that the proposed
method achieves state-of-the-art performance on widely used benchmark datasets for
self-supervised point cloud segmentation and transfer learning for classification.

Generally Boosting Few-Shot Learning with HandCrafted Features

  • Yi Zhang
  • Sheng Huang
  • Fengtao Zhou

Existing Few-Shot Learning (FSL) methods predominantly focus on developing different
types of sophisticated models to extract the transferable prior knowledge for recognizing
novel classes, while paying little attention to the feature learning part of
FSL, which often simply leverages a well-known CNN as the feature learner. However,
features are the core medium for encoding such transferable knowledge. Feature learning
easily overfits, particularly when training data are scarce,
and thereby degrades the performance of FSL. The handcrafted features, such
as Histogram of Oriented Gradient (HOG) and Local Binary Pattern (LBP), have no requirement
on the amount of training data, and used to perform quite well in many small-scale
data scenarios, since their extraction involves no learning process and is mainly
based on empirically observed and summarized prior feature engineering knowledge.
In this paper, we intend to develop a general and simple approach for generally boosting
FSL via exploiting such prior knowledge in the feature learning phase. To this end,
we introduce two novel handcrafted feature regression modules, namely HOG and LBP
regression, to the feature learning parts of deep learning-based FSL models. These
two modules are separately plugged into different convolutional layers of the backbone
based on the characteristics of the corresponding handcrafted features to guide the
backbone optimization at different feature granularities, and also ensure that the
learned features encode the handcrafted feature knowledge, which improves the generalization
ability of the features and alleviates the over-fitting of the models. Three recent state-of-the-art
FSL approaches are leveraged for examining the effectiveness of our method. Extensive
experiments on miniImageNet, CIFAR-FS and FC100 datasets show that the performance
of all these FSL approaches is improved by applying our method on all three
datasets. Our code and models have been released.
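For readers unfamiliar with handcrafted descriptors, the sketch below computes a basic Local Binary Pattern code for the center pixel of a 3x3 patch, which shows why such features need no training data. The clockwise neighbor ordering starting at the top-left and the "greater than or equal" comparison are one common convention and are assumptions here; the sketch does not reproduce the regression modules proposed in the paper.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the center pixel of a 3x3 patch.

    Assumption: neighbors are visited clockwise from the top-left corner,
    and a neighbor contributes a 1 bit when it is >= the center value.
    """
    center = patch[1][1]
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(offsets):
        if patch[row][col] >= center:
            code |= 1 << bit
    return code

# Hypothetical grayscale patch.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch)
print(code)
```

A full LBP descriptor would compute this code at every pixel and summarize the codes in a histogram.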

ROECS: A Robust Semi-direct Pipeline Towards Online Extrinsics Correction of the Surround-view

  • Tianjun Zhang
  • Nlong Zhao
  • Ying Shen
  • Xuan Shao
  • Lin Zhang
  • Yicong Zhou

Generally, a surround-view system (SVS), which is an indispensable component of advanced
driving assistant systems (ADAS), consists of four to six wide-angle fisheye cameras.
As long as both intrinsics and extrinsics of all cameras have been calibrated, a top-down
surround-view with the real scale can be synthesized at runtime from fisheye images
captured by these cameras. However, when the vehicle is driving on the road, relative
poses between cameras in the SVS may change from the initial calibrated states due
to bumps or collisions. If the extrinsics' representations are not adjusted
accordingly, obvious geometric misalignment will appear in the surround-view. Currently,
research on correcting the extrinsics of the SVS in an online manner is quite
sporadic, and a mature and robust pipeline is still lacking. As an attempt to fill
this research gap to some extent, in this work, we present a novel extrinsics correction
pipeline designed especially for the SVS, namely ROECS (Robust Online Extrinsics Correction
of the Surround-view system). Specifically, a "refined bi-camera error" model is first
designed. Then, by minimizing the overall "bi-camera error" within a sparse and semi-direct
framework, the SVS's extrinsics can be iteratively optimized and become accurate eventually.
Besides, an innovative three-step pixel selection strategy is also proposed. The superior
robustness and the generalization capability of ROECS are validated by both quantitative
and qualitative experimental results. To make the results reproducible, the collected
data and the source code have been released at

Pseudo Graph Convolutional Network for Vehicle ReID

  • Wen Qian
  • Zhiqun He
  • Silong Peng
  • Chen Chen
  • Wei Wu

Image-based Vehicle ReID methods have suffered from limited information caused by
viewpoints, illumination, and occlusion as they usually use a single image as input.
Graph convolutional methods (GCN) can alleviate the aforementioned problem by aggregating
neighbor samples' information to enhance the feature representation. However, the
inference process of GCN-based methods is computationally expensive, since
they need to iterate over all samples to search for the neighbor nodes. In this paper,
we propose the first Pseudo-GCN Vehicle ReID method (PGVR), which enables a CNN-based
module to perform competitively with GCN-based methods with a faster and more lightweight
inference process. To enable the Pseudo-GCN mechanism, a two-branch network and a
graph-based knowledge distillation are proposed. The two-branch network consists of
a CNN-based student branch and a GCN-based teacher branch. The GCN-based teacher branch
adopts a ReID-based GCN to learn the topological optimization ability under the supervision
of ReID tasks during training time. Moreover, the graph-based knowledge distillation
explicitly transfers the topological optimization ability from the teacher branch
to the student branch which acknowledges all nodes. We evaluate our proposed method
PGVR on three mainstream Vehicle ReID benchmarks and demonstrate that PGVR achieves
state-of-the-art performance.

Towards Fast and High-Quality Sign Language Production

  • Wencan Huang
  • Wenwen Pan
  • Zhou Zhao
  • Qi Tian

Sign Language Production (SLP) aims to automatically translate a spoken language description
to its corresponding sign language video. The core procedure of SLP is to transform
sign gloss intermediaries into sign pose sequences (G2P). Most existing methods for
G2P are based on sequential autoregression or sequence-to-sequence encoder-decoder
learning. However, by generating target pose frames conditioned on the previously
generated ones, these models are prone to issues such as error accumulation
and high inference latency. In this paper, we argue that such issues are mainly caused
by the autoregressive manner of generation. Hence, we propose a novel Non-AuToregressive (NAT)
model with a parallel decoding scheme, as well as an External Aligner for sequence
alignment learning. Specifically, we extract alignments from the external aligner
by monotonic alignment search for gloss duration prediction, which is used by a length
regulator to expand the source gloss sequence to match the length of the target sign
pose sequence for parallel sign pose generation. Furthermore, we devise a spatial-temporal
graph convolutional pose generator in the NAT model to generate smoother and more
natural sign pose sequences. Extensive experiments conducted on the PHOENIX14T dataset
show that our proposed model outperforms state-of-the-art autoregressive models in
terms of speed and quality.
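
The length regulator step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `length_regulate`, the example glosses, and the durations are all hypothetical, and in the described model the durations would come from the external aligner rather than being hard-coded.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(glosses: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each gloss (or gloss embedding) by its predicted duration.

    The expanded sequence has one entry per target pose frame, so a
    decoder can generate all frames in parallel instead of one by one.
    """
    if len(glosses) != len(durations):
        raise ValueError("need exactly one duration per gloss")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    expanded: List[T] = []
    for gloss, duration in zip(glosses, durations):
        expanded.extend([gloss] * duration)
    return expanded


# Hypothetical example: three glosses mapped onto six pose frames.
frames = length_regulate(["HELLO", "WORLD", "TODAY"], [2, 3, 1])
```

Here `frames` has length 6, the sum of the durations, which is what lets the target sequence length be fixed before decoding starts.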

Effective De-identification Generative Adversarial Network for Face Anonymization

  • Zhenzhong Kuang
  • Huigui Liu
  • Jun Yu
  • Aikui Tian
  • Lei Wang
  • Jianping Fan
  • Noboru Babaguchi

The growing application of face images and modern AI technology has raised another
important concern in privacy protection. In many real scenarios like scientific research,
social sharing and commercial application, lots of images are released without privacy
processing to protect people's identity. In this paper, we develop a novel effective
de-identification generative adversarial network (DeIdGAN) for face anonymization
by seamlessly replacing a given face image with a different synthesized yet realistic
one. Our approach consists of two steps. First, we anonymize the input face to obfuscate
its original identity. Then, we use our designed de-identification generator to synthesize
an anonymized face. During the training process, we leverage a pair of identity-adversarial
discriminators to explicitly constrain identity protection by pushing the synthesized
face away from the predefined sensitive faces to resist re-identification and identity
invasion. Finally, we validate the effectiveness of our approach on public datasets.
Compared with existing methods, our approach not only achieves better identity
protection rates but also preserves superior image quality and data reusability,
suggesting state-of-the-art performance.

Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace

  • Ricardo Guerrero
  • Hai X. Pham
  • Vladimir Pavlovic

Computational food analysis (CFA) naturally requires multi-modal evidence of a particular
food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal
shared representation learning, which aims to create a joint representation of the
multiple views (text and image) of the data. In this work we propose a method for
food domain cross-modal shared representation learning that preserves the vast semantic
richness present in the food data. Our proposed method employs an effective transformer-based
multilingual recipe encoder coupled with a traditional image embedding architecture.
Here, we propose the use of imperfect multilingual translations to effectively regularize
the model while at the same time adding functionality across multiple languages and
alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation
learned via the proposed method significantly outperforms the current state of the art
(SOTA) on retrieval tasks. Furthermore, the representational power of the learned
representation is demonstrated through a generative food image synthesis model conditioned
on recipe embeddings. Synthesized images can effectively reproduce the visual appearance
of paired samples, indicating that the learned representation captures the joint semantics
of both the textual recipe and its visual content, thus narrowing the modality gap.

When Face Completion Meets Irregular Holes: An Attributes Guided Deep Inpainting Network

  • Jie Xiao
  • Dandan Zhan
  • Haoran Qi
  • Zhi Jin

Many convolutional neural network (CNN)-based methods have been proposed for
face completion with regular holes. However, in practical applications, irregular
holes are more common. Moreover, due to the distinct attributes and large variation
in appearance of human faces, it is more challenging to fill irregular holes in face
images while keeping the content consistent with the remaining region. Since facial attributes
(e.g., gender, smiling, pointy nose, etc.) allow for a more understandable description
of one face, they can provide some hints that benefit the face completion task. In
this work, we propose a novel attributes-guided face completion network (AttrFaceNet),
which comprises a facial attribute prediction subnet and a face completion subnet.
The attribute prediction subnet predicts facial attributes from the remaining parts of
the corrupted images and guides the face completion subnet to fill the missing regions.
The proposed AttrFaceNet is evaluated in an end-to-end way on the commonly used
CelebA and Helen datasets. Extensive experimental results show that our method outperforms
state-of-the-art methods qualitatively and quantitatively, especially for large mask
sizes. Code is available at

Non-Linear Fusion for Self-Paced Multi-View Clustering

  • Zongmo Huang
  • Yazhou Ren
  • Xiaorong Pu
  • Lifang He

With the advance of multi-media and multi-modal data, multi-view clustering (MVC)
has drawn increasing attention recently. In this field, one of the most crucial challenges
is that the characteristics and qualities of different views usually vary extensively.
Therefore, it is essential for MVC methods to find an effective approach that handles
the diversity of multiple views appropriately. To this end, a series of MVC methods
focusing on how to integrate the loss from each view have been proposed in the past
few years. Among these methods, the mainstream idea is assigning weights to each view
and then combining them linearly. In this paper, inspired by the effectiveness of
non-linear combination in instance learning and the auto-weighted approaches, we propose
Non-Linear Fusion for Self-Paced Multi-View Clustering (NSMVC), which is totally different
from the conventional linear-weighting algorithms. In NSMVC, we directly assign
different exponents to different views according to their qualities. In this way,
the negative impact from the corrupt views can be significantly reduced. Meanwhile,
to address the non-convex issue of the MVC model, we further define a novel regularizer-free
modality of Self-Paced Learning (SPL), which fits the proposed non-linear model perfectly.
Experimental results on various real-world data sets demonstrate the effectiveness
of the proposed method.

Counterfactual Debiasing Inference for Compositional Action Recognition

  • Pengzhan Sun
  • Bo Wu
  • Xunsong Li
  • Wen Li
  • Lixin Duan
  • Chuang Gan

Compositional action recognition is a novel challenge in the computer vision community
and focuses on revealing the different combinations of verbs and nouns instead of
treating subject-object interactions in videos as individual instances only. Existing
methods tackle this challenging task by simply ignoring appearance information or
fusing object appearances with dynamic instance tracklets. However, those strategies
usually do not perform well for unseen action instances. To address this, in this work we
propose a novel learning framework called Counterfactual Debiasing Network (CDN) to
improve the model generalization ability by removing the interference introduced by
visual appearances of objects/subjects. It explicitly learns the appearance information
in action representations and later removes the effect of such information in a causal
inference manner. Specifically, we use tracklets and video content to model the factual
inference by considering both appearance information and structure information. In
contrast, only video content with appearance information is leveraged in the counterfactual
inference. With the two inferences, we construct a causal graph that captures and removes
the bias introduced by the appearance information by subtracting the result of the
counterfactual inference from that of the factual inference. By doing that, our proposed
CDN method can better recognize unseen action instances by debiasing the effect of
appearances. Extensive experiments on the Something-Else dataset clearly show the
effectiveness of our proposed CDN over existing state-of-the-art methods.
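
The subtraction of the counterfactual result from the factual result can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: the per-class scores are made-up numbers, and `debias` and `argmax` are hypothetical helper names.

```python
from typing import List, Sequence


def debias(factual: Sequence[float], counterfactual: Sequence[float]) -> List[float]:
    """Subtract appearance-only (counterfactual) scores from factual scores."""
    if len(factual) != len(counterfactual):
        raise ValueError("score vectors must cover the same classes")
    return [f - c for f, c in zip(factual, counterfactual)]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical per-class scores for one video (three action classes).
# factual: prediction from tracklets plus video content.
# counterfactual: prediction from video content with appearance only.
factual = [2.0, 2.5, 0.5]
counterfactual = [0.2, 1.5, 0.1]

debiased = debias(factual, counterfactual)
biased_prediction = argmax(factual)
debiased_prediction = argmax(debiased)
```

In this toy case class 1 wins on the factual scores largely because appearance alone already favors it; after the subtraction, class 0 has the highest remaining score.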

STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition

  • Yuhan Zhang
  • Bo Wu
  • Wen Li
  • Lixin Duan
  • Chuang Gan

Skeleton-based action recognition has been widely investigated owing to its strong
adaptability to dynamic circumstances and complicated backgrounds. To recognize different
actions from skeleton sequences, it is essential and crucial to model the posture
of the human represented by the skeleton and its changes in the temporal dimension.
However, most existing works treat skeleton sequences in the temporal and spatial
dimensions in the same way, ignoring the difference between the two
in skeleton data, which is not an optimal way to model skeleton sequences.
We propose to model the posture represented by the skeleton in each frame individually.
Meanwhile, the movement of the entire skeleton in the temporal dimension
also needs to be captured. We therefore design a Spatial Transformer Block and a Directional Temporal Transformer
Block for modeling skeleton sequences in the spatial and temporal dimensions, respectively.
Due to factors such as occlusion, the sensor, and the raw video, there is noise in both the temporal and spatial
dimensions of the extracted skeleton data, reducing the recognition capability of
models. To adapt to this imperfect information condition, we propose a multi-task
self-supervised learning method by providing confusing samples in different situations
to improve the robustness of our model. Combining the above designs, we propose our
Spatial-Temporal Specialized Transformer (STST) and conduct experiments with our model
on the SHREC, NTU-RGB+D, and Kinetics-Skeleton datasets. Extensive experimental results and analysis
demonstrate the improved performance of the proposed method.

Exploring Gradient Flow Based Saliency for DNN Model Compression

  • Xinyu Liu
  • Baopu Li
  • Zhen Chen
  • Yixuan Yuan

Model pruning aims to reduce the deep neural network (DNN) model size or computational
overhead. Traditional model pruning methods, such as l-1 pruning, which evaluate the
channel significance of a DNN, pay too much attention to the local analysis of each
channel and make use of the magnitude of the entire feature while ignoring its relevance
to the batch normalization (BN) and ReLU layers after each convolutional operation.
To overcome these problems, we propose a new model pruning method from a new perspective
of gradient flow in this paper. Specifically, we first theoretically analyze the channel's
influence based on Taylor expansion by integrating the effects of BN layer and ReLU
activation function. Then, the incorporation of the first-order Taylor polynomial
of the scaling parameter and the shifting parameter in the BN layer is suggested to
effectively indicate the significance of a channel in a DNN. Comprehensive experiments
on both image classification and image denoising tasks demonstrate the superiority
of the proposed novel theory and scheme. Code is available at
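
As an illustration only (this is not the authors' released code), a first-order Taylor score built from the BN scaling and shifting parameters might be sketched as follows. The exact criterion in the paper may differ; the formula |gamma * dL/dgamma + beta * dL/dbeta|, the function names, and all numbers below are assumptions made for this sketch.

```python
from typing import List, Sequence


def bn_taylor_saliency(
    gamma: Sequence[float],
    beta: Sequence[float],
    grad_gamma: Sequence[float],
    grad_beta: Sequence[float],
) -> List[float]:
    """Per-channel score: first-order estimate of the loss change when a
    channel's BN scale (gamma) and shift (beta) are both set to zero."""
    return [
        abs(g * dg + b * db)
        for g, b, dg, db in zip(gamma, beta, grad_gamma, grad_beta)
    ]


def channels_to_prune(saliency: Sequence[float], prune_ratio: float) -> List[int]:
    """Indices of the lowest-scoring channels, returned in ascending order."""
    k = int(len(saliency) * prune_ratio)
    order = sorted(range(len(saliency)), key=lambda i: saliency[i])
    return sorted(order[:k])


# Hypothetical BN parameters and loss gradients for a four-channel layer.
gamma = [1.0, 0.5, 2.0, 0.1]
beta = [0.0, 1.0, -1.0, 0.2]
grad_gamma = [0.5, -0.2, 0.1, 0.0]
grad_beta = [0.3, 0.1, 0.4, -0.5]

scores = bn_taylor_saliency(gamma, beta, grad_gamma, grad_beta)
pruned = channels_to_prune(scores, prune_ratio=0.5)
```

Note that channel 2 has the largest scale parameter yet only the third-highest score, which is the kind of case where a magnitude-only criterion and a gradient-based criterion disagree.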

An Adaptive Iterative Inpainting Method with More Information Exploration

  • Shengjie Chen
  • Zhenhua Guo
  • Bo Yuan

CNN-based image inpainting methods have achieved promising performance because
of their outstanding semantic understanding and reasoning potential. However, previous
works could not obtain satisfactory results in some situations because information is not
fully explored. In this paper, we propose a new method by combining three innovative
ideas. First, to increase the diversity of the semantic information obtained by the
network in image synthesis, we propose a multiple hidden space perceptual (MHSP) loss,
which extracts high-level features from multiple pre-trained autoencoders. Second,
we adopt an adaptive iterative reasoning (AIR) strategy to reduce the calculations
under small-hole circumstances while ensuring the performance in large-hole circumstances.
Third, we find that color inconsistencies occasionally occur in the final image
merging process, so we add a novel interval maximum saturation (IMS) loss to the final
loss function. Experiments on the benchmark datasets show our method performs favorably
against state-of-the-art approaches. Code is made publicly available at:

Assisting News Media Editors with Cohesive Visual Storylines

  • Gonçalo Marcelino
  • David Semedo
  • André Mourão
  • Saverio Blasi
  • João Magalhães
  • Marta Mrak

Creating a cohesive, high-quality, relevant, media story is a challenge that news
media editors face on a daily basis. This challenge is aggravated by the flood of
highly-relevant information that is constantly pouring into the newsroom. To assist
news media editors in this daunting task, this paper proposes a framework to organize
news content into cohesive, high-quality, relevant visual storylines. First, we formalize,
in a nonsubjective manner, the concept of visual story transition. Leveraging it,
we propose four graph-based methods of storyline creation, aiming for global story
cohesiveness. These were created and implemented to take full advantage of existing
graph algorithms, ensuring their correctness and good computational performance. They
leverage a strong ensemble-based estimator which was trained to predict story transition
quality based on both the semantic and visual features present in the pair of images
under scrutiny. A user study covered a total of 28 curated stories about sports and
cultural events. Experiments showed that (i) visual transitions in storylines can
be learned with a quality above 90%, and (ii) the proposed graph methods can produce
cohesive storylines with a quality in the range of 88% to 96%.

MM-Flow: Multi-modal Flow Network for Point Cloud Completion

  • Yiqiang Zhao
  • Yiyao Zhou
  • Rui Chen
  • Bin Hu
  • Xiding Ai

Point clouds are often noisy and incomplete. Existing completion methods usually generate
the complete shapes for missing regions of 3D objects based on deterministic learning
frameworks, which only predict a single reconstruction output. However, these methods
ignore the ill-posed nature of the completion problem and do not fully account for
multiple possible completion predictions corresponding to one incomplete input. To
address this problem, we propose a flow-based network together with a multi-modal
mapping strategy for 3D point cloud completion. Specifically, an encoder is first introduced
to encode the input point cloud data into a rich latent representation suitable for
conditioning in all flow-layers. Then we design a conditional normalizing flow architecture
to learn the exact distribution of the plausible completion shapes over the multi-modal
latent space. Finally, in order to fully utilize additional shape information, we
propose a tree-structured decoder to perform the inverse mapping for complete shape
generation with high fidelity. The proposed flow network is trained using a single
loss named the negative log-likelihood to capture the distribution variations between
input and output, without complex reconstruction loss and adversarial loss. Extensive
experiments on the ShapeNet and KITTI datasets and on measured data demonstrate that
our method outperforms the state-of-the-art point cloud completion methods through
qualitative and quantitative analysis.
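
For reference, the negative log-likelihood of a conditional normalizing flow is usually written with the change-of-variables formula below. This is the standard textbook form, not a formula taken from the paper, and all symbols are assumptions of this sketch: y is a complete shape, c is the encoder's latent representation of the incomplete input, f_theta is the invertible flow, and p_Z is the base density over the latent space.

```latex
\mathcal{L}_{\mathrm{NLL}}(\theta)
  = -\log p_Z\bigl(f_\theta(y; c)\bigr)
    - \log \left| \det \frac{\partial f_\theta(y; c)}{\partial y} \right|
```

Because f_theta is invertible, sampling different latent codes z from p_Z and applying the inverse mapping under the same condition c yields different plausible completions for one incomplete input.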

Long-tailed Distribution Adaptation

  • Zhiliang Peng
  • Wei Huang
  • Zonghao Guo
  • Xiaosong Zhang
  • Jianbin Jiao
  • Qixiang Ye

Recognizing images with long-tailed distributions remains a challenging problem, and
an interpretable mechanism for solving it is still lacking. In this study, we formulate
Long-tailed recognition as Domain Adaptation (LDA), by modeling the long-tailed distribution
as an unbalanced domain and the general distribution as a balanced domain. Within
the balanced domain, we propose to slack the generalization error bound, which is
defined upon the empirical risks of unbalanced and balanced domains and the divergence
between them. We propose to jointly optimize empirical risks of the unbalanced and
balanced domains and approximate their domain divergence by intra-class and inter-class
distances, with the aim to adapt models trained on the long-tailed distribution to
general distributions in an interpretable way. Experiments on benchmark datasets for
image recognition, object detection, and instance segmentation validate that our LDA
approach, beyond its interpretability, achieves state-of-the-art performance.

Lesion-Inspired Denoising Network: Connecting Medical Image Denoising and Lesion Detection

  • Kecheng Chen
  • Kun Long
  • Yazhou Ren
  • Jiayu Sun
  • Xiaorong Pu

Deep learning has achieved notable performance in the denoising task of low-quality
medical images and the detection task of lesions, respectively. However, existing
low-quality medical image denoising approaches are disconnected from the detection
task of lesions. Intuitively, the quality of denoised images will influence the lesion
detection accuracy that in turn can be used to affect the denoising performance. To
this end, we propose a plug-and-play medical image denoising framework, namely Lesion-Inspired
Denoising Network (LIDnet), to collaboratively improve both denoising performance
and detection accuracy of denoised medical images. Specifically, we propose to insert
the feedback of the downstream detection task into an existing denoising framework by jointly
learning a multi-loss objective. Instead of using perceptual loss calculated on the
entire feature map, a novel region-of-interest (ROI) perceptual loss induced by the
lesion detection task is proposed to further connect these two tasks. To achieve better
optimization of the overall framework, we propose a customized collaborative training
strategy for LIDnet. In consideration of clinical usability and imaging characteristics,
three low-dose CT image datasets are used to evaluate the effectiveness of the proposed
LIDnet. Experiments show that, when equipped with LIDnet, both the denoising and
lesion detection performance of baseline methods can be significantly improved.

Domain Adaptive Semantic Segmentation without Source Data

  • Fuming You
  • Jingjing Li
  • Lei Zhu
  • Zhi Chen
  • Zi Huang

Domain adaptive semantic segmentation is recognized as a promising technique to alleviate
the domain shift between the labeled source domain and the unlabeled target domain
in many real-world applications, such as automatic pilot. However, large amounts of
source domain data often introduce significant costs in storage and training, and
sometimes the source data is inaccessible due to privacy policies. To address these
problems, we investigate domain adaptive semantic segmentation without source data,
which assumes that the model is pre-trained on the source domain and then adapted
to the target domain without accessing the source data anymore. Since there is no supervision
from the source domain data, many self-training methods tend to fall into the winner-takes-all
dilemma, where the majority classes totally dominate the segmentation networks and
the networks fail to classify the minority classes. Consequently, we propose an effective
framework for this challenging problem with two components: positive learning and
negative learning. In positive learning, we select the class-balanced pseudo-labeled
pixels with intra-class threshold, while in negative learning, for each pixel, we
investigate which category the pixel does not belong to with the proposed heuristic
complementary label selection. Notably, our framework can be easily implemented and
incorporated with other methods to further enhance the performance. Extensive experiments
on two widely-used synthetic-to-real benchmarks demonstrate our claims and the effectiveness
of our framework, which outperforms the baseline by a large margin. Code is available
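
As an illustration only (this is not the authors' released code), the two selection steps can be sketched as follows. Several details are assumptions of this sketch rather than the paper's method: the per-class threshold is implemented as "keep the top fraction of each predicted class", complementary labels are taken to be classes whose probability falls below a fixed cutoff `tau`, and the function names and probabilities are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def select_positive(probs: Sequence[Sequence[float]], keep_ratio: float) -> Dict[int, int]:
    """Class-balanced pseudo-labels: within each predicted class, keep the
    most confident fraction of pixels (at least one), so that rare classes
    are not crowded out by a single global confidence threshold."""
    by_class: Dict[int, List[Tuple[float, int]]] = {}
    for pixel, p in enumerate(probs):
        label = max(range(len(p)), key=lambda c: p[c])
        by_class.setdefault(label, []).append((p[label], pixel))
    selected: Dict[int, int] = {}
    for label, items in by_class.items():
        items.sort(reverse=True)  # most confident first
        k = max(1, int(len(items) * keep_ratio))
        for _, pixel in items[:k]:
            selected[pixel] = label
    return selected


def select_negative(probs: Sequence[Sequence[float]], tau: float) -> List[List[int]]:
    """Complementary labels: for each pixel, the classes it very likely
    does NOT belong to (probability below tau)."""
    return [[c for c, pc in enumerate(p) if pc < tau] for p in probs]


# Hypothetical softmax outputs for five pixels and three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [0.70, 0.20, 0.10],
    [0.55, 0.40, 0.05],
    [0.30, 0.50, 0.20],
]
positive = select_positive(probs, keep_ratio=0.5)
negative = select_negative(probs, tau=0.1)
```

In this toy case pixel 4 is kept for the minority class 1 even though its confidence (0.50) is lower than that of the discarded class-0 pixels, which is the intended effect of thresholding per class rather than globally.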

Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval

  • Yuchen Yang
  • Min Wang
  • Wengang Zhou
  • Houqiang Li

In this paper, we focus on the composed query image retrieval task, namely retrieving
the target images that are similar to a composed query, in which a modification text
is combined with a query image to describe a user's accurate search intention. Previous
methods usually focus on learning the joint image-text representations, but rarely
consider the intrinsic relationship among the query image, the target image and the
modification text. To address this problem, we propose a new cross-modal joint prediction
and alignment framework for composed query image retrieval. In our framework, the
modification text is regarded as an implicit transformation between the query image
and the target image. Motivated by this, not only should the combination of the query image
and modification text be similar to the target image, but the modification
text should also be predictable from the query image and the target image. We
align this relationship with a novel Joint Prediction Module (JPM). Our proposed
framework can seamlessly incorporate the JPM into the existing methods to effectively
improve the discrimination and robustness of visual and textual representations. The
experiments on three public datasets demonstrate the effectiveness of our proposed
framework, proving that our proposed JPM can be simply incorporated with the existing
methods while effectively improving the performance.

JDMAN: Joint Discriminative and Mutual Adaptation Networks for Cross-Domain Facial
Expression Recognition

  • Yingjian Li
  • Yingnan Gao
  • Bingzhi Chen
  • Zheng Zhang
  • Lei Zhu
  • Guangming Lu

Cross-domain Facial Expression Recognition (FER) is challenging due to the difficulty
of concurrently handling the domain shift and semantic gap during domain adaptation.
Existing methods mainly focus on reducing the domain discrepancy for transferable
features but fail to decrease the semantic one, which may result in negative transfer.
To this end, we propose Joint Discriminative and Mutual Adaptation Networks (JDMAN),
which collaboratively bridge the domain shift and semantic gap by domain- and category-level
co-adaptation based on mutual information and discriminative metric learning techniques.
Specifically, we design a mutual information minimization module for domain-level
adaptation, which narrows the domain shift by simultaneously distilling the domain-invariant
components and eliminating the untransferable ones lying in different domains. Moreover,
we propose a semantic metric learning module for category-level adaptation, which
can close the semantic discrepancy during discriminative intra-domain representation
learning and transferable inter-domain knowledge discovery. These two modules are
jointly leveraged in our JDMAN to safely transfer the source knowledge to target data
in an end-to-end manner. Extensive experimental results on six databases show that
our method achieves state-of-the-art performance. The code of our JDMAN is available

Improving Weakly Supervised Object Localization via Causal Intervention

  • Feifei Shao
  • Yawei Luo
  • Li Zhang
  • Lu Ye
  • Siliang Tang
  • Yi Yang
  • Jun Xiao

The recently emerged weakly-supervised object localization (WSOL) methods can learn
to localize an object in the image only using image-level labels. Previous works endeavor
to perceive the interval objects from the small and sparse discriminative attention
map, yet ignore the co-occurrence confounder (e.g., duck and water), which makes
it hard for model inspection (e.g., CAM) to distinguish between the object and context.
In this paper, we make an early attempt to tackle this challenge via causal intervention
(CI). Our proposed method, dubbed CI-CAM, explores the causalities among image features,
contexts, and categories to eliminate the biased object-context entanglement in the
class activation maps, thus improving the accuracy of object localization. Extensive
experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning
the clear object boundary from confounding contexts. Particularly, on CUB-200-2011,
which suffers severely from the co-occurrence confounder, CI-CAM significantly outperforms
the traditional CAM-based baseline (58.39% vs 52.4% in Top-1 localization accuracy).
In more general scenarios such as ILSVRC 2016, CI-CAM can also perform on par
with the state of the art.

Imbalanced Source-free Domain Adaptation

  • Xinhao Li
  • Jingjing Li
  • Lei Zhu
  • Guoqing Wang
  • Zi Huang

Conventional Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from
a well-labeled source domain to an unlabeled target domain only when data from both
domains is simultaneously accessible, which is challenged by the recent Source-free
Domain Adaptation (SFDA). However, we notice that the performance of existing SFDA
methods would be dramatically degraded by intra-domain class imbalance and inter-domain
label shift. Unfortunately, class-imbalance is a common phenomenon in real-world domain
adaptation applications. To address this issue, we present Imbalanced Source-free
Domain Adaptation (ISFDA) in this paper. Specifically, we first train a uniformed
model from the source domain, and then propose secondary label correction, curriculum
sampling, plus intra-class tightening and inter-class separation to overcome the joint
presence of covariate shift and label shift. Extensive experiments on three imbalanced
benchmarks verify that ISFDA could perform favorably against existing UDA and SFDA
methods under various conditions of class-imbalance, and outperform existing SFDA
methods by over 15% in terms of per-class average accuracy on a large-scale long-tailed
imbalanced dataset.

Learning Transferrable and Interpretable Representations for Domain Generalization

  • Zhekai Du
  • Jingjing Li
  • Ke Lu
  • Lei Zhu
  • Zi Huang

Conventional machine learning models are often vulnerable to samples with different
distributions from the ones of training samples, which is known as domain shift. Domain
Generalization (DG) challenges this issue by training a model based on multiple source
domains and generalizing it to arbitrary unseen target domains. In spite of remarkable
results made in DG, a majority of existing works lack a deep understanding of the
feature representations learned in DG models, resulting in limited generalization
ability when facing out-of-distribution domains. In this paper, we aim to learn a domain
transformation space via a domain transformer network (DTN) which explicitly mines
the relationship among multiple domains and constructs transferable feature representations
for down-stream tasks by interpreting each feature as a semantically weighted combination
of multiple domain-specific features. Our DTN is encouraged to meta-learn the properties
and characteristics of domains during the training process based on multiple seen
domains, making transformed feature representations more semantical, thus generalizing
better to unseen domains. Once the model is constructed, the feature representations
of unseen target domains can also be inferred adaptively by selectively combining
the feature representations from the diverse set of seen domains. We conduct extensive
experiments on five DG benchmarks and the results strongly demonstrate the effectiveness
of our approach.

WAS-VTON: Warping Architecture Search for Virtual Try-on Network

  • Zhenyu Xie
  • Xujie Zhang
  • Fuwei Zhao
  • Haoye Dong
  • Michael C. Kampffmeyer
  • Haonan Yan
  • Xiaodan Liang

Despite recent progress on image-based virtual try-on, current methods are constrained
by shared warping networks and thus fail to synthesize natural try-on results when
faced with clothing categories that require different warping operations. In this
paper, we address this problem by finding clothing category-specific warping networks
for the virtual try-on task via Neural Architecture Search (NAS). We introduce a NAS-Warping
Module and elaborately design a bilevel hierarchical search space to identify the
optimal network-level and operation-level flow estimation architecture. Given the
network-level search space, containing different numbers of warping blocks, and the
operation-level search space with different convolution operations, we jointly learn
a combination of repeatable warping cells and convolution operations specifically
for the clothing-person alignment. Moreover, a NAS-Fusion Module is proposed to synthesize
more natural final try-on results, which is realized by leveraging particular skip
connections to produce better-fused features that are required for seamlessly fusing
the warped clothing and the unchanged person part. We adopt an efficient and stable
one-shot searching strategy to search the above two modules. Extensive experiments
demonstrate that our WAS-VTON significantly outperforms the previous fixed-architecture
try-on methods with more natural warping results and virtual try-on results.

DFR-Net: A Novel Multi-Task Learning Network for Real-Time Multi-Instrument Segmentation

  • Yan-Jie Zhou
  • Shi-Qi Liu
  • Xiao-Liang Xie
  • Zeng-Guang Hou

In computer-assisted vascular surgery, real-time multi-instrument segmentation serves
as a pre-requisite step. However, a large amount of effort has been dedicated to single-instrument
rather than multi-instrument segmentation in computer-assisted intervention research to this day.
To fill the overlooked gap, this study introduces a Light-Weight Deep Feature Refinement
Network (DFR-Net) based on multi-task learning for real-time multi-instrument segmentation.
In this network, the proposed feature refinement module (FRM) can capture long-term
dependencies while retaining precise positional information, which helps the model locate
the foreground objects of interest. The designed channel calibration module (CCM)
can re-calibrate fusion weights of multi-level features, which helps the model balance
the importance of semantic information and appearance information. Besides, the connectivity
loss function is developed to address fractures in the wire-like structure segmentation
results. Extensive experiments on two different types of datasets consistently demonstrate
that DFR-Net can achieve state-of-the-art segmentation performance while meeting the
real-time requirements.

From Superficial to Deep: Language Bias driven Curriculum Learning for Visual Question
Answering

  • Mingrui Lao
  • Yanming Guo
  • Yu Liu
  • Wei Chen
  • Nan Pu
  • Michael S. Lew

Most Visual Question Answering (VQA) models are faced with language bias when learning
to answer a given question, thereby failing to understand multimodal knowledge simultaneously.
Based on the fact that VQA samples with different levels of language bias contribute
differently to answer prediction, in this paper, we overcome the language prior problem
by proposing a novel Language Bias driven Curriculum Learning (LBCL) approach, which
employs an easy-to-hard learning strategy with a novel difficulty metric Visual Sensitive
Coefficient (VSC). Specifically, in the initial training stage, the VQA model mainly
learns the superficial textual correlations between questions and answers (easy concept)
from more-biased examples, and then progressively focuses on learning the multimodal
reasoning (hard concept) from less-biased examples in the following stages. The examples
for each stage are selected according to our proposed difficulty
metric VSC, which evaluates the difficulty driven by the language bias of each
VQA sample. Furthermore, to avoid the catastrophic forgetting of the learned concept
during the multi-stage learning procedure, we propose to integrate knowledge distillation
into the curriculum learning framework. Extensive experiments show that our LBCL can
be generally applied to common VQA baseline models, and achieves remarkably better
performance on the VQA-CP v1 and v2 datasets, with an overall 20% accuracy boost over
baseline models.
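
The easy-to-hard schedule described above can be pictured with a minimal sketch. It assumes every sample already has a scalar difficulty score standing in for VSC (the abstract does not define how VSC is computed); the function name `curriculum_stages` and the near-equal split into stages are illustrative assumptions, not the authors' procedure.

```python
def curriculum_stages(samples, difficulty, num_stages):
    """Split samples into num_stages groups ordered from easiest to hardest.

    `difficulty` is a hypothetical callable returning a score per sample
    (lower means easier); it only stands in for the VSC metric.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=difficulty)  # easiest first
    size, extra = divmod(len(ordered), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        # The first `extra` stages take one additional sample.
        end = start + size + (1 if i < extra else 0)
        stages.append(ordered[start:end])
        start = end
    return stages
```

A training loop would then visit `stages[0]`, `stages[1]`, and so on in order, so that the more-biased (easier) samples are seen first.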

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

  • Xun Gao
  • Yin Zhao
  • Jie Zhang
  • Longjun Cai

Recognizing the emotional state of people is a basic but challenging task in video
understanding. In this paper, we propose a new task in this field, named Pairwise
Emotional Relationship Recognition (PERR). This task aims to recognize the emotional
relationship between the two interactive characters in a given video clip. It is different
from the traditional emotion and social relation recognition task. Varieties of information,
consisting of character appearance, behaviors, facial emotions, dialogues, background
music as well as subtitles contribute differently to the final results, which makes
the task more challenging but meaningful in developing more advanced multi-modal models.
To facilitate the task, we develop a new dataset called Emotional RelAtionship of
inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal
dataset for the PERR task, with 31,182 video clips lasting about 203 hours in total.
Different from the existing datasets, ERATO contains interaction-centric videos with
multi-shots, varied video length, and multiple modalities including visual, audio
and text. As a minor contribution, we propose a baseline model built on a Synchronous
Modal-Temporal Attention (SMTA) unit to fuse the multi-modal information for the PERR
task. Compared with other prevailing attention mechanisms, our proposed SMTA can
steadily improve the performance by about 1%. We expect ERATO and our proposed
SMTA to open up a new direction for the PERR task in video understanding and to further
advance research on multi-modal fusion methodology.

Block Popularity Prediction for Multimedia Storage Systems Using Spatial-Temporal-Sequential
Neural Networks

  • Yingying Cheng
  • Fan Zhang
  • Gang Hu
  • Yiwen Wang
  • Hanhui Yang
  • Gong Zhang
  • Zhuo Cheng

Predicting block popularity is of crucial importance for data placement in multi-tiered
multimedia storage systems. Traditional methods, such as least recently used and exponential
smoothing, are commonly employed to predict future block access frequencies but fail
to achieve good performance for complex and changing access patterns. Recently, deep
neural networks have brought great success to pattern recognition and prediction,
which motivates us to introduce deep learning to solve the problem of block popularity
prediction. In this paper, we first analyze and verify the temporal and spatial correlations
among the multimedia I/O traces. Then, we design a multi-dimension feature to capture
such correlations, which serves as the input of the designed deep neural network.
A spatial-temporal-sequential neural network (STSNN) and its variants that capture
the locality information, time dependency information, and block sequential information
are proposed to predict the block popularity. We systematically evaluate our STSNN
models against six baseline models from three different categories, i.e., heuristic
methods, regression methods and neural network-based methods. Experimental results show
that our proposed STSNN models are very promising for predicting block access frequencies
on some Huawei and Microsoft datasets and, in particular, achieve 2-6 times better
performance than the baselines in terms of the I/O hit ratio, I/O recall
rate and I/O prediction ratio on the Microsoft 64 MB-block dataset.
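
For context, the exponential smoothing baseline named above can be written in a few lines. This is a generic sketch of simple exponential smoothing applied to per-interval access counts of one block; the function name, the choice of the first observation as the initial forecast, and the value of `alpha` are assumptions, not settings taken from the paper.

```python
def exponential_smoothing(counts, alpha):
    """Forecast the next access count of a block from its past counts.

    counts: access counts per time interval, oldest first.
    alpha:  smoothing factor in [0, 1]; larger values favour recent intervals.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    forecast = float(counts[0])  # assumed initialisation: first observation
    for observed in counts[1:]:
        forecast = alpha * observed + (1.0 - alpha) * forecast
    return forecast
```

Because the forecast depends only on a single block's own history, a heuristic like this cannot use the spatial correlations between neighbouring blocks that the abstract highlights.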

Transferrable Contrastive Learning for Visual Domain Adaptation

  • Yang Chen
  • Yingwei Pan
  • Yu Wang
  • Ting Yao
  • Xinmei Tian
  • Tao Mei

Self-supervised learning (SSL) has recently become the favorite among feature learning
methodologies. It is therefore appealing for domain adaptation approaches to consider
incorporating SSL. The intuition is to enforce instance-level feature consistency
such that the predictor becomes somehow invariant across domains. However, most existing
SSL methods in the regime of domain adaptation usually are treated as standalone auxiliary
components, leaving the signatures of domain adaptation unattended. Actually, the
optimal region where the domain gap vanishes and the instance level constraint that
SSL pursues may not coincide at all. From this point, we present a particular paradigm
of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive
Learning (TCL), which links the SSL and the desired cross-domain transferability congruently.
We find contrastive learning intrinsically a suitable candidate for domain adaptation,
as its instance invariance assumption can be conveniently promoted to cross-domain
class-level invariance favored by domain adaptation tasks. Based on particular memory
bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class
domain discrepancy between source and target through a clean and novel contrastive
loss. The free lunch is, thanks to the incorporation of contrastive learning, TCL
relies on a moving-averaged key encoder that naturally achieves a temporally ensembled
version of pseudo labels for target data, which avoids pseudo label error propagation
at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive
experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet)
for both single-source and multi-source domain adaptation tasks, TCL has demonstrated
state-of-the-art performances.
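
The moving-averaged key encoder mentioned above is commonly realised as an exponential moving average of the query encoder's parameters. The sketch below assumes that form; the abstract does not state the update rule, and the flat parameter lists, the function name and the default momentum are illustrative assumptions.

```python
def momentum_update(key_params, query_params, momentum=0.999):
    """Return key-encoder parameters after one moving-average step.

    Each key parameter moves a small step towards the matching query
    parameter, so the key encoder changes slowly and smoothly over time.
    """
    if len(key_params) != len(query_params):
        raise ValueError("parameter lists must have the same length")
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]
```

Since the key encoder averages over many past states of the query encoder, predictions made with it behave like a temporal ensemble, which is the property the abstract credits for stabilising pseudo labels.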

Weighted Gaussian Loss based Hamming Hashing

  • Rong-Cheng Tu
  • Xian-Ling Mao
  • Cihang Kong
  • Zihang Shao
  • Ze-Lin Li
  • Wei Wei
  • Heyan Huang

Recently, deep Hamming hashing methods have been proposed for Hamming space retrieval
which enables constant-time search by hash table lookups instead of linear scan. When
carrying out Hamming space retrieval, for each query datapoint, there is a Hamming
ball centered on the query datapoint, and only the datapoints within the Hamming ball
are returned as the relevant ones, while those beyond are discarded directly. Thus,
to further enhance the retrieval performance, it is key for Hamming hashing
methods to reduce the number of dissimilar datapoints within the Hamming ball. However, nearly
all existing Hamming hashing methods cannot effectively penalize the dissimilar pairs
within the Hamming ball to push them out. To tackle this problem, in this paper, we
propose a novel Weighted Gaussian Loss based Hamming Hashing, called WGLHH, which
introduces a weighted Gaussian loss to optimize the hashing model. Specifically, the weighted
Gaussian loss consists of three parts: a novel Gaussian-distribution based loss, a
novel badly-trained-pair attention mechanism and a quantization loss. The Gaussian-distribution
based loss is proposed to effectively penalize the dissimilar pairs within the Hamming
ball. The badly-trained-pair attention mechanism is proposed to assign a weight for
each data pair, which puts more weight on data pairs whose corresponding hash codes
cannot preserve original similarity well, and less on those already handled
well. The quantization loss is used to reduce the quantization error. By incorporating
the three parts, the proposed weighted Gaussian loss significantly penalizes
the dissimilar pairs within the Hamming ball to generate more compact hashing codes.
Extensive experiments on two benchmark datasets show that the proposed method outperforms
the state-of-the-art baselines in the image retrieval task.
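
Hamming space retrieval by hash table lookup, as described at the start of this abstract, can be sketched as follows. Hash codes are represented as Python integers; the function names, the 4-bit codes in the usage note and the default radius of 2 are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_table(codes):
    """Map each binary hash code (an int) to the indices of items that have it."""
    table = defaultdict(list)
    for index, code in enumerate(codes):
        table[code].append(index)
    return table


def hamming_ball_lookup(table, query, num_bits, radius=2):
    """Return indices of items whose code is within `radius` bits of `query`.

    Every code inside the Hamming ball is enumerated by flipping up to
    `radius` bits of the query, and each one is probed in the table.
    """
    hits = []
    for flips in range(radius + 1):
        for positions in combinations(range(num_bits), flips):
            probe = query
            for position in positions:
                probe ^= 1 << position  # flip one bit of the query code
            hits.extend(table.get(probe, []))
    return sorted(hits)
```

With codes `[0b0000, 0b0001, 0b0011, 0b0111, 0b1111]` and query `0b0000`, a radius-2 lookup returns the first three items and discards the rest, regardless of whether those three are truly similar. The number of probes depends on the code length and radius, not on the database size.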

Domain-Aware SE Network for Sketch-based Image Retrieval with Multiplicative Euclidean
Margin Softmax

  • Peng Lu
  • Gao Huang
  • Hangyu Lin
  • Wenming Yang
  • Guodong Guo
  • Yanwei Fu

This paper proposes a novel approach for Sketch-Based Image Retrieval (SBIR), for
which the key is to bridge the gap between sketches and photos in terms of the data
representation. Inspired by channel-wise attention explored in recent years, we present
a Domain-Aware Squeeze-and-Excitation (DASE) network, which seamlessly incorporates
the prior knowledge of whether a sample is a sketch or a photo into the SE module and makes
the SE module capable of emphasizing appropriate channels according to the domain signal. Accordingly,
the proposed network can switch its mode to achieve a better domain feature with lower
intra-class discrepancy. Moreover, while previous works simply focus on minimizing
intra-class distance and maximizing inter-class distance, we introduce a loss function,
named Multiplicative Euclidean Margin Softmax (MEMS), which introduces a multiplicative
Euclidean margin into the feature space and ensures that the maximum intra-class distance
is smaller than the minimum inter-class distance. This facilitates learning a highly
discriminative feature space and ensures a more accurate image retrieval result. Extensive
experiments are conducted on two widely used SBIR benchmark datasets. Our approach
achieves better results on both datasets, surpassing the state-of-the-art methods
by a large margin.

FTAFace: Context-enhanced Face Detector with Fine-grained Task Attention

  • Deyu Wang
  • Dongchao Wen
  • Wei Tao
  • Lingxiao Yin
  • Tse-Wei Chen
  • Tadayuki Ito
  • Kinya Osa
  • Masami Kato

In face detection, it is a common strategy to treat samples differently according
to their difficulty for balancing training data distribution. However, we observe
that widely used sampling strategies, such as OHEM and Focal loss, can lead to the
performance imbalance between different tasks (e.g., classification and localization).
Through analysis, we point out that, because they are driven by classification information,
these sample-based strategies have difficulty coordinating the attention of different
tasks during training, thus leading to the above imbalance. Accordingly, we first
confirm this by shifting the attention from the sample level to the task level. Then,
we propose a fine-grained task attention method, a.k.a. FTA, including inter-task importance
and intra-task importance, which adaptively adjusts the attention of each item in
the task from both global and local perspectives, so as to achieve finer optimization.
In addition, we introduce a transformer as a feature enhancer to assist our convolutional
network, and propose a context enhancement transformer, a.k.a. CET, to mine the spatial
relationship in the features towards more robust feature representation. Extensive
experiments on WiderFace and FDDB benchmarks demonstrate that our method significantly
boosts the baseline performance by 2.7%, 2.3% and 4.9% on easy, medium and hard validation
sets respectively. Furthermore, the proposed FTAFace-light achieves higher accuracy
than the state-of-the-art and reduces the amount of computation by 28.9%.

Identity-aware Graph Memory Network for Action Detection

  • Jingcheng Ni
  • Jie Qin
  • Di Huang

Action detection plays an important role in high-level video understanding and media
interpretation. Many existing studies fulfill this spatio-temporal localization by
modeling the context, capturing the relationship of actors, objects, and scenes conveyed
in the video. However, they often universally treat all the actors without considering
the consistency and distinctness between individuals, leaving much room for improvement.
In this paper, we explicitly highlight the identity information of the actors in terms
of both long-term and short-term context through a graph memory network, namely identity-aware
graph memory network (IGMN). Specifically, we propose the hierarchical graph neural
network (HGNN) to comprehensively conduct long-term relation modeling within the same
identity as well as between different ones. Regarding short-term context, we develop
a dual attention module (DAM) to generate an identity-aware constraint that reduces the
influence of interference from actors of different identities. Extensive experiments
on the challenging AVA dataset demonstrate the effectiveness of our method, which
achieves state-of-the-art results on AVA v2.1 and v2.2.

Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose

  • Wenkang Shan
  • Haopeng Lu
  • Shanshe Wang
  • Xinfeng Zhang
  • Wen Gao

Most of the existing 3D human pose estimation approaches mainly focus on predicting
3D positional relationships between the root joint and other human joints (local motion)
instead of the overall trajectory of the human body (global motion). Despite the great
progress achieved by these approaches, they are not robust to global motion, and lack
the ability to accurately predict local motion with a small movement range. To alleviate
these two problems, we propose a relative information encoding method that yields
positional and temporal enhanced representations. First, we encode positional information
by utilizing relative coordinates of 2D poses to enhance the consistency between the
input and output distribution. The same posture with different absolute 2D positions
can be mapped to a common representation, which helps resist the interference
of global motion on the prediction results. Second, we encode temporal information
by establishing the connection between the current pose and other poses of the same
person within a period of time. More attention will be paid to the movement changes
before and after the current pose, resulting in better prediction performance on local
motion with a small movement range. The ablation studies validate the effectiveness
of the proposed relative information encoding method. Besides, we introduce a multi-stage
optimization method to the whole framework to further exploit the positional and temporal
enhanced representations. Our method outperforms state-of-the-art methods on two public
datasets. Code is available at
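
The idea that one posture at different absolute 2D positions maps to a common representation can be illustrated by expressing every joint relative to a root joint. This is a minimal sketch of root-relative coordinates only; the paper's actual encoding may combine further terms, and the function name and the choice of joint 0 as root are assumptions.

```python
def to_relative(pose_2d, root_index=0):
    """Express each 2D joint relative to the root joint of the same pose.

    pose_2d: list of (x, y) joint coordinates in image space.
    """
    root_x, root_y = pose_2d[root_index]
    return [(x - root_x, y - root_y) for x, y in pose_2d]
```

Translating the whole pose by any offset leaves the output unchanged, so global motion of the body no longer alters the input seen by the pose estimator.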

Deep Neural Network Retrieval

  • Nan Zhong
  • Zhenxing Qian
  • Xinpeng Zhang

With the rapid development of deep learning-based techniques, the general public can
use a lot of "machine learning as a service" (MLaaS), which provides end-to-end machine
learning solutions. Taking the image classification task as an example, users only
need to upload their dataset and labels to MLaaS, without requiring specific knowledge
of machine learning or a concrete structure of the classifier. Afterward, MLaaS returns
a well-trained classifier to them. In this paper, we explore a potential novel task
named "deep neural network retrieval" and its application which helps MLaaS to save
computation resources. MLaaS usually owns a huge amount of well-trained models for
various tasks and datasets. If a user requires a task that is similar to one that has
been finished previously, MLaaS can quickly retrieve a model rather than training
from scratch. We propose a pragmatic solution and two different approaches to extract
the semantic feature of a DNN, which represents its function and is analogous to
the usage of word2vec in natural language processing. The semantic feature of a DNN
can be expressed as a vector by feeding some well-designed litmus images into the
DNN or as a matrix by reversely constructing the most desired input of the DNN. Both methods
can consider the topological information and parameters of the DNN simultaneously.
Extensive experiments, including multiple datasets and networks, also demonstrate
the efficiency of our method and show the high accuracy of deep neural network retrieval.
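
Once each stored model is summarised by a semantic feature vector, retrieval reduces to a nearest-neighbour search. The sketch below assumes such vectors are already available and uses cosine similarity as the matching score; the similarity measure, the function names and the example model names are assumptions, since the abstract does not specify how features are compared.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_model(query_vector, model_vectors):
    """Return the name of the stored model most similar to the query.

    model_vectors: dict mapping a (hypothetical) model name to its
    semantic feature vector.
    """
    return max(model_vectors,
               key=lambda name: cosine_similarity(query_vector, model_vectors[name]))
```

For example, with stored vectors `{"digits": [1.0, 0.0, 0.0], "animals": [0.0, 1.0, 0.2]}`, the query `[0.1, 0.9, 0.1]` retrieves `"animals"`.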

Adversarial Learning with Mask Reconstruction for Text-Guided Image Inpainting

  • Xingcai Wu
  • Yucheng Xie
  • Jiaqi Zeng
  • Zhenguo Yang
  • Yi Yu
  • Qing Li
  • Wenyin Liu

Text-guided image inpainting aims to complete corrupted patches coherently with
both visual and textual context. On one hand, existing works focus on surrounding
pixels of the corrupted patches without considering the objects in the image, resulting
in the characteristics of objects described in text being painted on non-object regions.
On the other hand, the redundant information in text may distract the generation of
objects of interest in the restored image. In this paper, we propose an adversarial
learning framework with mask reconstruction (ALMR) for image inpainting with textual
guidance, which consists of a two-stage generator and dual discriminators. The two-stage
generator aims to restore coarse-grained and fine-grained images, respectively. In
particular, we devise a dual-attention module (DAM) to incorporate the word-level
and sentence-level textual features as guidance on generating the coarse-grained and
fine-grained details in the two stages. Furthermore, we design a mask reconstruction
module (MRM) to penalize the restoration of the objects of interest with the given
textual descriptions about the objects. For adversarial training, we exploit global
and local discriminators for the whole image and corrupted patches, respectively.
Extensive experiments conducted on CUB-200-2011, Oxford-102 and CelebA-HQ show the
superiority of the proposed ALMR (e.g., FID value is reduced from 29.69 to 14.69
compared with the state-of-the-art approach on CUB-200-2011). Codes are available

Spatiotemporal Inconsistency Learning for DeepFake Video Detection

  • Zhihao Gu
  • Yang Chen
  • Taiping Yao
  • Shouhong Ding
  • Jilin Li
  • Feiyue Huang
  • Lizhuang Ma

The rapid development of facial manipulation techniques has aroused public concerns
in recent years. Following the success of deep learning, existing methods always formulate
DeepFake video detection as a binary classification problem and develop frame-based
and video-based solutions. However, little attention has been paid to capturing the
spatial-temporal inconsistency in forged videos. To address this issue, we formulate this
task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it
as a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a
Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically,
we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference
over adjacent frames along both horizontal and vertical directions. The ISM
simultaneously utilizes the spatial information from SIM and temporal information
from TIM to establish a more comprehensive spatial-temporal representation. Moreover,
our STIL block is flexible and could be plugged into existing 2D CNNs. Extensive experiments
and visualizations are presented to demonstrate the effectiveness of our method against
the state-of-the-art competitors.
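
The basic ingredient of the temporal modeling described above, a difference over adjacent frames, can be sketched on plain nested lists. This shows only the frame-to-frame subtraction; the directional (horizontal and vertical) processing inside TIM operates on learned features and is not reproduced here, and the function name is an assumption.

```python
def temporal_differences(frames):
    """Return element-wise differences between each pair of adjacent frames.

    frames: list of frames, each a list of rows of numeric values.
    The result has one entry fewer than `frames`.
    """
    return [
        [[later - earlier for earlier, later in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev_frame, next_frame)]
        for prev_frame, next_frame in zip(frames, frames[1:])
    ]
```

Regions that stay constant between frames produce zeros, so the non-zero entries indicate where content changes over time.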

VeloCity: Using Voice Assistants for Cyclists to Provide Traffic Reports

  • Gian-Luca Savino
  • Jessé Moraes Braga
  • Johannes Schöning

Cycling is on the rise as a relevant alternative to car-based mobility and even though
there are mobile applications specifically designed for cyclists to support this development,
many still face unresolved challenges in terms of safe user interaction with complex
data while riding. We present the design, development, and evaluation of VeloCity
- an application for reporting traffic incidents and structures relevant to cyclists.
In a case study, we compared its three input methods (touch, in-app speech recognition,
the voice assistant of the operating system) to evaluate which attributes make for
safe interaction while cycling. We found that participants prefer to use the voice
assistant over the other modalities as it was least distracting due to its hands-
and eyes-free interaction design. Furthermore, they chose short commands over conversational
phrases. Based on our results, we present five guidelines for designing voice user
interfaces for cyclists and argue for moving away from touch-based interfaces in this
domain, which still make up most of the applied interaction techniques today.

Edit Like A Designer: Modeling Design Workflows for Unaligned Fashion Editing

  • Qiyu Dai
  • Shuai Yang
  • Wenjing Wang
  • Wei Xiang
  • Jiaying Liu

Fashion editing has drawn increasing research interest owing to its broad application
prospects. Instead of directly manipulating the real fashion item image, it is more
intuitive for designers to modify it via the design draft. In this paper, we model
design workflows for a novel task of unaligned fashion editing, allowing the user
to edit a fashion item through manipulating its corresponding design draft. The challenge
lies in the large misalignment between the real fashion item and the design draft,
which could severely degrade the quality of editing results. To address this issue,
we propose an Unaligned Fashion Editing Network (UFE-Net). A coarsely rendered fashion
item is firstly generated from the edited design draft via a translation module. With
this as guidance, we align and manipulate the original unedited fashion item via a
novel alignment-driven fashion editing module, and then optimize the details and shape
via a reference-guided refinement module. Furthermore, a joint training strategy is
introduced to exploit the synergy between the alignment and editing tasks. Our UFE-Net
enables the edited fashion item to have a geometric shape and realistic details that are
semantically consistent with the edited draft in the edited region, while keeping the
unedited region intact. Experiments demonstrate our superiority over the competing
methods on unaligned fashion editing.

Privacy-Preserving Portrait Matting

  • Jizhizi Li
  • Sihan Ma
  • Jing Zhang
  • Dacheng Tao

Recently, there has been an increasing concern about the privacy issue raised by using
personally identifiable information in machine learning. However, previous portrait
matting methods were all based on identifiable portrait images. To fill the gap, we
present P3M-10k in this paper, which is the first large-scale anonymized benchmark
for Privacy-Preserving Portrait Matting. P3M-10k consists of 10,000 high-resolution
face-blurred portrait images along with high-quality alpha mattes. We systematically
evaluate both trimap-free and trimap-based matting methods on P3M-10k and find that
existing matting methods show different generalization capabilities when following
the Privacy-Preserving Training (PPT) setting, i.e., training on face-blurred images
and testing on arbitrary images. To devise a better trimap-free portrait matting model,
we propose P3M-Net, which leverages the power of a unified framework for both semantic
perception and detail matting, and specifically emphasizes the interaction between
them and the encoder to facilitate the matting process. Extensive experiments on P3M-10k
demonstrate that P3M-Net outperforms the state-of-the-art methods in terms of both
objective metrics and subjective visual quality. Besides, it shows good generalization
capacity under the PPT setting, confirming the value of P3M-10k for facilitating future
research and enabling potential real-world applications. The source code and dataset
are available at

A Transformer based Approach for Image Manipulation Chain Detection

  • Jiaxiang You
  • Yuanman Li
  • Jiantao Zhou
  • Zhongyun Hua
  • Weiwei Sun
  • Xia Li

Image manipulation chain detection aims to identify the existence of involved operations
and also their orders, playing an important role in multimedia forensics and image
analysis. However, all the existing algorithms model the manipulation chain detection
as a classification problem, and can only detect chains containing up to two operations.
Due to the exponentially increased solution space and the complex interactions among
operations, how to reveal a long chain from a processed image remains a long-standing
problem in the multimedia forensic community. To address this challenge, in this paper,
we propose a new direction for manipulation chain detection. Different from previous
works, we treat the manipulation chain detection as a machine translation problem
rather than a classification one, where we model the chains as the sentences of a
target language, and each word serves as one possible image operation. Specifically,
we first transform the manipulated image into a deep feature space, and further model
the traces left by the manipulation chain as a sentence of a latent source language.
Then, we propose to detect the manipulation chain through learning the mapping from
the source language to the target one under a machine translation framework. Our method
can detect manipulation chains consisting of up to five operations, and we obtain
promising results on both the short-chain detection and the long-chain detection.
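
The growth of the solution space with chain length can be made concrete with a small count. The sketch assumes each operation appears at most once in a chain and that order matters; both are simplifying assumptions, as are the function name and the choice of five candidate operations in the note below.

```python
from math import perm


def count_chains(num_operations, max_length):
    """Count ordered chains of distinct operations with length 1..max_length."""
    return sum(perm(num_operations, length)
               for length in range(1, max_length + 1))
```

Under these assumptions, five candidate operations give 25 possible chains of length up to two but 325 of length up to five. A classifier needs one output class per chain, whereas a translation-style decoder only needs a vocabulary with one token per operation.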

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

  • Peng Wu
  • Xiangteng He
  • Mingqian Tang
  • Yiliang Lv
  • Jing Liu

Video-text retrieval is an important yet challenging task in vision-language understanding,
which aims to learn a joint embedding space where related video and text instances
are close to each other. Most current works simply measure the video-text similarity
based on video-level and text-level embeddings. However, the neglect of more fine-grained
or local information causes the problem of insufficient representation. Some works
exploit the local details by disentangling sentences, but overlook the corresponding
videos, causing the asymmetry of video-text representation. To address the above limitations,
we propose a Hierarchical Alignment Network (HANet) to align different level representations
for video-text matching. Specifically, we first decompose video and text into three
semantic levels, namely event (video and text), action (motion and verb), and entity
(appearance and noun). Based on these, we naturally construct hierarchical representations
in the individual-local-global manner, where the individual level focuses on the alignment
between frame and word, local level focuses on the alignment between video clip and
textual context, and global level focuses on the alignment between the whole video
and text. Different level alignments capture fine-to-coarse correlations between video
and text, and take advantage of the complementary information among the three
semantic levels. Besides, our HANet is also richly interpretable by explicitly learning
key semantic concepts. Extensive experiments on two public datasets, namely MSR-VTT
and VATEX, show the proposed HANet outperforms other state-of-the-art methods, which
demonstrates the effectiveness of hierarchical representation and alignment. Our code
is publicly available at

Scalable Multi-view Subspace Clustering with Unified Anchors

  • Mengjing Sun
  • Pei Zhang
  • Siwei Wang
  • Sihang Zhou
  • Wenxuan Tu
  • Xinwang Liu
  • En Zhu
  • Changjian Wang

Multi-view subspace clustering has received widespread attention to effectively fuse
multi-view information among multimedia applications. Considering that most existing
approaches' cubic time complexity makes them challenging to apply to realistic large-scale
scenarios, some researchers have addressed this challenge by sampling anchor points
to capture distributions in different views. However, the separation of the heuristic
sampling and clustering process leads to weakly discriminative anchor points. Moreover,
the complementary multi-view information has not been well utilized since the graphs
are constructed independently by the anchors from the corresponding views. To address
these issues, we propose a Scalable Multi-view Subspace Clustering with Unified Anchors
(SMVSC). To be specific, we combine anchor learning and graph construction into a
unified optimization framework. Therefore, the learned anchors can represent the actual
latent data distribution more accurately, leading to a more discriminative clustering
structure. Most importantly, the linear time complexity of our proposed algorithm
allows the multi-view subspace clustering approach to be applied to large-scale data.
Then, we design a four-step alternating optimization algorithm with proven convergence.
Compared with state-of-the-art multi-view subspace clustering methods and large-scale
oriented methods, the experimental results on several datasets demonstrate that our
SMVSC method achieves comparable or better clustering performance much more efficiently.
The code of SMVSC is available at

PRNet: A Progressive Recovery Network for Revealing Perceptually Encrypted Images

  • Tao Xiang
  • Ying Yang
  • Shangwei Guo
  • Hangcheng Liu
  • Hantao Liu

Perceptual encryption is an efficient way of protecting image content by only selectively
encrypting a portion of significant data in plain images. Existing security analysis
of perceptual encryption usually resorts to traditional cryptanalysis techniques,
which require heavy manual work and strict prior knowledge of encryption schemes.
In this paper, we introduce a new end-to-end method of analyzing the visual security
of perceptually encrypted images, without any manual work or any prior knowledge
of the encryption scheme. Specifically, by leveraging convolutional neural networks
(CNNs), we propose a progressive recovery network (PRNet) to recover visual content
from perceptually encrypted images. Our PRNet is stacked with several dense attention
recovery blocks (DARBs), where each DARB contains two branches: feature extraction
branch and image recovery branch. These two branches cooperate to rehabilitate more
detailed visual information and generate efficient feature representation via densely
connected structure and dual-saliency mechanism. We conduct extensive experiments
to demonstrate that PRNet works on different perceptual encryption schemes with different
settings, and the results show that PRNet significantly outperforms the state-of-the-art
CNN-based image restoration methods.

FakeTagger: Robust Safeguards against DeepFake Dissemination via Provenance Tracking

  • Run Wang
  • Felix Juefei-Xu
  • Meng Luo
  • Yang Liu
  • Lina Wang

In recent years, DeepFake has become a common threat to our society, due to the remarkable
progress of generative adversarial networks (GANs) in image synthesis. Unfortunately,
existing studies, which propose various approaches for fighting against DeepFake and
determining whether a facial image is real or fake, are still at an early stage. Evidently,
current DeepFake detection methods struggle to keep up with the rapid progress of GANs,
especially in the adversarial scenarios where attackers can evade the detection intentionally,
such as adding perturbations to fool the DNN-based detectors. While passive detection
simply tells whether the image is fake or real, DeepFake provenance, on the other
hand, provides clues for tracking the sources in DeepFake forensics. Thus, the tracked
fake images could be blocked immediately by administrators, avoiding further spread
in social networks.

In this paper, we investigate the potential of image tagging in serving DeepFake
provenance tracking. Specifically, we devise a deep learning-based approach, named
FakeTagger, with a simple yet effective encoder and decoder design along with channel
coding to embed a message into the facial image, so that the embedded message can be
recovered with high confidence after various drastic GAN-based DeepFake transformations. The
embedded message could be employed to represent the identity of facial images, which
further contributes to DeepFake detection and provenance. Experimental results demonstrate
that our proposed approach could recover the embedded message with an average accuracy
of more than 95% over the four common types of DeepFakes. Our research finding confirms
effective privacy-preserving techniques for protecting personal photos from being

Discriminative Latent Semantic Graph for Video Captioning

  • Yang Bai
  • Junyan Wang
  • Yang Long
  • Bingzhang Hu
  • Yang Song
  • Maurice Pagnucco
  • Yu Guan

Video captioning aims to automatically generate natural language sentences that can
describe the visual contents of a given video. Existing generative models like encoder-decoder
frameworks cannot explicitly explore the object-level interactions and frame-level
information from complex spatio-temporal data to generate semantic-rich captions.
Our main contribution is to identify three key problems in a joint framework for future
video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional
Graph that can fuse spatio-temporal information into latent object proposal. 2) Visual
Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words
with higher semantic levels. 3) Sentence Validation: A novel Discriminative Language
Validator is proposed to verify generated captions so that key semantic concepts can
be effectively preserved. Our experiments on two public datasets (MSVD and MSR-VTT)
show significant improvements over state-of-the-art approaches on all metrics,
especially for BLEU-4 and CIDEr. Our code is available at

From Image to Imuge: Immunized Image Generation

  • Qichao Ying
  • Zhenxing Qian
  • Hang Zhou
  • Haisheng Xu
  • Xinpeng Zhang
  • Siyi Li

We introduce Imuge, a tamper-resilient generative scheme for image self-recovery.
Traditional methods of concealing image content within the image itself are inflexible
and fragile under diverse digital attacks, e.g., image cropping and JPEG compression. To
address this issue, we jointly train a U-Net-based encoder, a tamper localization
network and a decoder for image recovery. Given an original image, the encoder produces
a visually indistinguishable immunized image. At the recipient's side, the verifying
network localizes the malicious modifications, and the original content can be approximately
recovered by the decoder, despite the presence of the attacks. Several strategies
are proposed to boost the training efficiency. We demonstrate that our method can
recover the details of the tampered regions with a high quality despite the presence
of various kinds of attacks. Comprehensive ablation studies are conducted to validate
our network designs.

Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting

  • Sravya Vardhani Shivapuja
  • Mansi Pradeep Khamkar
  • Divij Bajaj
  • Ganesh Ramakrishnan
  • Ravi Kiran Sarvadevabhatla

Datasets for training crowd counting deep networks are typically heavy-tailed in count
distribution and exhibit discontinuities across the count range. As a result, the
de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable
indicators of performance across the count range. To address these concerns in a holistic
manner, we revise processes at various stages of the standard crowd counting pipeline.
To enable principled and balanced minibatch sampling, we propose a novel smoothed
Bayesian sample stratification approach. We propose a novel cost function which can
be readily incorporated into existing crowd counting deep networks to encourage strata-aware
optimization. We analyze the performance of representative crowd counting approaches
across standard datasets at the per-stratum level and in aggregate, and demonstrate
that our proposed
modifications noticeably reduce error standard deviation. Our contributions represent
a nuanced, statistically balanced and fine-grained characterization of performance
for crowd counting approaches.
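
To make the idea of strata-aware sampling and evaluation concrete, the following is a minimal sketch and not the authors' method: it uses plain equal-width binning over the count range as a stand-in for the smoothed Bayesian stratification described in the abstract. The function names (`assign_strata`, `balanced_minibatch`, `per_stratum_mae`) are hypothetical.

```python
import random
from collections import defaultdict


def assign_strata(counts, num_bins):
    """Assign each sample index to an equal-width bin over the observed count range.

    Assumption: equal-width bins replace the paper's Bayesian stratification.
    """
    lo, hi = min(counts), max(counts)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all counts are equal
    strata = defaultdict(list)
    for idx, c in enumerate(counts):
        b = min(int((c - lo) / width), num_bins - 1)  # clamp the maximum into the last bin
        strata[b].append(idx)
    return strata


def balanced_minibatch(strata, batch_size, rng):
    """Draw a minibatch by cycling over non-empty strata, sampling with replacement."""
    bins = sorted(strata)
    return [rng.choice(strata[bins[i % len(bins)]]) for i in range(batch_size)]


def per_stratum_mae(counts, preds, strata):
    """Mean absolute error reported separately for each stratum."""
    return {
        b: sum(abs(counts[i] - preds[i]) for i in idxs) / len(idxs)
        for b, idxs in strata.items()
    }


counts = [1, 2, 3, 4, 5, 500, 1000]          # heavy-tailed toy counts
strata = assign_strata(counts, num_bins=4)
batch = balanced_minibatch(strata, batch_size=6, rng=random.Random(0))
```

In this toy example the two rare, high-count images each appear twice in a batch of six, whereas uniform sampling would usually be dominated by the five low-count images.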

Demystifying Commercial Video Conferencing Applications

  • Insoo Lee
  • Jinsung Lee
  • Kyunghan Lee
  • Dirk Grunwald
  • Sangtae Ha

Video conferencing applications have seen explosive growth both in the number of available
applications and their use. However, there have been few studies on the detailed analysis
of video conferencing applications with respect to network dynamics, yet understanding
these dynamics is essential for network design and improving these applications. In
this paper, we carry out an in-depth measurement and modeling study on the rate control
algorithms used in six popular commercial video conferencing applications. Based on
macroscopic behaviors commonly observed across these applications in our extensive
measurements, we construct a unified architecture to model the rate control mechanisms
of individual applications. We then reconstruct each application's rate control by
inferring key parameters that closely follow its rate control and quality adaptation
behaviors. To our knowledge, this is the first work that reverse-engineers rate control
algorithms of popular video conferencing applications, which are often unknown or
hidden as they are proprietary software. We confirm our analysis and models using
an end-to-end testbed that can capture the dynamics of each application under a variety
of network conditions. We also show how we can use these models to gain insights into
the particular behaviors of an application in two practical scenarios.

LightFEC: Network Adaptive FEC with a Lightweight Deep-Learning Approach

  • Han Hu
  • Sheng Cheng
  • Xinggong Zhang
  • Zongming Guo

Real-time video streaming is attracting more interest than ever. To deal with
packet loss and optimize users' Quality of Experience (QoE), forward error
correction (FEC) has been studied and applied extensively. The performance of FEC
depends on whether the future loss pattern is precisely predicted, yet previous
research has not provided a robust packet loss prediction method. In this work,
we propose LightFEC to make accurate and fast predictions of packet loss patterns. By
applying long short-term memory (LSTM) networks, clustering algorithms and model compression
methods, LightFEC is able to accurately predict packet loss in various network conditions
without consuming too much time. Our experiments show
that LightFEC outperforms other schemes on prediction accuracy, which
improves the packet recovery ratio while keeping the redundancy ratio at a low level.
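
The link between loss prediction and redundancy can be illustrated with a small sketch. It assumes an ideal erasure code (any k of the k + r packets in a block suffice to recover the k data packets, as with Reed-Solomon codes) and takes the predicted loss rate as given; the LSTM predictor itself is not modelled. All function names and the definition of the redundancy ratio as r / k are assumptions for illustration.

```python
import math


def parity_needed(k, loss_rate):
    """Smallest parity count r such that a block of k + r packets tolerates
    the expected number of losses at the predicted loss rate."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    r = 0
    while math.ceil((k + r) * loss_rate) > r:
        r += 1
    return r


def recoverable(k, r, lost):
    """With an ideal erasure code, a block survives if at most r packets are lost.

    k is kept in the signature for readability; it does not affect the result.
    """
    return lost <= r


def redundancy_ratio(k, r):
    """Parity overhead relative to the data packets (assumed definition)."""
    return r / k


r = parity_needed(k=10, loss_rate=0.1)
```

An over-estimated loss rate inflates `redundancy_ratio`, while an under-estimated one makes `recoverable` fail, which is why prediction accuracy matters for both metrics named in the abstract.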

SOGAN: 3D-Aware Shadow and Occlusion Robust GAN for Makeup Transfer

  • Yueming Lyu
  • Jing Dong
  • Bo Peng
  • Wei Wang
  • Tieniu Tan

In recent years, virtual makeup applications have become more and more popular. However,
robust makeup transfer in real-world environments remains challenging. Current makeup
transfer methods mostly work well on clean makeup images captured under good conditions,
but their results on makeup that exhibits shadow and occlusion are
unsatisfactory. To address this, we propose a novel makeup transfer method, called
3D-Aware Shadow and Occlusion Robust GAN (SOGAN). Given the source and the reference
faces, we first fit a 3D face model and then disentangle the faces into shape and
texture. In the texture branch, we map the texture to the UV space and design a UV
texture generator to transfer the makeup. Since human faces are symmetrical in the
UV space, we can conveniently remove the undesired shadow and occlusion from the reference
image by carefully designing a Flip Attention Module (FAM). After obtaining cleaner
makeup features from the reference image, a Makeup Transfer Module (MTM) is introduced
to perform accurate makeup transfer. The qualitative and quantitative experiments
demonstrate that our SOGAN not only achieves superior results in shadow and occlusion
situations but also performs well in large pose and expression variations.

SESSION: Reproducibility

Reproducibility Companion Paper: Campus3D: A Photogrammetry Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding

  • Yuqing Liao
  • Xinke Li
  • Zekun Tong
  • Yabang Zhao
  • Andrew Lim
  • Zhenzhong Kuang
  • Cise Midoglu

This companion paper supports the replication of the paper "Campus3D: A Photogrammetry
Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding", which was presented
at ACM Multimedia 2020. The supported paper's main purpose was to provide a photogrammetry
point cloud-based dataset with hierarchical multilabels to facilitate the area of
3D deep learning. Based on this provided dataset and source code, in this work, we
build a complete package to reimplement the proposed methods and experiments (i.e.,
the hierarchical learning framework and the benchmarks of the hierarchical semantic
segmentation task). Specifically, this paper contains the technical details of the
package, including the file structure, dataset preparation, package installation, and
how to run the experiments. We also present the replicated experiment results
and indicate our contributions to the original implementation.

Reproducibility Companion Paper: Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality Assessment

  • Dingquan Li
  • Tingting Jiang
  • Ming Jiang
  • Vajira Lasantha Thambawita
  • Haoliang Wang

This companion paper supports the experimental replication of the paper "Norm-in-Norm
Loss with Faster Convergence and Better Performance for Image Quality Assessment"
presented at ACM Multimedia 2020. We provide the software package for replicating
the implementation of the "Norm-in-Norm" loss and the corresponding "LinearityIQA"
model used in the original paper. This paper contains the guidelines to reproduce
all the experimental results of the original paper.

Reproducibility Companion Paper: Kalman Filter-Based Head Motion Prediction for Cloud-Based Mixed Reality

  • Serhan Gül
  • Sebastian Bosse
  • Dimitri Podborski
  • Thomas Schierl
  • Cornelius Hellge
  • Marc A. Kastner
  • Jan Zahálka

In our MM'20 paper, we presented a Kalman filter-based approach for prediction of
head motion in 6DoF. The proposed approach was employed in our cloud-based volumetric
video streaming system to reduce the interaction latency experienced by the user.
In this companion paper, we present the dataset collected for our experiments and
our simulation framework that reproduces the obtained experimental results. Our implementation
is freely available on GitHub to facilitate further research.
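
As a rough illustration of Kalman filter-based motion prediction, the sketch below tracks a single scalar coordinate with a constant-velocity model and extrapolates it over a look-ahead horizon. It is not the authors' implementation: the motion model, the noise variances, and the class name `ConstantVelocityKalman1D` are assumptions, and a 6DoF system would need one such filter per coordinate (or a joint state) plus proper handling of rotations.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one scalar coordinate with state [position, velocity]."""

    def __init__(self, dt, process_var=1e-3, meas_var=1e-2):
        self.dt = dt            # sampling interval in seconds
        self.q = process_var    # assumed process noise variance (added to both diagonal terms)
        self.r = meas_var       # assumed measurement noise variance
        self.pos, self.vel = 0.0, 0.0
        # Symmetric 2x2 covariance P stored as (p00, p01, p11).
        self.p00, self.p01, self.p11 = 1.0, 0.0, 1.0

    def predict(self):
        """Time update: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        dt = self.dt
        self.pos += dt * self.vel
        p00 = self.p00 + dt * (2.0 * self.p01 + dt * self.p11) + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11

    def update(self, z):
        """Measurement update with a position-only observation, H = [1, 0]."""
        innovation = z - self.pos
        s = self.p00 + self.r                    # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s      # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11

    def look_ahead(self, horizon):
        """Extrapolate the position `horizon` seconds ahead without changing the state."""
        return self.pos + horizon * self.vel


kf = ConstantVelocityKalman1D(dt=0.1)
for i in range(1, 201):
    kf.predict()
    kf.update(2.0 * i * 0.1)   # noise-free samples of a coordinate moving at 2 units/s
```

After the loop the filter has locked onto the constant velocity, so `kf.look_ahead(0.5)` extrapolates the coordinate half a second past the last sample.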

Reproducibility Companion Paper: Blind Natural Video Quality Prediction via Statistical Temporal Features and Deep
Spatial Features

  • Jari Korhonen
  • Yicheng Su
  • Junyong You
  • Steven Hicks
  • Cise Midoglu

Blind natural video quality assessment (BVQA), also known as no-reference video quality
assessment, is a highly active research topic. In our recent contribution titled "Blind
Natural Video Quality Prediction via Statistical Temporal Features and Deep Spatial
Features" published in ACM Multimedia 2020, we proposed a two-level video quality
model employing statistical temporal features and spatial features extracted by a
deep convolutional neural network (CNN) for this purpose. At the time of publishing,
the proposed model (CNN-TLVQM) achieved state-of-the-art results in BVQA. In this
paper, we describe the process of reproducing the published results by using CNN-TLVQM
on two publicly available natural video quality datasets.

Reproducibility Companion Paper: Describing Subjective Experiment Consistency by p-Value P-P Plot

  • Jakub Nawala
  • Lucjan Janowski
  • Bogdan Cmiel
  • Krzysztof Rusek
  • Marc A. Kastner
  • Jan Zahálka

In this paper we reproduce experimental results presented in our earlier work titled
"Describing Subjective Experiment Consistency by p-Value P-P Plot" that was presented
in the course of the 28th ACM International Conference on Multimedia. The paper aims
at verifying the soundness of our prior results and helping others understand our
software framework. We present artifacts that help reproduce tables, figures and all
the data derived from raw subjective responses that were included in our earlier work.
Using the artifacts we show that our results are reproducible. We invite everyone
to use our software framework for subjective responses analyses going beyond reproducibility

Reproducibility Companion Paper: Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

  • Li Tao
  • Xueting Wang
  • Toshihiko Yamasaki
  • Jingjing Chen
  • Steven Hicks

In this companion paper, we provide details of the artifacts to support the replication
of "Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework",
which was presented at MM'20. The Inter-intra Contrastive (IIC) framework aims to
extract more discriminative temporal information by extending intra-negative samples
in contrastive self-supervised learning. In this paper, we first summarize our contribution.
Then we explain the file structure of the source code and detailed settings. Since
our proposal is a framework that contains many different settings, we provide
some custom settings to help other researchers use our methods easily. The source
code is available at

Reproducibility Companion Paper: Visual Relation of Interest Detection

  • Fan Yu
  • Haonan Wang
  • Tongwei Ren
  • Jinhui Tang
  • Gangshan Wu
  • Jingjing Chen
  • Zhenzhong Kuang

In this companion paper, we provide the details of the reproducibility artifacts of
the paper "Visual Relation of Interest Detection" presented at MM'20. Visual Relation
of Interest Detection (VROID) aims to detect visual relations that are important for
conveying the main content of an image. In this paper, we explain the file structure
of the source code and publish the details of our ViROI dataset, which can be used
to retrain the model with custom parameters. We also detail the scripts for component
analysis and comparison with other methods and list the parameters that can be modified
for custom training and inference.

Reproducibility Companion Paper: On Learning Disentangled Representation for Acoustic
Event Detection

  • Lijian Gao
  • Qirong Mao
  • Jingjing Chen
  • Ming Dong
  • Ratna Chinnam
  • Lucile Sassatelli
  • Miguel Romero Rondon
  • Ujjwal Sharma

This companion paper is provided to describe the major experiments reported in our
paper "On Learning Disentangled Representation for Acoustic Event Detection" published
in ACM Multimedia 2019. To make the replication of our work easier, we first give
an introduction to the computing environment in which all of our experiments were conducted.
Furthermore, we provide an environment configuration file to set up the compiling
environment and other artifacts including the source code, datasets and the files
generated during our experiments. Finally, we summarize the structure and usage of
the source code. For more details, please consult the README file in the archive of
artifacts on GitHub:

SESSION: Keynote Talk V&VI

AI and the Future of Education

  • James Lester

It has become clear that AI will profoundly transform society. AI will dramatically
change the socio-technological landscape, produce seismic economic shifts, and fundamentally
reshape the workforce in ways that we are only beginning to grasp. With its imminent
arrival, it is critically important to deeply engage with questions around how we
should design education in the Age of AI. Fortunately, while we must address the significant
challenges posed by AI, we can also leverage AI itself to address these challenges.
In this talk we will consider how (and at what rate) AI technologies for education
will evolve, discuss emerging innovations in AI-augmented learning environments for
formal and informal contexts, and explore what competencies will be elevated in an
AI-pervasive workforce. We will discuss near-future AI technologies that leverage
advances in natural language processing, computer vision, and machine learning to
create narrative-centered learning environments, embodied conversational agents for
learning, and multimodal learning analytics. We will conclude by considering what
all of these developments suggest for K-12 education and the future of human learning.

Digital Human in an Integrated Physical-Digital World (IPhD)

  • Zhengyou Zhang

With the rapid development of digital technologies such as VR, AR, XR, and more importantly
the almost ubiquitous mobile broadband coverage, we are entering an Integrated Physical-Digital
World (IPhD), the tight integration of the virtual world with the physical world. The
IPhD is characterized by four key technologies: Virtualization of the physical world,
Realization of the virtual world, Holographic internet, and Intelligent Agent. The internet
will continue its development with faster speed and broader bandwidth, and will eventually
be able to communicate holographic contents including 3D shape, appearance, spatial
audio, touch sensing and smell. Intelligent agents, such as digital humans and digital/physical
robots, travel between the digital and physical worlds. In this talk, we will describe
our work on digital human for this IPhD world. This includes: computer vision techniques
for building digital humans, multimodal text-to-speech synthesis (voice and lip shapes),
speech-driven face animation, neural-network-based body motion control, human-digital-human
interaction, and an emotional video game anchor.

SESSION: Session 24: Media Interpretation-III

Cross-Camera Feature Prediction for Intra-Camera Supervised Person Re-identification
across Distant Scenes

  • Wenhang Ge
  • Chunyan Pan
  • Ancong Wu
  • Hongwei Zheng
  • Wei-Shi Zheng

Person re-identification (Re-ID) aims to match person images across non-overlapping
camera views. The majority of Re-ID methods focus on small-scale surveillance systems
in which each pedestrian is captured in different camera views of adjacent scenes.
However, in large-scale surveillance systems that cover larger areas, it is required
to track a pedestrian of interest across distant scenes (e.g., a criminal suspect
escapes from one city to another). Since most pedestrians appear in limited local
areas, it is difficult to collect training data with cross-camera pairs of the same
person. In this work, we study intra-camera supervised person re-identification across
distant scenes (ICS-DS Re-ID), which uses cross-camera unpaired data with intra-camera
identity labels for training. It is challenging as cross-camera paired data plays
a crucial role for learning camera-invariant features in most existing Re-ID methods.
To learn camera-invariant representation from cross-camera unpaired training data,
we propose a cross-camera feature prediction method that mines cross-camera self supervision
information from camera-specific feature distributions by transforming fake cross-camera
positive feature pairs and minimizing the distances between the fake pairs. Furthermore,
we automatically localize and extract local-level features with a transformer. Joint
learning of global-level and local-level features forms a global-local cross-camera
feature prediction scheme for mining fine-grained cross-camera self supervision information.
Finally, cross-camera self supervision and intra-camera supervision are aggregated
in a framework. The experiments are conducted in the ICS-DS setting on Market-SCT,
Duke-SCT and MSMT17-SCT datasets. The evaluation results demonstrate the superiority
of our method, which gains significant improvements of 15.4 Rank-1 and 22.3 mAP on
Market-SCT as compared to the second best method. Our code is available at

Video Visual Relation Detection via Iterative Inference

  • Xindi Shang
  • Yicong Li
  • Junbin Xiao
  • Wei Ji
  • Tat-Seng Chua

The core problem of video visual relation detection (VidVRD) lies in accurately classifying
the relation triplets, which consist of the classes of subject and object entities,
and the predicate classes of various relationships between them. Existing VidVRD approaches
classify these three relation components in either an independent or a cascaded manner,
and thus fail to fully exploit the inter-dependency among them. In order to utilize this
inter-dependency in tackling the challenges of visual relation recognition in videos,
we propose a novel iterative relation inference approach for VidVRD. We derive our
model from the viewpoint of joint relation classification which is light-weight yet
effective, and propose a training approach to better learn the dependency knowledge
from the likely correct triplet combinations. As such, the proposed inference approach
is able to gradually refine each component based on its learnt dependency and the
other two's predictions. Our ablation studies show that this iterative relation inference
can empirically converge in a few steps and consistently boost the performance over
baselines. Further, we incorporate it into a newly designed VidVRD architecture, named
VidVRD-II (Iterative Inference), which generalizes well across different datasets.
Experiments show that VidVRD-II achieves state-of-the-art performance on both
the ImageNet-VidVRD and VidOR benchmark datasets.

Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation

  • Jiahui Li
  • Kun Kuang
  • Lin Li
  • Long Chen
  • Songyang Zhang
  • Jian Shao
  • Jun Xiao

Interpreting model knowledge is an essential topic to improve human understanding
of deep black-box models. Traditional methods provide intuitive instance-wise
explanations that allocate importance scores to low-level features (e.g., pixels
for images). To adapt to the human way of thinking, one strand of recent research
has shifted its spotlight to mining important concepts. However, these concept-based
interpretation methods focus on computing the contribution of each discovered concept
at the class level and cannot give precise instance-wise explanations. Besides,
they consider each concept as an independent unit, and ignore the interactions among
concepts. To this end, in this paper, we propose a novel COncept-based NEighbor Shapley
approach (dubbed as CONE-SHAP) to evaluate the importance of each concept by considering
its physical and semantic neighbors, and interpret model knowledge with both instance-wise
and class-wise explanations. Thanks to this design, the interactions among concepts
in the same image are fully considered. Meanwhile, the computational complexity of
Shapley Value is reduced from exponential to polynomial. Moreover, for a more comprehensive
evaluation, we further propose three criteria to quantify the rationality of the allocated
contributions for the concepts, including coherency, complexity, and faithfulness.
Extensive experiments and ablations have demonstrated that our CONE-SHAP algorithm
outperforms existing concept-based methods and simultaneously provides precise explanations
for each instance and class.
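
For context on the complexity claim, the sketch below computes the classic exact Shapley value by enumerating every coalition, which is the exponential-cost baseline that neighbor-based approximations aim to avoid. It is not the CONE-SHAP algorithm; the concept names and the toy value function `toy_value` are made up for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values; cost grows as 2^(n-1) value evaluations per player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical model score: 'wheel' and 'window' only help together; 'sky' adds 0.2."""
    score = 0.0
    if "wheel" in coalition and "window" in coalition:
        score += 1.0
    if "sky" in coalition:
        score += 0.2
    return score


phi = exact_shapley(["wheel", "window", "sky"], toy_value)
```

Here the interacting concepts split their joint contribution equally (0.5 each), the independent concept keeps its own 0.2, and the values sum to the score of the full concept set.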

Multifocal Attention-Based Cross-Scale Network for Image De-raining

  • Zheyu Zhang
  • Yurui Zhu
  • Xueyang Fu
  • Zhiwei Xiong
  • Zheng-Jun Zha
  • Feng Wu

Although existing deep learning-based image de-raining methods have achieved promising
results, most of them only extract single-scale features, and neglect the fact that
similar rain streaks appear repeatedly across different scales. Therefore, this paper
aims to explore the cross-scale cues in a multi-scale fashion. Specifically, we first
introduce an adaptive-kernel pyramid to provide effective multi-scale information.
Then, we design two cross-scale similarity attention blocks (CSSABs) to search spatial
and channel relationships between two scales, respectively. The spatial CSSAB explores
the spatial similarity between pixels of cross-scale features, while the channel CSSAB
emphasizes the interdependencies among cross-scale features. To further improve the
diversity of features, we adopt the wavelet transformation and multi-head mechanism
in CSSABs to generate multifocal features which focus on different areas. Finally,
based on our CSSABs, we construct an effective multifocal attention-based cross-scale
network, which exhaustively utilizes the cross-scale correlations of both rain streaks
and background, to achieve image de-raining. Experiments show the superiority of our
network over state-of-the-art image de-raining approaches both qualitatively and quantitatively.
The source code and pre-trained models are available at

PFFN: Progressive Feature Fusion Network for Lightweight Image Super-Resolution

  • Dongyang Zhang
  • Changyu Li
  • Ning Xie
  • Guoqing Wang
  • Jie Shao

Recently, convolutional neural networks (CNNs) have been the core ingredient of modern
models, triggering the surge of deep learning in super-resolution (SR). Despite the
great success of these CNN-based methods, which tend to be ever deeper and heavier,
it is impractical to directly apply them to low-budget devices due
to their excessive computational overhead. To alleviate this problem, a novel lightweight
SR network named progressive feature fusion network (PFFN) is developed to seek a
better balance between performance and running efficiency. Specifically, to fully
exploit the feature maps, a novel progressive attention block (PAB) is proposed as
the main building block of PFFN. The proposed PAB adopts several parallel but connected
paths with pixel attention, which could significantly increase the receptive field
of each layer, distill useful information and finally learn more discriminative feature
representations. In PAB, a powerful dual attention module (DAM) is further incorporated
to provide channel and spatial attention mechanisms in a fairly lightweight manner.
Besides, we construct a concise and effective upsampling module, named MPAU, with the help
of multi-scale pixel attention. All of the above modules ensure the network
can benefit from attention mechanisms while still being lightweight enough. Furthermore,
a novel training strategy following the cosine annealing learning scheme is proposed
to maximize the representation ability of the model. Comprehensive experiments show
that our PFFN achieves the best performance against all existing lightweight state-of-the-art
SR methods with fewer parameters and even performs comparably to computationally
expensive networks.
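
The abstract does not spell out the training schedule, so the following is only a sketch of the standard cosine annealing rule, lr(t) = lr_min + 0.5 (lr_max - lr_min)(1 + cos(pi t / T)), without restarts. The function name and the example learning rates are assumptions, not values from the paper.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine period."""
    step = min(max(step, 0), total_steps)            # clamp to the schedule range
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Hypothetical settings: 1000 steps, decaying from 1e-3 to 1e-6.
schedule = [cosine_annealing_lr(t, 1000, 1e-3, 1e-6) for t in range(1001)]
```

The schedule starts at `lr_max`, passes the midpoint of the two rates halfway through, and ends at `lr_min`, changing slowly at both ends and fastest in the middle.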

InterBN: Channel Fusion for Adversarial Unsupervised Domain Adaptation

  • Mengzhu Wang
  • Wei Wang
  • Baopu Li
  • Xiang Zhang
  • Long Lan
  • Huibin Tan
  • Tianyi Liang
  • Wei Yu
  • Zhigang Luo

A classifier trained on one dataset rarely works on other datasets obtained under
different conditions because of domain shifting. Such a problem is usually solved
by domain adaptation methods. In this paper, we propose a novel unsupervised domain
adaptation (UDA) method based on Interchangeable Batch Normalization (InterBN) to
fuse different channels in deep neural networks for adversarial domain adaptation. Specifically,
we first observe that the channels with small batch normalization scaling factors have
less influence on the whole domain adaptation, followed by a theoretical proof that
the scaling factors for some channels will definitely come close to zero when imposing
a sparsity regularization. Then, we replace the channels that have smaller scaling
factors in the source domain with the mean of the channels which have larger scaling
factors in the target domain or vice versa. Such a simple but effective channel fusion
scheme can drastically increase the domain adaptation ability. Extensive experimental
results show that our InterBN outperforms the current adversarial domain
adaptation methods by a large margin on four visual benchmarks. In particular, InterBN
achieves a remarkable improvement of 7.7% over the conditional adversarial adaptation
networks (CDAN) on VisDA-2017 benchmark.
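
One possible reading of the channel fusion step is sketched below. It is a simplification under stated assumptions, not the authors' implementation: each channel is reduced to a single scalar activation, "small" and "large" scaling factors are separated by a fixed hypothetical `threshold`, and the function name `fuse_weak_channels` is made up.

```python
def fuse_weak_channels(own_feat, own_gamma, other_feat, other_gamma, threshold):
    """Replace one domain's weak channels with the mean of the other domain's strong channels.

    own_feat / other_feat: per-channel activations (one float per channel, a simplification).
    own_gamma / other_gamma: batch normalization scaling factors of the same channels.
    A channel counts as weak when |gamma| < threshold (assumed criterion).
    """
    strong_other = [f for f, g in zip(other_feat, other_gamma) if abs(g) >= threshold]
    if not strong_other:
        return list(own_feat)                 # nothing to borrow, keep features unchanged
    fill = sum(strong_other) / len(strong_other)
    return [fill if abs(g) < threshold else f for f, g in zip(own_feat, own_gamma)]


src_feat, src_gamma = [1.0, 2.0, 3.0], [0.9, 0.01, 0.5]
tgt_feat, tgt_gamma = [10.0, 20.0, 30.0], [0.8, 0.02, 0.6]

fused_src = fuse_weak_channels(src_feat, src_gamma, tgt_feat, tgt_gamma, threshold=0.1)
# The "vice versa" direction swaps the roles of the two domains.
fused_tgt = fuse_weak_channels(tgt_feat, tgt_gamma, src_feat, src_gamma, threshold=0.1)
```

In the toy data only the middle channel has a near-zero scaling factor in each domain, so it alone is overwritten with the mean of the other domain's two strong channels.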

SESSION: Session 25: Multimedia Art, Entertainment and Culture

Learning to Compose Stylistic Calligraphy Artwork with Emotions

  • Shaozu Yuan
  • Ruixue Liu
  • Meng Chen
  • Baoyang Chen
  • Zhijie Qiu
  • Xiaodong He

Emotion plays a critical role in calligraphy composition: it makes a calligraphy
artwork impressive and gives it a soul. However, previous research on calligraphy generation
has neglected emotion as a major contributor to the artistry of calligraphy. This
defect prevents such methods from generating aesthetic, stylistic, and diverse calligraphy
artworks; they produce only static handwriting font libraries instead. To address this problem,
we propose a novel cross-modal approach to automatically generate stylistic and diverse
Chinese calligraphy artwork driven by different emotions. We first detect
the emotions in the text with a classifier, then generate the emotional Chinese character
images via a novel modified Generative Adversarial Network (GAN) structure, and finally
predict the layout for all character images with a recurrent neural network. We
also collect a large-scale stylistic Chinese calligraphy image dataset with rich emotions.
Experimental results demonstrate that our model outperforms all baseline image translation
models significantly for different emotional styles in terms of content accuracy and
style discrepancy. Besides, our layout algorithm can also learn the patterns and habits
of calligraphers, which makes the generated calligraphy more artistic. To the best of
our knowledge, we are the first to work on emotion-driven discourse-level Chinese
calligraphy artwork composition.

Graph Neural Networks for Knowledge Enhanced Visual Representation of Paintings

  • Athanasios Efthymiou
  • Stevan Rudinac
  • Monika Kackovic
  • Marcel Worring
  • Nachoem Wijnberg

We propose ArtSAGENet, a novel multimodal architecture that integrates Graph Neural
Networks (GNNs) and Convolutional Neural Networks (CNNs), to jointly learn visual
and semantic-based artistic representations. First, we illustrate the significant
advantages of multi-task learning for fine art analysis and argue that it is conceptually
a much more appropriate setting in the fine art domain than the single-task alternatives.
We further demonstrate that several GNN architectures can outperform strong CNN baselines
in a range of fine art analysis tasks, such as style classification, artist attribution,
creation period estimation, and tag prediction, while training them requires an order
of magnitude less computational time and only a small amount of labeled data. Finally,
through extensive experimentation we show that our proposed ArtSAGENet captures and
encodes valuable relational dependencies between the artists and the artworks, surpassing
the performance of traditional methods that rely solely on the analysis of visual
content. Our findings underline a great potential of integrating visual content and
semantics for fine art analysis and curation.

ArtScience and the ICECUBE LED Display [ILDm^3]

  • Mark-David Hosale
  • Robert Allison
  • Jim Madsen
  • Marcus Gordon

ICECUBE LED Display [ILDm^3] is a cubic-meter, 1/1000th scale model of the IceCube
Neutrino Observatory, a novel telescope that looks for nearly invisible cosmic messengers,
neutrinos, using a cubic-kilometer of instrumented ice starting 1450 meters below
the surface at the South Pole. The display uses art methodologies as a means for expressing
imperceptible astrophysical events as sound, light and colour in the domain of the
human sensorium. The experience is as aesthetically critical as it is facilitatory
to an intuitive understanding of subatomic astrophysical data, leading to new ways
of knowing about our Universe and its processes.

The objective of this project was to build a static volumetric display as a model
of IceCube for visualization of spatio-temporal data recorded by the observatory.
While the primary use of the display is as a model for artistic, educational, and
outreach purposes, the display is also being explored as an instrument for the scientific
analysis of IceCube data sets by human observers. The technical approach to designing
the display was to place an emphasis on reproducibility so that it can be readily
built and used by the researchers in the IceCube research community. Evaluation
of the display is being used as a baseline for the development of future exhibits.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated Content

  • Guo Li
  • Baoliang Chen
  • Lingyu Zhu
  • Qinwen He
  • Hongfei Fan
  • Shiqi Wang

Recent years have witnessed a surge of professional user-generated content (PUGC)
based video services, coinciding with the accelerated proliferation of video acquisition
devices such as mobile phones, wearable cameras, and unmanned aerial vehicles. Unlike
traditional UGC videos, which are shot impromptu, PUGC videos produced by professional
users tend to be carefully designed and edited, and they attract high popularity and
relatively high play counts. In this paper, we conduct a systematic and comprehensive
study on the perceptual quality of PUGC videos and introduce a database consisting
of 10,000 PUGC videos with subjective ratings. In particular, during the subjective
testing, we collect the human opinions based upon not only the MOS, but also the attributes
that could potentially influence the visual quality including face, noise, blur, brightness,
and color. We analyze the large-scale PUGC database with a series
of video quality assessment (VQA) algorithms and further present a dedicated baseline
model based on a pretrained deep neural network. The cross-dataset experiments
reveal a large domain gap between PUGC and traditional user-generated videos,
which is critical in learning-based VQA. These results shed light on developing next-generation
PUGC quality assessment algorithms with desired properties including promising generalization
capability, high accuracy, and effectiveness in perceptual optimization. The dataset
and the codes are released at

Combining Attention with Flow for Person Image Synthesis

  • Yurui Ren
  • Yubo Wu
  • Thomas H. Li
  • Shan Liu
  • Ge Li

Pose-guided person image synthesis aims to synthesize person images by transforming
reference images into target poses. In this paper, we observe that the commonly used
spatial transformation blocks have complementary advantages. We propose a novel model
by combining the attention operation with the flow-based operation. Our model not
only takes advantage of the attention operation to generate accurate target structures
but also uses the flow-based operation to sample realistic source textures. Both objective
and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive
ablation studies verify our hypotheses and show the efficacy of the proposed modules.
Besides, additional experiments on the portrait image editing task demonstrate the
versatility of the proposed combination.

Dual Learning Music Composition and Dance Choreography

  • Shuang Wu
  • Zhenguang Liu
  • Shijian Lu
  • Li Cheng

Music and dance have always co-existed as pillars of human activities, contributing
immensely to the cultural, social, and entertainment functions in virtually all societies.
Notwithstanding the gradual systematization of music and dance into two independent
disciplines, their intimate connection is undeniable and one art-form often appears
incomplete without the other. Recent research works have studied generative models
for dance sequences conditioned on music. The dual task of composing music for given
dances, however, has been largely overlooked. In this paper, we propose a novel extension,
where we jointly model both tasks in a dual learning approach. To leverage the duality
of the two modalities, we introduce an optimal transport objective to align feature
embeddings, as well as a cycle consistency loss to foster overall consistency. Experimental
results demonstrate that our dual learning framework improves individual task performance,
delivering generated music compositions and dance choreographies that are realistic
and faithful to the conditioned inputs.

SESSION: Session 26: Open Source Competition

MMFashion: An Open-Source Toolbox for Visual Fashion Analysis

  • Xin Liu
  • Jiancheng Li
  • Jiaqi Wang
  • Ziwei Liu

We present MMFashion, a comprehensive, flexible and user-friendly open-source visual
fashion analysis toolbox based on PyTorch. This toolbox supports a wide spectrum of
fashion analysis tasks, including Fashion Attribute Prediction, Fashion Recognition
and Retrieval, Fashion Landmark Detection, Fashion Parsing and Segmentation and Fashion
Compatibility and Recommendation. It covers almost all the mainstream tasks in the fashion
analysis community. MMFashion has several appealing properties. Firstly, MMFashion
follows the principle of modular design. The framework is decomposed into different
components so that it is easily extensible for diverse customized modules. In addition,
detailed documentation, demo scripts and off-the-shelf models are available, which
ease the burden on non-expert users of leveraging the recent advances in deep learning-based
fashion analysis. Our proposed MMFashion is currently the most complete platform for
visual fashion analysis in the deep learning era, with more functionalities to be added.
This toolbox and the benchmark could serve the flourishing research community by providing
a flexible toolkit to deploy existing models and develop new ideas and approaches.
We welcome all contributions to this still-growing effort towards open science:

Efficient Reinforcement Learning Development with RLzoo

  • Zihan Ding
  • Tianyang Yu
  • Hongming Zhang
  • Yanhua Huang
  • Guo Li
  • Quancheng Guo
  • Luo Mai
  • Hao Dong

Many multimedia developers are exploring the adoption of Deep Reinforcement Learning
(DRL) techniques in their applications. However, they often find such adoption challenging.
Existing DRL libraries provide poor support for prototyping DRL agents (i.e., models),
customising the agents, and comparing the performance of DRL agents. As a result,
the developers often report low efficiency in developing DRL agents. In this paper,
we introduce RLzoo, a new DRL library that aims to make the development of DRL agents
efficient. RLzoo provides developers with (i) high-level yet flexible APIs for prototyping
DRL agents, and further customising the agents for best performance, (ii) a model
zoo where users can import a wide range of DRL agents and easily compare their performance,
and (iii) an algorithm that can automatically construct DRL agents with custom components
(which are critical to improving an agent's performance in custom applications). Evaluation
results show that RLzoo can effectively reduce the development cost of DRL agents,
while achieving comparable performance with existing DRL libraries.

Fast and Flexible Human Pose Estimation with HyperPose

  • Yixiao Guo
  • Jiawei Liu
  • Guo Li
  • Luo Mai
  • Hao Dong

Estimating human pose is an important yet challenging task in multimedia applications.
Existing pose estimation libraries target reproducing standard pose estimation algorithms.
When it comes to customising these algorithms for real-world applications, none of
the existing libraries can offer both the flexibility of developing custom pose estimation
algorithms and the high performance of executing these algorithms on commodity devices.
In this paper, we introduce HyperPose, a novel, flexible and high-performance pose
estimation library. HyperPose provides expressive Python APIs that enable developers
to easily customise pose estimation algorithms for their applications. It further
provides a model inference engine highly optimised for real-time pose estimation.
This engine can dynamically dispatch carefully designed pose estimation tasks to CPUs
and GPUs, thus automatically achieving high utilisation of hardware resources irrespective
of deployment environments. Extensive evaluation results show that HyperPose can achieve
3.1x to 7.3x higher pose estimation throughput compared to state-of-the-art pose
estimation libraries without compromising estimation accuracy. By 2021, HyperPose
has received over 1000 stars on GitHub and attracted users from both industry and

SmartEye: An Open Source Framework for Real-Time Video Analytics with Edge-Cloud Collaboration

  • Xuezhi Wang
  • Guanyu Gao

Video analytics with Deep Neural Networks (DNNs) empowers many vision-based applications.
However, deploying DNN models for video analytics services must address the challenges
of computational capacity, service delay, and cost. Leveraging the edge-cloud collaboration
to address these problems has become a growing trend. This paper provides the multimedia
research community with an open source framework named SmartEye for real-time video
analytics by leveraging the edge-cloud collaboration. The system consists of 1) an
edge layer which enables video preprocessing, model selection, on-edge inference,
and task offloading; 2) a request forwarding layer which serves as a gateway of the
cloud and forwards the offloaded tasks to backend workers; and 3) a backend worker
layer that processes the offloaded tasks with specified DNN models. One can easily
customize the policies for preprocessing, offloading, model selection, and request
forwarding. The framework can facilitate research and development in this field. The
project is released as an open source project on GitHub at
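
The offloading decision made by the edge layer can be illustrated with a minimal sketch. The abstract does not specify SmartEye's policies or APIs, so everything here is an assumption made for illustration: the `Task` fields, the `should_offload` and `route` helpers, and the delay-comparison rule are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of one video frame to analyze."""
    frame_bytes: int       # size of the preprocessed frame to upload
    edge_infer_ms: float   # estimated inference time on the edge device
    cloud_infer_ms: float  # estimated inference time on a backend worker


def should_offload(task: Task, uplink_bytes_per_ms: float, rtt_ms: float) -> bool:
    """Assumed policy: offload when the estimated cloud delay is lower."""
    transfer_ms = task.frame_bytes / uplink_bytes_per_ms
    cloud_total_ms = rtt_ms + transfer_ms + task.cloud_infer_ms
    return cloud_total_ms < task.edge_infer_ms


def route(task: Task, uplink_bytes_per_ms: float, rtt_ms: float) -> str:
    """Return where the task would run under the assumed policy."""
    return "cloud" if should_offload(task, uplink_bytes_per_ms, rtt_ms) else "edge"


task = Task(frame_bytes=100_000, edge_infer_ms=80.0, cloud_infer_ms=10.0)
fast_link = route(task, uplink_bytes_per_ms=5000.0, rtt_ms=20.0)
slow_link = route(task, uplink_bytes_per_ms=1000.0, rtt_ms=20.0)
```

With these illustrative numbers the same frame goes to the cloud on the faster uplink and stays on the edge on the slower one. A real policy would also weigh cost and model accuracy, which the abstract names as customizable.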

ZoomSense: A Scalable Infrastructure for Augmenting Zoom

  • Tom Bartindale
  • Peter Chen
  • Harrison Marshall
  • Stanislav Pozdniakov
  • Dan Richardson

We have seen a dramatic increase in the adoption of teleconferencing systems such
as Zoom for remote teaching and working. Although designed primarily for traditional
video conferencing scenarios, these platforms are actually being deployed in many
diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee
participation and often hinders participant agency. We introduce ZoomSense: an open-source,
scalable infrastructure built upon 'virtual meeting participants', which exposes real-time
meta-data, meeting content and host controls through an easy-to-use abstraction,
so that developers can rapidly and sustainably augment Zoom.

Efficient Graph Deep Learning in TensorFlow with tf_geometric

  • Jun Hu
  • Shengsheng Qian
  • Quan Fang
  • Youze Wang
  • Quan Zhao
  • Huaiwen Zhang
  • Changsheng Xu

We introduce tf_geometric, an efficient and friendly library for graph deep learning,
which is compatible with both TensorFlow 1.x and 2.x. It provides kernel libraries
for building Graph Neural Networks (GNNs) as well as implementations of popular GNNs.
The kernel libraries consist of infrastructures for building efficient GNNs, including
graph data structures, graph map-reduce framework, graph mini-batch strategy, etc.
These infrastructures enable tf_geometric to support single-graph computation, multi-graph
computation, graph mini-batch, distributed training, etc.; therefore, tf_geometric
can be used for a variety of graph deep learning tasks, such as node classification,
link prediction, and graph classification. Based on the kernel libraries, tf_geometric
implements a variety of popular GNN models. To facilitate the implementation of GNNs,
tf_geometric also provides some other libraries for dataset management, graph sampling,
etc. Different from existing popular GNN libraries, tf_geometric provides not only
Object-Oriented Programming (OOP) APIs, but also Functional APIs, which enable tf_geometric
to handle advanced tasks such as graph meta-learning. The APIs are friendly and suitable
for both beginners and experts.

FaceX-Zoo: A PyTorch Toolbox for Face Recognition

  • Jun Wang
  • Yinglu Liu
  • Yibo Hu
  • Hailin Shi
  • Tao Mei

Due to the remarkable progress in recent years, deep face recognition is in great
need of public support for practical model production and further exploration. The
demands are threefold: 1) a modular training scheme, 2) standard and
automatic evaluation, and 3) groundwork for deployment. To meet these demands, we present
a novel open-source project, named FaceX-Zoo, which is constructed with a modular and
scalable design, and oriented to the academic and industrial community of face-related
analysis. FaceX-Zoo provides 1) the training module with various choices of backbone
and supervisory head; 2) the evaluation module that enables standard and automatic
test on most popular benchmarks; 3) the module of simple yet fully functional face
SDK for the validation and primary application of end-to-end face recognition; 4)
the additional module that integrates a group of useful tools. Based on these easy-to-use
modules, FaceX-Zoo can help the community to easily build state-of-the-art solutions
for deep face recognition and for newly emerged challenges such as masked face
recognition caused by the worldwide COVID-19 pandemic. Besides, FaceX-Zoo can be easily
upgraded and scaled up along with further exploration in face related fields. The
source codes and models have been released and received over 900 stars at

PyTorchVideo: A Deep Learning Library for Video Understanding

  • Haoqi Fan
  • Tullie Murrell
  • Heng Wang
  • Kalyan Vasudev Alwala
  • Yanghao Li
  • Yilei Li
  • Bo Xiong
  • Nikhila Ravi
  • Meng Li
  • Haichuan Yang
  • Jitendra Malik
  • Ross Girshick
  • Matt Feiszli
  • Aaron Adcock
  • Wan-Yen Lo
  • Christoph Feichtenhofer

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich
set of modular, efficient, and reproducible components for a variety of video understanding
tasks, including classification, detection, self-supervised learning, and low-level
processing. The library covers a full stack of video understanding tools including
multimodal data loading, transformations, and models that reproduce state-of-the-art
performance. PyTorchVideo further supports hardware acceleration that enables real-time
inference on mobile devices. The library is based on PyTorch and can be used by any
training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo
is available at

AICoacher: A System Framework for Online Realtime Workout Coach

  • Haocong Ying
  • Tie Liu
  • Mingxin Ai
  • Jiali Ding
  • Yuanyuan Shang

There is a growing demand for online fitness due to the impact of the epidemic. This
paper presents a real-time online fitness system framework called AICoacher, which
offers different online coaches. The framework constructs an extensible AI-based architecture
that supports a variety of fitness movements. Firstly, key frames of motion are extracted
automatically, and the feature vectors are calculated with the body pose points. Secondly,
the state transition matrix can effectively identify fitness actions and capture their
time-continuous characteristics. Finally, AICoacher can accurately provide the number
of repetitions and correction tips of fitness movements. Currently, the AICoacher
has a number of fitness courses supported by online coaches and has been tested on
hundreds of fitness movements. The code can be downloaded from
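
The repetition counting described above can be sketched with a small state-transition example. This is only an illustration under stated assumptions: the per-frame state labels (`"up"`, `"down"`), the helper names, and the rule that one repetition equals one `down -> up` transition are hypothetical and are not taken from the AICoacher code.

```python
from collections import defaultdict


def transition_counts(states):
    """Count how often each (previous, current) state pair occurs."""
    counts = defaultdict(int)
    for prev, cur in zip(states, states[1:]):
        counts[(prev, cur)] += 1
    return counts


def count_repetitions(states, start="down", end="up"):
    """Assumed rule: one repetition per start -> end transition."""
    return transition_counts(states)[(start, end)]


# Hypothetical per-key-frame pose states for a squat exercise.
squat_states = ["up", "down", "up", "down", "down", "up"]
reps = count_repetitions(squat_states)
```

Here the sequence contains two `down -> up` transitions, so `reps` is 2. Repeated `down` frames do not add repetitions, which is how a transition-based count stays robust to a movement being held over several frames.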

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

  • Zhanghui Kuang
  • Hongbin Sun
  • Zhizhong Li
  • Xiaoyu Yue
  • Tsui Hin Lin
  • Jianyong Chen
  • Huaqiang Wei
  • Yiqin Zhu
  • Tong Gao
  • Wenwei Zhang
  • Kai Chen
  • Wayne Zhang
  • Dahua Lin

We present MMOCR, an open-source toolbox that provides a comprehensive pipeline
for text detection and recognition, as well as their downstream tasks such as named
entity recognition and key information extraction. MMOCR implements 14 state-of-the-art
algorithms, which is significantly more than all the existing open-source OCR projects
we are aware of to date. To facilitate future research and industrial applications
of text recognition-related problems, we also provide a large number of trained models
and detailed benchmarks to give insights into the performance of text detection, recognition
and understanding. MMOCR is publicly released at

A Complete End to End Open Source Toolchain for the Versatile Video Coding (VVC) Standard

  • Adam Wieckowski
  • Christian Lehmann
  • Benjamin Bross
  • Detlev Marpe
  • Thibaud Biatek
  • Mikael Raulet
  • Jean Le Feuvre

Versatile Video Coding (VVC) is the most recent international video coding standard
jointly developed by ITU-T and ISO/IEC, which was finalized in July 2020. VVC
allows for significant bit-rate reductions of around 50% for the same subjective video
quality compared to its predecessor, High Efficiency Video Coding (HEVC). One year
after finalization, VVC support in devices and chipsets is still under development,
which is aligned with the typical development cycles of new video coding standards.
This paper presents open-source software packages that allow building a complete VVC
end-to-end toolchain already one year after its finalization. This includes the Fraunhofer
HHI VVenC library for fast and efficient VVC encoding as well as HHI's VVdeC library
for live decoding. An experimental integration of VVC in the GPAC software tools and
FFmpeg media framework allows packaging VVC bitstreams, e.g. encoded with VVenC, in
MP4 file format and using DASH for content creation and streaming. The integration
of VVdeC allows playback on the receiver. Given these packages, step-by-step tutorials
are provided for two possible application scenarios: VVC file encoding plus playback
and adaptive streaming with DASH.

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

  • Yehao Li
  • Yingwei Pan
  • Jingwen Chen
  • Ting Yao
  • Tao Mei

With the rise and development of deep learning over the past decade, there has been
a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art
of cross-modal analytics between vision and language in the multimedia field. Nevertheless,
there has not been an open-source codebase in support of training and deploying numerous
neural network models for cross-modal analytics in a unified and modular fashion.
In this work, we propose X-modaler, a versatile and high-performance codebase that
encapsulates the state-of-the-art cross-modal analytics into several general-purpose
stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode
strategy). Each stage is empowered with the functionality that covers a series of
modules widely adopted in state-of-the-art methods and allows seamless switching between them.
This design naturally enables a flexible implementation of state-of-the-art algorithms
for image captioning, video captioning, and vision-language pre-training, aiming to
facilitate the rapid development of the research community. Meanwhile, since the effective
modular designs in several stages (e.g., cross-modal interaction) are shared across
different vision-language tasks, X-modaler can be simply extended to power startup
prototypes for other tasks in cross-modal analytics, including visual question answering,
visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed
codebase, and its source codes, sample projects and pre-trained models are available

Interpreting Super-Resolution CNNs for Sub-Pixel Motion Compensation in Video Coding

  • Luka Murn
  • Alan F. Smeaton
  • Marta Mrak

Machine learning approaches for more efficient video compression have been developed
thanks to breakthroughs in deep learning. However, they typically bring coding improvements
at the cost of significant increases in computational complexity, making them largely
unsuitable for practical applications. In this paper, we present open-source software
for convolutional neural network-based solutions which improve the interpolation of
reference samples needed for fractional precision motion compensation. Contrary to
previous efforts, the networks are fully linear, allowing them to be interpreted,
with a full interpolation filter set derived from trained models, making it simple
to integrate in conventional video coding schemes. When implemented in the context
of the state-of-the-art Versatile Video Coding (VVC) test model, the complexity of
the learned interpolation schemes is significantly reduced compared to the interpolation
with full neural networks, while achieving notable coding efficiency improvements
on lower resolution video sequences. The open-source software package is available
at under the 3-clause BSD
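
Why a fully linear network can be read off as an interpolation filter can be shown with a short sketch. It assumes a simplified setting that is not taken from the paper's software: one channel, 1-D signals, no biases, and two made-up layer kernels. Under those assumptions, stacked linear convolutions collapse into a single equivalent filter because convolution is associative.

```python
def conv_full(a, b):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Hypothetical kernels of two linear layers (no activation, no bias).
layer1 = [0.25, 0.5, 0.25]
layer2 = [-0.5, 2.0, -0.5]
signal = [1.0, 4.0, 2.0, 8.0, 5.0]

# Running the signal through both layers in sequence...
two_pass = conv_full(conv_full(signal, layer1), layer2)

# ...gives the same result as one pass with the collapsed filter.
single_filter = conv_full(layer1, layer2)
one_pass = conv_full(signal, single_filter)

max_diff = max(abs(p - q) for p, q in zip(two_pass, one_pass))
```

The collapsed `single_filter` plays the role of the derived interpolation filter: it can be inspected directly and applied at the cost of one filtering pass instead of a full network evaluation. Any nonlinearity between the layers would break this equivalence.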

SESSION: Session 27: Multimedia Search and Recommendation-I

Towards Accurate Localization by Instance Search

  • Yi-Geng Hong
  • Hui-Chu Xiao
  • Wan-Lei Zhao

Visual object localization is the key step in a series of object detection tasks.
In the literature, high localization accuracy is achieved with the mainstream strongly
supervised frameworks. However, such methods require object-level annotations and
are unable to detect objects of unknown categories. Weakly supervised methods face
similar difficulties. In this paper, a self-paced learning framework is proposed to
achieve accurate object localization on the rank list returned by instance search.
The proposed framework mines the target instance gradually from the queries and their
corresponding top-ranked search results. Since a common instance is shared between
the query and the images in the rank list, the target visual instance can be accurately
localized even without knowing what the object category is. In addition to performing
localization on instance search, the issue of few-shot object detection is also addressed
under the same framework. Superior performance over state-of-the-art methods is observed
on both tasks.

Database-adaptive Re-ranking for Enhancing Cross-modal Image Retrieval

  • Rintaro Yanagi
  • Ren Togo
  • Takahiro Ogawa
  • Miki Haseyama

We propose an approach that enhances arbitrary existing cross-modal image retrieval
performance. Most cross-modal image retrieval methods focus on accurately computing
similarities directly between a text query and candidate images.
However, their retrieval performance is affected by the ambiguity of text queries
and the bias of target databases (DBs). Dealing with ambiguous text queries and DBs
with bias will lead to accurate cross-modal image retrieval in real-world applications.
A DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary
cross-modal image retrieval methods for enhancing their performance, is proposed in
this paper. The proposed method includes two approaches: "DB-adaptive re-ranking"
and "modality-driven clue information extraction". Our method estimates clue information
that can effectively clarify the desired image from the whole set of a target DB and
then receives the user's feedback for the estimated information. Furthermore, our method
extracts more detailed information of a query text and a target DB by focusing on
modality-driven spaces, and it enables more accurate re-ranking. Our method allows
users to reach their desired single image by just answering questions. Experimental
results using MSCOCO, Visual Genome and newly introduced datasets including images
with a particular object show that the proposed method can enhance the performance
of state-of-the-art cross-modal image retrieval methods.

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

  • Ning Han
  • Jingjing Chen
  • Guangyi Xiao
  • Hao Zhang
  • Yawen Zeng
  • Hao Chen

Despite the recent progress of cross-modal text-to-video retrieval techniques, their
performance is still unsatisfactory. Most existing works follow a trend of learning
a joint embedding space to measure the distance between global-level or local-level
textual and video representation. The fine-grained interactions between video segments
and phrases are usually neglected in cross-modal learning, which results in suboptimal
retrieval performance. To tackle the problem, we propose a novel Fine-grained Cross-modal
Alignment Network (FCA-Net), which considers the interactions between visual semantic
units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal
alignment. Specifically, the interactions between visual semantic units and phrases
are formulated as a link prediction problem optimized by a graph auto-encoder to obtain
the explicit relations between them and enhance the aligned feature representation
for fine-grained cross-modal alignment. Experimental results on MSR-VTT, YouCook2,
and VATEX datasets demonstrate the superiority of our model as compared to the state-of-the-art

Meta Self-Paced Learning for Cross-Modal Matching

  • Jiwei Wei
  • Xing Xu
  • Zheng Wang
  • Guoqing Wang

Cross-modal matching has attracted growing attention due to the rapid emergence of
multimedia data on the web and in social applications. Recently, many re-weighting
methods have been proposed for accelerating model training by designing a mapping
function from similarity scores to weights. However, these re-weighting methods are
difficult to apply universally in practice since manually pre-set weighting functions
inevitably involve hyper-parameters. In this paper, we propose a Meta Self-Paced Network
(Meta-SPN) that automatically learns a weighting scheme from data for cross-modal
matching. Specifically, a meta self-paced network composed of a fully connected neural
network is designed to fit the weight function, which takes the similarity score of
the sample pairs as input and outputs the corresponding weight value. Our meta self-paced
network considers not only the self-similarity scores, but also their potential interactions
(e.g., relative-similarity) when learning the weights. Motivated by the success of
meta-learning, we use the validation set to update the meta self-paced network during
the training of the matching network. Experiments on two image-text matching benchmarks
and two video-text matching benchmarks demonstrate the generalization and effectiveness
of our method.
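
The idea of replacing a hand-set weighting function with a small network can be sketched as follows. This is a forward pass only, under assumptions that do not come from the paper: one hidden ReLU layer, a sigmoid output, placeholder parameter values, and the helper names `weight_net` and `weighted_loss`. The meta-update on the validation set and the relative-similarity inputs described in the abstract are omitted.

```python
import math


def weight_net(similarity, w1, b1, w2, b2):
    """Map a similarity score to a weight in (0, 1).

    Assumed architecture: one hidden ReLU layer and a sigmoid output.
    """
    hidden = [max(0.0, w * similarity + b) for w, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def weighted_loss(pair_losses, similarities, params):
    """Average the per-pair losses after re-weighting each one."""
    weights = [weight_net(s, *params) for s in similarities]
    return sum(w * l for w, l in zip(weights, pair_losses)) / len(pair_losses)


# Placeholder parameters; in a meta-learning setup these would be learned.
params = ([1.5, -2.0], [0.0, 1.0], [1.0, -1.0], -0.5)

similarities = [0.8, 0.2, 0.5]
pair_losses = [0.3, 1.2, 0.7]
loss = weighted_loss(pair_losses, similarities, params)
```

Because the weights come from a parametric function rather than a fixed formula, no weighting hyper-parameters need to be tuned by hand; the parameters in `params` stand in for what the meta network would learn from data.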

CausalRec: Causal Inference for Visual Debiasing in Visually-Aware Recommendation