MM '21: Proceedings of the 29th ACM International Conference on Multimedia

SESSION: Keynote Talks I&II

Video Coding for Machine

Weno Gao

Video coding systems were first developed for TV broadcasting services over satellite and cable
networks with limited bandwidth, and were later used for surveillance video and internet
video. They target a higher compression ratio with lower quality loss, under the
trade-off of the RDO (rate-distortion optimization) model, as judged by human experts. In
other words, current video coding standards are designed for people and human visual perception,
not for machine intelligence. Today, however, more and more industrial applications
require video coding for machines (VCM), which compresses images and video
for machine usage such as object detection and/or tracking, image classification, and event
analysis. These applications target a higher compression ratio with higher recognition accuracy,
under the trade-off of the RAO (rate-accuracy optimization) model, as judged by the system. In
this case, video coding needs to perform feature compression, which preserves and transmits
the information most critical for computer vision and pattern recognition rather than for
human visual perception. Video coding for humans and video coding for machines are therefore
quite different, even though the two systems will coexist for a long time. In
this talk, I will introduce the history of VCM; list some early works on pattern analysis
in the compressed data domain; review efforts from the ISO/IEC MPEG group on MPEG-7 CDVS
(compact descriptors for visual search) and CDVA (compact descriptors for video analysis)
and ongoing projects in the AVS and MPEG working groups; present the key techniques
and challenges of VCM; and give an overview of its future.
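
As a rough illustration of the two trade-offs named in this abstract, both can be written as Lagrangian costs. The RDO form below is the standard one. The RAO form is only an assumed analogue, since the abstract does not define it: $A$ is a task accuracy in $[0, 1]$, and $\lambda$ is a hypothetical trade-off weight.

```latex
\begin{align}
J_{\mathrm{RDO}} &= D + \lambda R
  && \text{($D$: distortion judged for human viewing, $R$: bit rate)} \\
J_{\mathrm{RAO}} &= (1 - A) + \lambda R
  && \text{($A$: recognition accuracy judged by the machine task)}
\end{align}
```

In both cases the encoder would pick the coding option that minimizes $J$; only the quality term changes.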

Semantic Media Conversion: Possibilities and Limits

H. V. Jagadish

With recent amazing progress in machine intelligence, it is becoming increasingly
easy to "convert" information reliably from one medium to another. For example, there
is already a regular annual conference on "Text as Data". We will soon have similar
facilities for images, videos, music, and so on. Let's call this semantic media
conversion.

In this talk, I will outline some possibilities with high quality semantic media conversion.
In particular, it becomes possible to convert all media into alphanumeric data, nicely
organized in structured tables with limited loss of information. Multimedia data,
so converted, becomes easy to use, to aggregate, and to analyze, leading to new Data
Science opportunities.

But this ease of analysis also leads to questions of appropriateness.

We shouldn't necessarily do everything that we have the ability to do.
What are our values?
How do we apply them in practice?
What limits do we apply to semantic media conversion and the analysis enabled by it?

SESSION: Session 1: Deep Learning for Multimedia-I

Image Re-composition via Regional Content-Style Decoupling

Rong Zhang
Wei Li
Yiqun Zhang
Hong Zhang
Jinhui Yu
Ruigang Yang
Weiwei Xu

Typical image composition harmonizes regions from different images into a single plausible
image. We extend the idea of image composition by introducing the content-style decomposition
and combination to form the concept of image re-composition. In other words, our image
re-composition could arbitrarily combine those contents and styles decomposed from
different images to generate more diverse images in a unified framework. In the decomposition
stage, we incorporate the whitening normalization to obtain a more thorough content-style
decoupling, which substantially improves the re-composition results. Moreover, to
handle the variation of structure and texture of different objects in an image, we
design the network to support regional feature representation and achieve region-aware
content-style decomposition. Regarding the composition stage, we propose a cycle consistency
loss that constrains the network to preserve the content and style information during
the composition. Our method can produce diverse re-composition results, including
content-content, content-style and style-style. Our experimental results demonstrate
a large improvement over the current state-of-the-art methods.

Deep Clustering based on Bi-Space Association Learning

Hao Huang
Shinjae Yoo
Chenxiao Xu

Clustering is the task of instance grouping so that similar ones are grouped into
the same cluster, while dissimilar ones are in different clusters. However, such similarity
is a local concept in regard to different clusters and their relevant feature space.
This work aims to discover clusters by exploring feature association and instance
similarity concurrently. We propose a deep clustering framework that can localize
the search for relevant features appertaining to different clusters. In turn, this
allows for measuring instance similarity that exists in multiple, possibly overlapping,
feature subsets, which contributes to more accurate clustering of instances. Additionally,
the relevant features of each cluster make the clustering results interpretable.
Experiments on text and image datasets show that our method outperforms existing state-of-the-art
baselines.

Feature Stylization and Domain-aware Contrastive Learning for Domain Generalization

Seogkyu Jeon
Kibeom Hong
Pilhyeon Lee
Jewook Lee
Hyeran Byun

Domain generalization aims to enhance the model robustness against domain shift without
accessing the target domain. Since the available source domains for training are limited,
recent approaches focus on generating samples of novel domains. Nevertheless, they
either struggle with the optimization problem when synthesizing abundant domains or
cause the distortion of class semantics. To this end, we propose a novel domain
generalization framework where feature statistics are utilized for stylizing original
features to ones with novel domain properties. To preserve class information during
stylization, we first decompose features into high and low frequency components. Afterward,
we stylize the low frequency components with the novel domain styles sampled from
the manipulated statistics, while preserving the shape cues in high frequency ones.
As the final step, we re-merge both the components to synthesize novel domain features.
To enhance domain robustness, we utilize the stylized features to maintain the model
consistency in terms of features as well as outputs. We achieve the feature consistency
with the proposed domain-aware supervised contrastive loss, which ensures domain invariance
while increasing class discriminability. Experimental results demonstrate the effectiveness
of the proposed feature stylization and the domain-aware contrastive loss. Quantitative
comparisons on two benchmarks, PACS and Office-Home, show that our method outperforms
existing state-of-the-art methods.
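
The decompose, stylize, and re-merge steps described in this abstract can be sketched on a toy 1-D feature vector. This is a minimal sketch under stated assumptions, not the authors' implementation: the 3-tap moving average used as the low-pass filter, the 1-D list standing in for a feature map, and the names `low_pass` and `stylize` are all hypothetical. The target mean and standard deviation are passed in directly rather than sampled from manipulated statistics.

```python
import statistics


def low_pass(x):
    """3-tap moving average with edge replication (assumed low-pass filter)."""
    padded = [x[0]] + list(x) + [x[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(x))]


def stylize(x, new_mean, new_std, eps=1e-8):
    """Re-style only the low-frequency part of x, then merge the parts again."""
    low = low_pass(x)
    # High-frequency residual: kept untouched to preserve shape cues.
    high = [a - b for a, b in zip(x, low)]
    mu = statistics.fmean(low)
    sigma = statistics.pstdev(low)
    # Normalize the low-frequency part, then apply the new statistics.
    styled_low = [(v - mu) / (sigma + eps) * new_std + new_mean for v in low]
    return [l + h for l, h in zip(styled_low, high)]


x = [1.0, 4.0, 2.0, 8.0, 5.0]
low = low_pass(x)
mu, sigma = statistics.fmean(low), statistics.pstdev(low)
same = stylize(x, mu, sigma)            # reusing the original statistics
shifted = stylize(x, mu + 10.0, sigma)  # moving only the low-frequency mean
```

Reusing the original statistics reproduces the input up to rounding, while shifting the target mean offsets every element by the same amount and leaves the high-frequency residual unchanged.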

HDA-Net: Horizontal Deformable Attention Network for Stereo Matching

Qi Zhang
Xuesong Zhang
Baoping Li
Yuzhong Chen
Anlong Ming

Stereo matching is a fundamental and challenging task which has various applications
in autonomous driving, dense reconstruction and other depth related tasks. Contextual
information with discriminative features is crucial for accurate stereo matching in
the ill-posed regions (textureless, occlusion, etc.). In this paper, we propose an
efficient horizontal attention module to adaptively capture the global correspondence
clues. Compared with the popular non-local attention, our horizontal attention is
more effective for stereo matching with better performance and lower consumption of
computation and memory. We further introduce a deformable module to refine the contextual
information in the disparity discontinuous areas such as the boundary of objects.
A learning-based method is adopted to construct the cost volume by concatenating the
features of the two branches. To give the learning-based volume an explicit similarity measure
that guides it toward a more reasonable unimodal matching cost distribution, we additionally
combine it with an improved zero-centered group-wise correlation
volume. Finally, we regularize the 4D joint cost volume with a 3D CNN module and generate
the final output by disparity regression. The experimental results show that our proposed
HDA-Net achieves the state-of-the-art performance on the Scene Flow dataset and obtains
competitive performance on the KITTI datasets compared with the relevant networks.
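
The final disparity regression step can be sketched with the soft-argmin formulation that is common in learned stereo matching. The abstract does not spell out the exact form used in HDA-Net, so this is an assumption; the function name `disparity_regression` and the per-pixel list of costs are hypothetical.

```python
import math


def disparity_regression(costs):
    """Soft-argmin over one pixel's matching costs.

    costs[d] is the matching cost of disparity d (lower is better).
    Returns the expected disparity under softmax(-costs).
    """
    # Subtract the minimum cost for numerical stability before exponentiating.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min)) for c in costs]
    total = sum(weights)
    # Expected disparity: sum over d of d * p(d).
    return sum(d * w for d, w in enumerate(weights)) / total


# A sharply unimodal cost profile gives an estimate close to its minimum.
costs = [20.0, 20.0, 20.0, 0.0, 20.0, 20.0]
estimate = disparity_regression(costs)
```

This also suggests why a unimodal cost distribution matters: with two equally good candidates the expectation falls between them, which matches neither.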

MBRS: Enhancing Robustness of DNN-based Watermarking by Mini-Batch of Real and Simulated
JPEG Compression

Zhaoyang Jia
Han Fang
Weiming Zhang

Owing to the powerful feature extraction ability of deep learning architectures,
deep-learning-based watermarking algorithms have recently been widely studied. The basic framework
of such algorithms is an auto-encoder-like end-to-end architecture with an encoder,
a noise layer and a decoder. The key to guaranteeing robustness is adversarial training
with a differentiable noise layer. However, we found that none of the existing frameworks
can adequately ensure robustness against JPEG compression, which is non-differentiable
but is an essential image processing operation. To address this limitation,
we propose a novel end-to-end training architecture, which utilizes a Mini-Batch of
Real and Simulated JPEG compression (MBRS) to enhance JPEG robustness. Specifically,
for different mini-batches, we randomly choose one of real JPEG, simulated JPEG and
a noise-free layer as the noise layer. In addition, we suggest utilizing Squeeze-and-Excitation
blocks, which can learn better features in the embedding and extracting stages, and propose
a "message processor" to expand the message in a more appropriate way. Meanwhile, to
improve robustness against crop attacks, we introduce an additive diffusion block
into the network. Extensive experimental results demonstrate the superior
performance of the proposed scheme compared with state-of-the-art algorithms.
Under JPEG compression with quality factor $Q=50$, our models achieve a bit error
rate of less than 0.01% for extracted messages, with PSNR larger than 36 dB for the encoded
images, which shows well-enhanced robustness against JPEG attacks. Moreover, under
many other distortions such as Gaussian filter, crop, cropout and dropout, the proposed
framework also obtains strong robustness. The code, implemented in PyTorch, is available
at https://github.com/jzyustc/MBRS.
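
The per-mini-batch choice of noise layer can be sketched as follows. This is a minimal sketch and not the code from the repository above. The three layer functions are hypothetical stand-ins that only tag the batch, and uniform sampling is an assumption, since the abstract only says the layer is chosen randomly.

```python
import random


# Hypothetical stand-ins for the three noise layers named in the abstract.
# Real versions would transform image tensors; these only tag the batch.
def real_jpeg(batch):
    return ("real_jpeg", batch)


def simulated_jpeg(batch):
    return ("simulated_jpeg", batch)


def noise_free(batch):
    return ("noise_free", batch)


NOISE_LAYERS = [real_jpeg, simulated_jpeg, noise_free]


def apply_random_noise_layer(batch, rng):
    """Pick one noise layer uniformly at random for this mini-batch."""
    layer = rng.choice(NOISE_LAYERS)
    return layer(batch)


rng = random.Random(0)
used = [apply_random_noise_layer([0.1, 0.2], rng)[0] for _ in range(200)]
```

Over many mini-batches the decoder sees all three conditions, while each individual batch passes through only one of them.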

From Synthetic to Real: Image Dehazing Collaborating with Unlabeled Real Data

Ye Liu
Lei Zhu
Shunda Pei
Huazhu Fu
Jing Qin
Qing Zhang
Liang Wan
Wei Feng

Single image dehazing is a challenging task, for which the domain shift between synthetic
training data and real-world testing images usually leads to degradation of existing
methods. To address this issue, we propose a novel image dehazing framework collaborating
with unlabeled real data. First, we develop a disentangled image dehazing network
(DID-Net), which disentangles the feature representations into three component maps,
i.e. the latent haze-free image, the transmission map, and the global atmospheric
light estimate, respecting the physical model of a haze process. Our DID-Net predicts
the three component maps by progressively integrating features across scales, and
refines each map by passing it through an independent refinement network. Then a disentangled-consistency
mean-teacher network (DMT-Net) is employed to exploit unlabeled real data for
boosting single image dehazing. Specifically, we encourage the coarse predictions
and refinements of each disentangled component to be consistent between the student
and teacher networks by using a consistency loss on unlabeled real data. We compare our method
with 13 state-of-the-art dehazing methods on a newly collected dataset (Haze4K) and
two widely-used dehazing datasets (i.e., SOTS and HazeRD), as well as on real-world
hazy images. Experimental results demonstrate that our method achieves clear quantitative
and qualitative improvements over the existing methods.

SESSION: Session 2: Deep Learning for Multimedia-II

Video Semantic Segmentation via Sparse Temporal Transformer

Jiangtong Li
Wentao Wang
Junjie Chen
Li Niu
Jianlou Si
Chen Qian
Liqing Zhang

Currently, video semantic segmentation mainly faces two challenges: 1) the demand
for temporal consistency; 2) the balance between segmentation accuracy and inference
efficiency. For the first challenge, existing methods usually use optical flow to
capture the temporal relation in consecutive frames and maintain temporal consistency,
but the low inference speed of optical flow limits real-time applications.
For the second challenge, flow based key frame warping is one mainstream solution.
However, the unbalanced inference latency of flow-based key frame warping makes it
unsatisfactory for real-time applications. Considering the segmentation accuracy and
inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to bridge
temporal relation among video frames adaptively, which is also equipped with query
selection and key selection. The key selection and query selection strategies are
separately applied to filter out temporal and spatial redundancy in our temporal transformer.
Specifically, our STT can reduce the time complexity of temporal transformer by a
large margin without harming the segmentation accuracy and temporal consistency. Experiments
on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves
the state-of-the-art segmentation accuracy and temporal consistency with comparable
inference speed.

Diverse Image Inpainting with Bidirectional and Autoregressive Transformers

Yingchen Yu
Fangneng Zhan
Rongliang Wu
Jianxiong Pan
Kaiwen Cui
Shijian Lu
Feiying Ma
Xuansong Xie
Chunyan Miao

Image inpainting is an underdetermined inverse problem, which naturally allows diverse
contents to fill up the missing or corrupted regions realistically. Prevalent approaches
using convolutional neural networks (CNNs) can synthesize visually pleasant contents,
but CNNs suffer from limited receptive fields for capturing global features. With
image-level attention, transformers can model long-range dependencies and generate
diverse contents with autoregressive modeling of pixel-sequence distributions. However,
the unidirectional attention in autoregressive transformers is suboptimal as corrupted
image regions may have arbitrary shapes with contexts from any direction. We propose
BAT-Fill, an innovative image inpainting framework that introduces a novel bidirectional
autoregressive transformer (BAT) for image inpainting. BAT utilizes the transformers
to learn autoregressive distributions, which naturally allows the diverse generation
of missing contents. In addition, it incorporates a masked language model like BERT,
which enables bidirectional modeling of the contextual information of missing regions
for better image completion. Extensive experiments over multiple datasets show that
BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively
and quantitatively.

SSFlow: Style-guided Neural Spline Flows for Face Image Manipulation

Hanbang Liang
Xianxu Hou
Linlin Shen

Significant progress has been made in high-resolution and photo-realistic image generation
by Generative Adversarial Networks (GANs). However, the generation process still
lacks control, which is crucial for semantic face editing. Furthermore, it remains
challenging to edit target attributes and preserve the identity at the same time.
In this paper, we propose SSFlow to achieve identity-preserved semantic face manipulation
in StyleGAN latent space based on conditional Neural Spline Flows. To further improve
the performance of Neural Spline Flows on such tasks, we also propose a Constractive
Squash component and a Blockwise 1 x 1 Convolution layer. Moreover, unlike other conditional
flow-based approaches that require facial attribute labels during inference, our method
can achieve label-free manipulation in a more flexible way. As a result, our methods
are able to perform well-disentangled edits along various attributes, and generalize
well for both real and artistic face image manipulation. Qualitative and quantitative
evaluations show the advantages of our method for semantic face manipulation over
state-of-the-art approaches.

Constrained Graphic Layout Generation via Latent Optimization

Kotaro Kikuchi
Edgar Simo-Serra
Mayu Otani
Kota Yamaguchi

In graphic design, humans commonly arrange various elements visually according
to their design intent and semantics. For example, a title text almost always appears
on top of other elements in a document. In this work, we generate graphic layouts
that can flexibly incorporate such design semantics, either specified implicitly or
explicitly by a user. We optimize using the latent space of an off-the-shelf layout
generation model, allowing our approach to be complementary to and used with existing
layout generation models. Our approach builds on a generative layout model based on
a Transformer architecture, and formulates the layout generation as a constrained
optimization problem where design constraints are used for element alignment, overlap
avoidance, or any other user-specified relationship. We show in the experiments that
our approach is capable of generating realistic layouts in both constrained and unconstrained
generation tasks with a single model. The code is available at https://github.com/ktrk115/const_layout.

Transfer Vision Patterns for Multi-Task Pixel Learning

Xiaoya Zhang
Ling Zhou
Yong Li
Zhen Cui
Jin Xie
Jian Yang

Multi-task pixel perception is one of the most important topics in the field of machine
intelligence. Inspired by the observation of cross-task interdependencies of visual
patterns, we propose a multi-task vision pattern transformation (VPT) method to adaptively
correlate and transfer cross-task visual patterns by leveraging the powerful transformer
mechanism. Specifically, to better transfer visual patterns, we build two types of
pattern transformation based on the statistical prior that the affinity relations across
tasks are correlated. One aims to transfer feature patterns for the integration of
different task features; the other aims to exchange structure patterns for mining
and leveraging the latent interaction cues. These two types of transformations are
encapsulated into two VPT units, which provide universal matching interfaces for multi-task
learning, complement each other to guide the transmission of feature/structure patterns,
and finally realize an adaptive selection of important patterns across tasks. Extensive
experiments on the joint learning of semantic segmentation, depth prediction and surface
normal estimation demonstrate that our proposed method is more effective than the
baselines and achieves state-of-the-art performance in three pixel-level visual
tasks.

Object-aware Long-short-range Spatial Alignment for Few-Shot Fine-Grained Image Classification

Yike Wu
Bo Zhang
Gang Yu
Weixi Zhang
Bin Wang
Tao Chen
Jiayuan Fan

The goal of few-shot fine-grained image classification is to recognize rarely seen
fine-grained objects in the query set, given only a few samples of this class in the
support set. Previous works focus on learning discriminative image features from a
limited number of training samples for distinguishing various fine-grained classes,
but ignore one important fact: spatial alignment of the discriminative semantic
features between the query image with arbitrary changes and the support image is
also critical for computing the semantic similarity between each support-query pair.
In this work, we propose an object-aware long-short-range spatial alignment approach,
which is composed of a foreground object feature enhancement (FOE) module, a long-range
semantic correspondence (LSC) module and a short-range spatial manipulation (SSM)
module. The FOE is developed to weaken background disturbance and encourage higher
foreground object response. To address the problem of long-range object feature misalignment
between support-query image pairs, the LSC is proposed to learn the transferable long-range
semantic correspondence by a designed feature similarity metric. Further, the SSM
module is developed to refine the transformed support feature after the long-range
step to align short-range misaligned features (or local details) with the query features.
Extensive experiments have been conducted on four benchmark datasets, and the results
show superior performance over most state-of-the-art methods under both 1-shot and
5-shot classification scenarios.

SESSION: Session 3: Brave New Idea

Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN

Yunan Zhu
Haichuan Ma
Jialun Peng
Dong Liu
Zhiwei Xiong

Generative adversarial networks (GANs) have been extensively used for training networks
that perform image generation. After training, the discriminator in a GAN is no longer
used. We propose to recycle the trained discriminator for another use: no-reference
image quality assessment (NR-IQA). We are motivated by two facts. First, in Wasserstein
GAN (WGAN), the discriminator is designed to calculate the distance between the distribution
of generated images and that of real images; thus, the trained discriminator may encode
the distribution of real-world images. Second, NR-IQA often needs to leverage the
distribution of real-world images for assessing image quality. We then conjecture
that using the trained discriminator for NR-IQA may help get rid of any human-labeled
quality opinion scores and lead to a new opinion-unaware (OU) method. To validate
our conjecture, we start from a restricted NR-IQA problem, that is IQA for artificially
super-resolved images. We train super-resolution (SR) WGAN with two kinds of discriminators:
one is to directly evaluate the entire image, and the other is to work on small patches.
For the latter kind, we obtain patch-wise quality scores, and then have the flexibility
to fuse the scores, e.g., by weighted average. Moreover, we directly extend the trained
discriminators for authentically distorted images that have different kinds of distortions.
Our experimental results demonstrate that the proposed method is comparable to the
state-of-the-art OU NR-IQA methods on SR images and is even better than them on authentically
distorted images. Our method provides a more interpretable approach to NR-IQA. Our
code and models are available at https://github.com/YunanZhu/RecycleD.

Learning Kinematic Formulas from Multiple View Videos

Liangchen Song
Sheng Liu
Celong Liu
Zhong Li
Yuqi Ding
Yi Xu
Junsong Yuan

Given a set of multiple view videos that record the motion trajectory of an object,
we propose to find the object's kinematic formulas with neural rendering techniques.
For example, if the input multiple view videos record the free fall motion of an object
with different initial speed $v$, the network aims to learn its kinematics: $\Delta = vt - \frac{1}{2}gt^2$,
where $\Delta$, $g$ and $t$ are displacement, gravitational acceleration and time. To
achieve this goal, we design a novel framework consisting of a motion network and
a differentiable renderer. For the differentiable renderer, we employ Neural Radiance
Field (NeRF) since the geometry is implicitly modeled by querying coordinates in the
space. The motion network is composed of a series of blending functions and linear
weights, enabling us to analytically derive the kinematic formulas after training.
The proposed framework is trained end to end and only requires knowledge of cameras'
intrinsic and extrinsic parameters. To validate the proposed framework, we design
three experiments to demonstrate its effectiveness and extensibility. The first experiment
is the video of free fall and the framework can be easily combined with the principle
of parsimony, resulting in the correct free fall kinematics. The second experiment
is on the large angle pendulum which does not have analytical kinematics. We use the
differential equation controlling pendulum dynamics as a physical prior in the framework
and demonstrate that the convergence speed becomes much faster. Finally, we study
the explosion animation and demonstrate that our framework handles such black-box-generated
motions well.
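
The idea of reading a kinematic formula off linear weights on fixed basis functions can be sketched without any rendering. This is a minimal sketch under strong assumptions, not the proposed framework: displacements are observed directly instead of being recovered through NeRF, the basis is fixed to `t` and `t**2`, and the function name `fit_free_fall` is hypothetical.

```python
def fit_free_fall(times, displacements):
    """Least-squares fit of displacement = a*t + b*t**2.

    Solves the 2x2 normal equations for the linear weights (a, b) on the
    basis functions t and t**2, then reads off v = a and g = -2*b, matching
    the free-fall form displacement = v*t - 0.5*g*t**2.
    """
    s_t2 = sum(t ** 2 for t in times)
    s_t3 = sum(t ** 3 for t in times)
    s_t4 = sum(t ** 4 for t in times)
    s_ty = sum(t * y for t, y in zip(times, displacements))
    s_t2y = sum(t ** 2 * y for t, y in zip(times, displacements))
    det = s_t2 * s_t4 - s_t3 ** 2
    a = (s_ty * s_t4 - s_t2y * s_t3) / det
    b = (s_t2 * s_t2y - s_t3 * s_ty) / det
    return a, -2.0 * b


# Synthetic observations generated with v = 3.0 and g = 9.8 (illustrative values).
times = [0.1 * i for i in range(1, 11)]
observed = [3.0 * t - 0.5 * 9.8 * t ** 2 for t in times]
v_hat, g_hat = fit_free_fall(times, observed)
```

Because the model is linear in its weights, the fitted coefficients map directly back to physical quantities, which is the property that lets a formula be derived analytically after training.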

DEPA: Self-Supervised Audio Embedding for Depression Detection

Pingyue Zhang
Mengyue Wu
Heinrich Dinkel
Kai Yu

Depression detection research has increased over the last few decades, but limited data
availability and representation learning remain major bottlenecks. Recently, self-supervised
learning has seen success in pretraining text embeddings and has been applied broadly
on related tasks with sparse data, while pretrained audio embeddings based on self-supervised
learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained
depression audio embedding method for depression detection. An encoder-decoder network
is used to extract DEPA on in-domain depressed datasets (DAIC and MDD) and out-domain
(Switchboard, Alzheimer's) datasets. With DEPA as the audio embedding extracted at
response-level, a significant performance gain is achieved on downstream tasks, evaluated
on both sparse datasets like DAIC and a large major depression disorder dataset (MDD).
This paper not only presents a novel embedding extraction method that captures
response-level representations for depression detection but, more significantly, also
explores self-supervised learning for a specific task within audio processing.

Retinomorphic Sensing: A Novel Paradigm for Future Multimedia Computing

Zhaodong Kang
Jianing Li
Lin Zhu
Yonghong Tian

Conventional frame-based cameras for multimedia computing have encountered important
challenges in high-speed and extreme light scenarios. However, how to design a novel
paradigm for visual perception that overcomes the disadvantages of conventional cameras
still remains an open issue. In this paper, we propose a novel solution, namely retinomorphic
sensing, which integrates fovea-like and peripheral-like sampling mechanisms to generate
asynchronous visual streams using a unified representation as the retina does. Technically,
our encoder incorporates an interaction controller to switch flexibly between dynamic
and static sensing. Then, the decoder effectively extracts dynamic events for machine
vision and reconstructs visual textures for human vision. The results show that our
strategy can sense dynamic events and visual textures while reducing data
redundancy. We further build a prototype hybrid camera system to verify this strategy
on vision tasks such as image reconstruction and object detection. We believe that
this novel paradigm will provide insight into future multimedia computing. The code
is available at https://github.com/acmmm2021-bni-retinomorphic/retinomorphic-sensing.

Metaverse for Social Good: A University Campus Prototype

Haihan Duan
Jiaye Li
Sizheng Fan
Zhonghao Lin
Xiao Wu
Wei Cai

In recent years, the metaverse has attracted enormous attention from around the world
with the development of related technologies. The expected metaverse should be a realistic
society with more direct and physical interactions, while the concepts of race, gender,
and even physical disability would be weakened, which would be highly beneficial for
society. However, the development of the metaverse is still in its infancy, with great
potential for improvement. Given the metaverse's huge potential, industry has already
come forward with advance preparation, accompanied by feverish investment, but there
are few discussions about the metaverse in academia to scientifically guide its development.
In this paper, we highlight the representative applications for social good. Then
we propose a three-layer metaverse architecture from a macro perspective, containing
infrastructure, interaction, and ecosystem. Moreover, we journey toward both a historical
and novel metaverse with a detailed timeline and table of specific attributes. Lastly,
we illustrate our implemented blockchain-driven metaverse prototype of a university
campus and discuss the prototype design and insights.

SESSION: Session 4: Deep Learning for Multimedia-III

Enhanced Invertible Encoding for Learned Image Compression

Yueqi Xie
Ka Leong Cheng
Qifeng Chen

Although deep learning based image compression methods have achieved promising progress
these days, the performance of these methods still cannot match the latest compression
standard Versatile Video Coding (VVC). Most of the recent developments focus on designing
a more accurate and flexible entropy model that can better parameterize the distributions
of the latent features. However, few efforts are devoted to structuring a better transformation
between the image space and the latent feature space. In this paper, instead of employing
previous autoencoder style networks to build this transformation, we propose an enhanced
Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate
the information loss problem for better compression. Experimental results on the Kodak,
CLIC, and Tecnick datasets show that our method outperforms the existing learned image
compression methods and compression standards, including VVC (VTM 12.1), especially
for high-resolution images. Our source code is available at https://github.com/xyq7/InvCompress.
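
Why an invertible transform avoids information loss can be sketched with an additive coupling layer, a standard building block of invertible neural networks. This is only an illustration and not the architecture in the repository above. The `shift` function is a hypothetical stand-in for a learned sub-network, and the input is assumed to have even length.

```python
def shift(x1):
    """Hypothetical stand-in for a learned sub-network; any function works."""
    return [3.0 * v * v + 1.0 for v in x1]


def coupling_forward(x):
    """Keep the first half and shift the second half by a function of the first."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y2 = [b + s for b, s in zip(x2, shift(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Undo coupling_forward by subtracting the same shift."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    x2 = [b - s for b, s in zip(y2, shift(y1))]
    return y1 + x2


x = [1.0, 2.0, 0.5, -4.0]
y = coupling_forward(x)
x_back = coupling_inverse(y)
```

The inverse never has to invert `shift` itself, so the sub-network can be arbitrarily complex while the overall mapping stays exactly reversible, unlike a typical autoencoder, whose decoder only approximates the inverse of its encoder.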

DC-GNet: Deep Mesh Relation Capturing Graph Convolution Network for 3D Human Shape
Reconstruction

Shihao Zhou
Mengxi Jiang
Shanshan Cai
Yunqi Lei

In this paper, we aim to reconstruct a full 3D human shape from a single image. Previous
vertex-level and parameter regression approaches reconstruct 3D human shape based
on a pre-defined adjacency matrix to encode positive relations between nodes. The
deep topological relations for the surface of the 3D human body are not carefully
exploited. Moreover, the performance of most existing approaches often suffers from
the domain gap when handling more occlusion cases in real-world scenes. In this work,
we propose a Deep Mesh Relation Capturing Graph Convolution Network, DC-GNet, with
a shape completion task for 3D human shape reconstruction. Firstly, we propose to
capture deep relations within mesh vertices, where an adaptive matrix encoding both
positive and negative relations is introduced. Secondly, we propose a shape completion
task to learn priors about various kinds of occlusion cases. Our approach encodes mesh
structure from more subtle relations between nodes in a more distant region. Furthermore,
our shape completion module alleviates the performance degradation issue in the outdoor
scene. Extensive experiments on several benchmarks show that our approach outperforms
the previous 3D human pose and shape estimation approaches.

Deep Marginal Fisher Analysis based CNN for Image Representation and Classification

Xun Cai
Jiajing Chai
Yanbo Gao
Shuai Li
Bo Zhu

Deep Convolutional Neural Networks (CNNs) have achieved great success in image classification.
While conventional CNNs optimized with iterative gradient descent algorithms with
large data have been widely used and investigated, there is also research focusing
on learning CNNs with non-iterative optimization methods such as the principal component
analysis network (PCANet). PCANet is very simple and efficient, yet achieves competitive
performance on some image classification tasks, especially those with only a small
amount of data available. This paper further extends this line of research and proposes
a deep Marginal Fisher Analysis (MFA) based CNN, termed DMNet. It addresses the
limitation of PCANet-like CNNs when the samples do not follow a Gaussian distribution,
by using a local MFA for CNN filter optimization. It uses a graph embedding framework
for convolution filter optimization by maximizing the inter-class discriminability
among marginal points while minimizing intra-class distance. Cascaded MFA convolution
layers can be used to construct a deep network. Moreover, a binary stochastic hashing
is developed by randomly selecting features with a probability based on the importance
of feature maps for binary hashing. Experimental results demonstrate that the proposed
method achieves state-of-the-art results among non-iteratively optimized CNN methods, and
ablation studies have been conducted to verify the effectiveness of the proposed modules
in our DMNet.

Learning Structure Affinity for Video Depth Estimation

Yuanzhouhan Cao
Yidong Li
Haokui Zhang
Chao Ren
Yifan Liu

Depth estimation is a structure learning problem. The affinity among neighbouring
pixels plays an important role in inferring depth values. In this paper, we propose
to learn structure affinity in both the spatial and temporal domains for accurate depth
estimation from monocular videos. Specifically, we first propose a convolutional spatial
temporal propagation network (CSTPN) that learns affinity among neighbouring video
frames. Secondly, we employ a structure knowledge distillation scheme that transfers
the spatial temporal affinity learned by a cumbersome network to a compact network. By
calculating pixel-wise similarities between neighbouring frames and neighbouring sequences,
our knowledge distillation scheme efficiently captures both short-term and long-term
spatial temporal affinity. Finally, we apply a warping loss based on optical flow
between video frames to further enforce the temporal affinity. Experimental results
show that our proposed depth estimation approach outperforms the state-of-the-art methods
on both indoor and outdoor benchmark datasets.

X-GGM: Graph Generative Modeling for Out-of-distribution Generalization in Visual
Question Answering

Jingjing Jiang
Ziyi Liu
Yifan Liu
Zhixiong Nan
Nanning Zheng

Encouraging progress has been made towards Visual Question Answering (VQA) in recent
years, but it is still challenging to enable VQA models to adaptively generalize to
out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual
concepts (i.e., attributes and objects) can generate compositions unseen in the training
set, which helps VQA models generalize to OOD samples. In this paper, we
formulate OOD generalization in VQA as a compositional generalization problem and
propose a graph generative modeling-based training scheme (X-GGM) to handle the problem
implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation
matrix and node representations for the predefined graph that utilizes attribute-object
pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative
modeling, we propose a gradient distribution consistency loss to constrain the data
distribution with adversarial perturbations and the generated distribution. The baseline
VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance
on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation
studies demonstrate the effectiveness of X-GGM components.

DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

Aichun Zhu
Zijie Wang
Yifeng Li
Xili Wan
Jing Jin
Tian Wang
Fangqiang Hu
Gang Hua

Many previous methods on text-based person retrieval tasks are devoted to learning
a latent common space mapping, with the purpose of extracting modality-invariant features
from both visual and textual modalities. Nevertheless, due to the complexity of high-dimensional
data, the unconstrained mapping paradigms are not able to properly catch discriminative
clues about the corresponding person while dropping the misaligned information. Intuitively,
the information contained in visual data can be divided into person information (PI)
and surroundings information (SI), which are mutually exclusive. To
this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model
in this paper to effectively extract and match person information, and hence achieve
a superior retrieval accuracy. A surroundings-person separation and fusion mechanism
plays the key role in realizing an accurate and effective surroundings-person separation
under a mutual exclusion constraint. In order to adequately utilize multi-modal
and multi-granular information for a higher retrieval accuracy, five diverse alignment
paradigms are adopted. Extensive experiments are carried out to evaluate the proposed
DSSL on CUHK-PEDES, which is currently the only accessible dataset for the text-based person
retrieval task. DSSL achieves state-of-the-art performance on CUHK-PEDES. To properly
evaluate our proposed DSSL in real scenarios, a Real Scenarios Text-based Person
Reidentification (RSTPReid) dataset is constructed to benefit future research on text-based
person retrieval, which will be publicly available.

SESSION: Session 5: Emerging Multimedia Applications-I

Diverse Multimedia Layout Generation with Multi Choice Learning

David D. Nguyen
Surya Nepal
Salil S. Kanhere

Designing visually appealing layouts for multimedia documents containing text, graphs
and images requires a form of creative intelligence. Modelling the generation of layouts
has recently gained attention due to its importance in aesthetics and communication
style. In contrast to standard prediction tasks, there is a range of acceptable layouts
which depend on user preferences. For example, a poster designer may prefer logos
on the top-left while another prefers logos on the bottom-right. Both are correct
choices yet existing machine learning models treat layouts as a single choice prediction
problem. In such situations, these models would simply average over all possible choices
given the same input forming a degenerate sample. In the above example, this would
form an unacceptable layout with a logo in the centre.

In this paper, we present an auto-regressive neural network architecture, called LayoutMCL,
that uses multi-choice prediction and winner-takes-all loss to effectively stabilise
layout generation. LayoutMCL avoids the averaging problem by using multiple predictors
to learn a range of possible options for each layout object. This enables LayoutMCL
to generate multiple and diverse layouts from a single input which is in contrast
with existing approaches which yield similar layouts with minor variations. Through
quantitative benchmarks on real data (magazine, document and mobile app layouts),
we demonstrate that LayoutMCL reduces Fréchet Inception Distance (FID) by 83-98% and
generates significantly more diversity in comparison to existing approaches.
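
The averaging problem and the winner-takes-all remedy described above can be illustrated with a toy sketch. This is not the LayoutMCL implementation: the 1-D logo coordinate, the squared-error loss, and the function names are assumptions chosen only for illustration.

```python
# Hypothetical toy setting: a layout is reduced to one number, the logo's
# x-coordinate in [0, 1]. Two layouts are equally acceptable: left or right.

def squared_error(pred, target):
    return (pred - target) ** 2

def single_choice_loss(pred, targets):
    """Mean squared error of ONE prediction against all acceptable targets."""
    return sum(squared_error(pred, t) for t in targets) / len(targets)

def winner_takes_all_loss(preds, target):
    """Only the closest of the K predictions is penalised for a given target."""
    return min(squared_error(p, target) for p in preds)

targets = [0.0, 1.0]  # logo on the left, logo on the right

# A single predictor minimises its loss by averaging: the logo lands in the centre.
candidates = [i / 100 for i in range(101)]
best_single = min(candidates, key=lambda p: single_choice_loss(p, targets))

# Two predictors trained with a winner-takes-all loss can each specialise.
preds = [0.0, 1.0]
wta_total = sum(winner_takes_all_loss(preds, t) for t in targets)

print(best_single)  # 0.5
print(wta_total)    # 0.0
```

Under these assumptions the single predictor settles on the degenerate centre position, while the pair of predictors covers both valid layouts with zero loss.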

Viewing from Frequency Domain: A DCT-based Information Enhancement Network for Video Person Re-Identification

Liangchen Liu
Xi Yang
Nannan Wang
Xinbo Gao

Video-based person re-identification (Re-ID) aims to match the target pedestrians
under a non-overlapping camera system using video tracklets. The key issue of video Re-ID
focuses on exploring effective spatio-temporal features. Generally, the spatio-temporal
information of a video sequence can be divided into two aspects: the discriminative
information in each frame and the shared information over the whole sequence. To make
full use of the rich information in video sequences, this paper proposes a Discrete
Cosine Transform based Information Enhancement Network (DCT-IEN) to achieve more comprehensive
spatio-temporal representation from the frequency domain. Inspired by the principle that
average pooling is one of the special frequency components in DCT (the lowest frequency
component), DCT-IEN first adopts discrete cosine transform to convert the extracted
feature maps into the frequency domain, thereby retaining more information that is embedded
in different frequency components. With the help of the DCT frequency spectrum, two branches
are adopted to learn the final video representation: Frequency Selection Module (FSM)
and Lowest Frequency Enhancement Module (LFEM). FSM explores the most discriminative
features in each frame by aggregating different frequency components with attention
mechanism. LFEM enhances the shared feature over the whole video sequence by frame
feature regularization. By fusing these two kinds of features together, DCT-IEN finally
achieves comprehensive video representation. We conduct extensive experiments on two
widely used datasets. The experimental results verify our idea and demonstrate the
effectiveness of DCT-IEN for video-based Re-ID.
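
The principle that average pooling corresponds to the lowest DCT frequency component can be checked numerically. The sketch below uses an unnormalised 1-D DCT-II on a made-up feature vector; it is only an illustration of the identity, not the DCT-IEN code, and the feature values are hypothetical.

```python
import math

def dct2(x):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_total = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_total * (n + 0.5) * k) for n in range(n_total))
        for k in range(n_total)
    ]

features = [0.2, 1.5, -0.7, 3.0]  # hypothetical 1-D feature map
coeffs = dct2(features)
average_pool = sum(features) / len(features)

# For k = 0 the cosine basis is constant (cos 0 = 1), so X[0] is the plain sum
# of the inputs; dividing by N gives exactly the average-pooled value.
print(math.isclose(coeffs[0] / len(features), average_pool))  # True

# The remaining coefficients carry information that average pooling discards.
print(any(abs(c) > 1e-9 for c in coeffs[1:]))  # True
```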

Unsupervised Portrait Shadow Removal via Generative Priors

Yingqing He
Yazhou Xing
Tianjia Zhang
Qifeng Chen

Portrait images often suffer from undesirable shadows cast by casual objects or even
the face itself. While existing methods for portrait shadow removal require training
on a large-scale synthetic dataset, we propose the first unsupervised method for portrait
shadow removal without any training data. Our key idea is to leverage the generative
facial priors embedded in the off-the-shelf pretrained StyleGAN2. To achieve this,
we formulate the shadow removal task as a layer decomposition problem: a shadowed
portrait image is constructed by the blending of a shadow image and a shadow-free
image. We propose an effective progressive optimization algorithm to learn the decomposition
process. Our approach can also be extended to portrait tattoo removal and watermark
removal. Qualitative and quantitative experiments on a real-world portrait shadow
dataset demonstrate that our approach achieves comparable performance with supervised
shadow removal methods. Our source code is available at https://github.com/YingqingHe/Shadow-Removal-via-Generative-Priors.

Multimodal Global Relation Knowledge Distillation for Egocentric Action Anticipation

Yi Huang
Xiaoshan Yang
Changsheng Xu

In this paper, we consider the task of action anticipation on egocentric videos. Previous
methods ignore explicit modeling of the global context relation among past and future
actions, which is not an easy task due to the vacancy of unobserved videos. To solve
this problem, we propose a Multimodal Global Relation Knowledge Distillation (MGRKD)
framework to distill the knowledge learned from full videos to improve the action
anticipation task on partially observed videos. The proposed MGRKD has a teacher-student
learning strategy, where both the teacher and the student models have three branches of
global relation graph networks (GRGN) to explore the pairwise relations between past
and future actions based on three kinds of features (i.e., RGB, motion and object).
The teacher model has a similar architecture to the student model, except that the
teacher model uses the true feature of the future video snippet to build the graph in
GRGN while the student model uses a progressive GRU to predict an initialized node
feature of the future snippet in GRGN. Through the teacher-student learning strategy,
the discriminative features and relation knowledge of the past and future actions
learned in the teacher model can be distilled to the student model. The experiments
on two egocentric video datasets EPIC-Kitchens and EGTEA Gaze+ show that the proposed
framework achieves state-of-the-art performance.

Exploring Pathologist Knowledge for Automatic Assessment of Breast Cancer Metastases
in Whole-slide Image

Liuan Wang
Li Sun
Mingjie Zhang
Huigang Zhang
Wang Ping
Rong Zhou
Jun Sun

Automatic assessment of breast cancer metastases plays an important role in helping pathologists
reduce the time-consuming work in histopathological whole-slide image diagnosis. From
the perspective of knowledge utilization, pathologists carefully check the low-magnification and
high-magnification levels for tumor patterns and tumor cell characteristics.
In this paper, we propose a novel automatic patient-level tumor segmentation and classification
method, which makes full use of the diagnosis knowledge clues from pathologists. For
tumor segmentation, a multi-level view DeepLabV3+ (MLV-DeepLabV3+) is designed to
explore the distinguishing features of cell characteristics between tumor and normal
tissue. Furthermore, the expert segmentation models are selected and integrated by
Pareto-front optimization to imitate expert consultation and reach a perfect diagnosis.
For whole-slide classification, multi-level magnifications are adaptively checked to
focus on the effective features at different magnifications. The experimental results
demonstrate that our pathologist knowledge-based automatic assessment of whole-slide
images is effective and robust on the public benchmark dataset.

Towards Multiple Black-boxes Attack via Adversarial Example Generation Network

Duan Mingxing
Kenli Li
Lingxi Xie
Qi Tian
Bin Xiao

The current research on adversarial attacks aims at a single model while the research
on attacking multiple models simultaneously is still challenging. In this paper, we
propose a novel black-box attack method, referred to as MBbA, which can attack multiple
black-boxes at the same time. By encoding the input image and its target category into
an associated space, each decoder seeks the appropriate attack areas from the image
through the designed loss functions, and then generates effective adversarial examples.
This process realizes end-to-end adversarial example generation without involving
substitute models for the black-box scenario. On the other hand, adopting the adversarial
examples generated by MBbA for adversarial training, the robustness of the attacked
models is greatly improved. More importantly, those adversarial examples can achieve
satisfactory attack performance, even if these black-box models are trained with the
adversarial examples generated by other black-box attack methods, which shows good
transferability. Finally, extensive experiments show that compared with other state-of-the-art
methods: (1) MBbA takes the least time to obtain the most effective attack effects
in the multi-black-box attack scenario. Furthermore, MBbA achieves the highest attack
success rates in a single black-box attack scenario; (2) the adversarial examples
generated by MBbA can effectively improve the robustness of the attacked models and
exhibit good transferability.

SESSION: Session 6: Emerging Multimedia Applications-II

DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction

Hao Feng
Yuechen Wang
Wengang Zhou
Jiajun Deng
Houqiang Li

In this work, we propose a new framework, called Document Image Transformer (DocTr),
to address the issue of geometry and illumination distortion of the document images.
Specifically, DocTr consists of a geometric unwarping transformer and an illumination
correction transformer. By setting a set of learned query embeddings, the geometric
unwarping transformer captures the global context of the document image by self-attention
mechanism and decodes the pixel-wise displacement solution to correct the geometric
distortion. After geometric unwarping, our illumination correction transformer further
removes the shading artifacts to improve the visual quality and OCR accuracy. Extensive
evaluations are conducted on several datasets, and superior results are reported against
the state-of-the-art methods. Remarkably, our DocTr achieves 20.02% Character Error
Rate (CER), a 15% absolute improvement over the state-of-the-art methods. Moreover,
it also shows high efficiency on running time and parameter count.
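
For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance between the OCR output and the reference text, divided by the reference length. The sketch below follows that common definition; the exact evaluation protocol used for DocTr is not stated in the abstract, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference string."""
    return edit_distance(ref, hyp) / len(ref)

reference = "multimedia"
ocr_output = "mu1timedla"  # hypothetical OCR result with two wrong characters
cer = character_error_rate(reference, ocr_output)
print(cer)  # 0.2
```

An absolute improvement is a difference in percentage points between two such rates, as opposed to a relative reduction.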

Self-supervised Multi-view Multi-Human Association and Tracking

Yiyang Gan
Ruize Han
Liqiang Yin
Wei Feng
Song Wang

Multi-view Multi-human association and tracking (MvMHAT) aims to track a group of
people over time in each view, as well as to identify the same person across different
views at the same time. This is a relatively new problem but is very important for
multi-person scene video surveillance. Different from previous multiple object tracking
(MOT) and multi-target multi-camera tracking (MTMCT) tasks, which only consider the
over-time human association, MvMHAT requires jointly achieving both cross-view and
over-time data association. In this paper, we model this problem with a self-supervised
learning framework and leverage an end-to-end network to tackle it. Specifically,
we propose a spatial-temporal association network with two designed self-supervised
learning losses, including a symmetric-similarity loss and a transitive-similarity
loss, at each time to associate the multiple humans over time and across views. Besides,
to promote the research on MvMHAT, we build a new large-scale benchmark for the training
and testing of different algorithms. Extensive experiments on the proposed benchmark
verify the effectiveness of our method. We have released the benchmark and code to
the public.

Learning Fine-Grained Motion Embedding for Landscape Animation

Hongwei Xue
Bei Liu
Huan Yang
Jianlong Fu
Houqiang Li
Jiebo Luo

In this paper we focus on landscape animation, which aims to generate time-lapse videos
from a single landscape image. Motion is crucial for landscape animation as it determines
how objects move in videos. Existing methods are able to generate appealing videos
by learning motion from real time-lapse videos. However, current methods suffer from
inaccurate motion generation, which leads to unrealistic video results. To tackle
this problem, we propose a model named FGLA to generate high-quality and realistic
videos by learning Fine-Grained motion embedding for Landscape Animation. Our model
consists of two parts: (1) a motion encoder which embeds time-lapse motion in a fine-grained
way; and (2) a motion generator which generates realistic motion to animate input images.
To train and evaluate on diverse time-lapse videos, we build the largest high-resolution
Time-lapse video dataset with Diverse scenes, namely Time-lapse-D, which includes
16,874 video clips with over 10 million frames. Quantitative and qualitative experimental
results demonstrate the superiority of our method. In particular, our method achieves
relative improvements of 19% on LPIPS and 5.6% on FVD compared with state-of-the-art
methods on our dataset. A user study carried out with 700 human subjects shows that
our approach visually outperforms existing methods by a large margin.

Multi-label Pattern Image Retrieval via Attention Mechanism Driven Graph Convolutional
Network

Ying Li
Hongwei Zhou
Yeyu Yin
Jiaquan Gao

Pattern images are artificially designed images which are discriminative in aspects
of elements, styles, arrangements and so on. Pattern images are widely used in fields
like textile, clothing, art, fashion and graphic design. With the growth of image
numbers, pattern image retrieval has great potential in commercial applications and
industrial production. However, most existing content-based image retrieval works
mainly focus on describing simple attributes with clear conceptual boundaries, which
are not suitable for pattern image retrieval. It is difficult to accurately represent
and retrieve pattern images which include complex details and multiple elements. Therefore,
in this paper, we collect a new pattern image dataset with multiple labels per image
for the pattern image retrieval task. To extract discriminative semantic features
of multi-label pattern images and construct high-level topology relationships between
features, we further propose an Attention Mechanism Driven Graph Convolutional Network
(AMD-GCN). Different layers of the multi-semantic attention module activate regions
of interest corresponding to multiple labels, respectively. By embedding the learned
labels from attention module into the graph convolutional network, which can capture
the dependency of labels on the graph manifold, the AMD-GCN builds an end-to-end framework
to extract high-level semantic features with label semantics and inner relationships
for retrieval. Experiments on the pattern image dataset show that the proposed method
highlights the relevant semantic regions of multiple labels, and achieves higher accuracy
than state-of-the-art image retrieval methods.

Collocation and Try-on Network: Whether an Outfit is Compatible

Na Zheng
Xuemeng Song
Qingying Niu
Xue Dong
Yibing Zhan
Liqiang Nie

Is an outfit compatible? Using machine learning methods to assess an outfit's
compatibility, namely, fashion compatibility modeling (FCM), has recently become a
popular yet challenging topic. However, current FCM studies still perform far from
satisfactorily, because they only consider the collocation compatibility modeling, while
neglecting the natural human habit of evaluating outfit compatibility
from both the collocation (discrete assessment) and the try-on (unified assessment) perspectives.
In light of the above analysis, we propose a Collocation and Try-On Network (CTO-Net)
for FCM, combining both the collocation and try-on compatibilities. In particular,
for the collocation perspective, we devise a disentangled graph learning scheme, where
the collocation compatibility is disentangled into multiple fine-grained compatibilities
between items; regarding the try-on perspective, we propose an integrated distillation
learning scheme to unify all item information in the whole outfit to evaluate the
compatibility based on the latent try-on representation. To further enhance the collocation
and try-on compatibilities, we exploit the mutual learning strategy to obtain a more
comprehensive judgment. Extensive experiments on the real-world dataset demonstrate
that our CTO-Net significantly outperforms the state-of-the-art methods. In particular,
compared with the competitive counterparts, our proposed CTO-Net significantly improves
AUC accuracy from 83.2% to 87.8% and MRR from 15.4% to 21.8%. We have released our
source code and trained models to benefit other researchers.

MeronymNet: A Hierarchical Model for Unified and Controllable Multi-Category Object Generation

Rishabh Baghel
Abhishek Trivedi
Tejas Ravichandran
Ravi Kiran Sarvadevabhatla

We introduce MeronymNet, a novel hierarchical approach for controllable, part-based
generation of multi-category objects using a single unified model. We adopt a guided
coarse-to-fine strategy involving semantically conditioned generation of bounding
box layouts, pixel-level part layouts and ultimately, the object depictions themselves.
We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed
Conditional Variational Autoencoders to enable flexible, diverse and category-aware
generation of 2-D objects in a controlled manner. The performance scores for generated
objects reflect MeronymNet's superior performance compared to multiple strong baselines
and ablative variants. We also showcase MeronymNet's suitability for controllable
object generation and interactive object editing at various levels of structural and
semantic granularity.

SESSION: Session 7: Emerging Multimedia Applications-III

Ada-VSR: Adaptive Video Super-Resolution with Meta-Learning

Akash Gupta
Padmaja Jonnalagedda
Bir Bhanu
Amit K. Roy-Chowdhury

Most of the existing works in supervised spatio-temporal video super-resolution (STVSR)
heavily rely on a large-scale external dataset consisting of paired low-resolution
low-frame-rate (LR-LFR) and high-resolution high-frame-rate (HR-HFR) videos. Despite
their remarkable performance, these methods make a prior assumption that the low-resolution
video is obtained by down-scaling the high-resolution video using a known degradation
kernel, which does not hold in practical settings. Another problem with these methods
is that they cannot exploit instance-specific internal information of a video at testing
time. Recently, deep internal learning approaches have gained attention due to their
ability to utilize the instance-specific statistics of a video. However, these methods
have a large inference time as they require thousands of gradient updates to learn
the intrinsic structure of the data. In this work, we present Adaptive Video Super-Resolution
(Ada-VSR) which leverages external, as well as internal, information through meta-transfer
learning and internal learning, respectively. Specifically, meta-learning is employed
to obtain adaptive parameters, using a large-scale external dataset, that can adapt
quickly to the novel condition (degradation model) of the given test video during
the internal learning task, thereby exploiting external and internal information of
a video for super-resolution. The model trained using our approach can quickly adapt
to a specific video condition with only a few gradient updates, which reduces the
inference time significantly. Extensive experiments on standard datasets demonstrate
that our method performs favorably against various state-of-the-art approaches.

CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation

Minha Kim
Shahroz Tariq
Simon S. Woo

Over the last few decades, artificial intelligence research has made tremendous strides,
but it still heavily relies on fixed datasets in stationary environments. Continual
learning is a growing field of research that examines how AI systems can learn sequentially
from a continuous stream of linked data in the same way that biological systems do.
Simultaneously, fake media such as deepfakes and synthetic face images have emerged
as significant to current multimedia technologies. Recently, numerous methods have been
proposed that can detect deepfakes with high accuracy. However, they suffer significantly
due to their reliance on fixed datasets in limited evaluation settings. Therefore,
in this work, we apply continual learning to neural networks' learning dynamics,
emphasizing its potential to increase data efficiency significantly. We propose a Continual
Representation using Distillation (CoReD) method that employs the concepts of Continual
Learning (CL), Representation Learning (RL), and Knowledge Distillation (KD). We design
CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated
synthetic face datasets, while effectively minimizing the catastrophic forgetting
in a teacher-student model setting. Our extensive experimental results demonstrate
that our method is efficient at domain adaptation to detect low-quality deepfake
videos and GAN-generated images from several datasets, outperforming the state-of-the-art
baseline methods.

SRNet: Spatial Relation Network for Efficient Single-stage Instance Segmentation in Videos

Xiaowen Ying
Xin Li
Mooi Choo Chuah

The task of instance segmentation in videos aims to consistently identify objects
at pixel level throughout the entire video sequence. Existing state-of-the-art methods
either follow the tracking-by-detection paradigm to employ multi-stage pipelines or
directly train a complex deep model to process the entire video clips as 3D volumes.
However, these methods are typically slow and resource-consuming such that they are
often limited to offline processing. In this paper, we propose SRNet, a simple and
efficient framework for joint segmentation and tracking of object instances in videos.
The key to achieving both high efficiency and accuracy in our framework is to formulate
the instance segmentation and tracking problem into a unified spatial-relation learning
task where each pixel in the current frame relates to its object center, and each
object center relates to its location in the previous frame. This unified learning
framework allows our method to perform joint instance segmentation and tracking
through a single stage while maintaining low overheads among different learning tasks.
Our proposed framework can handle two different task settings and demonstrates comparable
performance with state-of-the-art methods on two different benchmarks while running
significantly faster.
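
The two spatial relations named above (pixel to object center, and object center to its previous-frame location) can be sketched with plain nearest-neighbour grouping. Everything in this example is a hypothetical stand-in: the coordinates, the offset values, and the nearest-center assignment rule are assumptions, not outputs or details of SRNet.

```python
def nearest(point, candidates):
    """Return the key of the candidate closest to `point` (squared Euclidean)."""
    return min(
        candidates,
        key=lambda k: (candidates[k][0] - point[0]) ** 2
        + (candidates[k][1] - point[1]) ** 2,
    )

# Hypothetical predictions for the current frame.
centers = {"a": (2.0, 2.0), "b": (8.0, 8.0)}           # detected object centers
pixel_offsets = {                                       # pixel -> offset to its center
    (1, 2): (1.1, 0.0),
    (3, 3): (-0.9, -1.2),
    (7, 9): (0.8, -1.0),
}
center_offsets = {"a": (-1.0, 0.0), "b": (0.5, 0.5)}    # center -> offset to previous frame
previous_tracks = {1: (1.0, 2.0), 2: (8.5, 8.5)}        # track id -> previous center

# Segmentation: each pixel points at a location and joins the nearest center.
instance_of = {
    pixel: nearest((pixel[0] + off[0], pixel[1] + off[1]), centers)
    for pixel, off in pixel_offsets.items()
}

# Tracking: each center points back in time and inherits the nearest track id.
track_of = {
    name: nearest(
        (c[0] + center_offsets[name][0], c[1] + center_offsets[name][1]),
        previous_tracks,
    )
    for name, c in centers.items()
}

print(instance_of)  # {(1, 2): 'a', (3, 3): 'a', (7, 9): 'b'}
print(track_of)     # {'a': 1, 'b': 2}
```

Both steps reduce to the same offset-then-match operation, which is one way to read the claim that segmentation and tracking share a single spatial-relation formulation.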

Personality Recognition by Modelling Person-specific Cognitive Processes using Graph
Representation

Zilong Shao
Siyang Song
Shashank Jaiswal
Linlin Shen
Michel Valstar
Hatice Gunes

Recent research shows that in dyadic and group interactions individuals' nonverbal
behaviours are influenced by the behaviours of their conversational partner(s). Therefore,
in this work we hypothesise that during a dyadic interaction, the target subject's
facial reactions are driven by two main factors: (i) their internal (person-specific)
cognition, and (ii) the externalised nonverbal behaviours of their conversational
partner. Subsequently, our novel proposition is to simulate and represent the target
subject's (i.e., the listener) cognitive process in the form of a person-specific
CNN architecture whose input is the audio-visual non-verbal cues displayed by the
conversational partner (i.e., the speaker), and the output is the target subject's
(i.e., the listener) facial reactions. We then undertake a search for the optimal
CNN architecture whose results are used to create a person-specific graph representation
for recognising the target subject's personality. The graph representation, fortified
with a novel end-to-end edge feature learning strategy, helps with retaining both
the unique parameters of the person-specific CNN and the geometrical relationship
between its layers. Consequently, the proposed approach is the first work that aims
to recognize the true (self-reported) personality of a target subject (i.e., the listener)
from the learned simulation of their cognitive process (i.e., parameters of the person-specific
CNN). The experimental results show that the CNN architectures are well associated
with target subjects' personality traits and the proposed approach clearly outperforms
multiple existing approaches that predict personality directly from non-verbal behaviours.
In light of these findings, this work opens up a new avenue of research for predicting
and recognizing socio-emotional phenomena (personality, affect, engagement etc.) from
simulations of person-specific cognitive processes.

Enhancing Knowledge Tracing via Adversarial Training

Xiaopeng Guo
Zhijie Huang
Jie Gao
Mingyu Shang
Maojing Shu
Jun Sun

We study the problem of knowledge tracing (KT) where the goal is to trace the students'
knowledge mastery over time so as to make predictions on their future performance.
Owing to the good representation capacity of deep neural networks (DNNs), recent advances
on KT have increasingly concentrated on exploring DNNs to improve the performance
of KT. However, we empirically reveal that DNN-based KT models may run the risk
of overfitting, especially on small datasets, leading to limited generalization. In
this paper, by leveraging the current advances in adversarial training (AT), we propose
an efficient AT-based KT method (ATKT) to enhance the KT model's generalization and thus
push the limit of KT. Specifically, we first construct adversarial perturbations and
add them to the original interaction embeddings as adversarial examples. The original
and adversarial examples are further used to jointly train the KT model, forcing it
not only to be robust to the adversarial examples, but also to enhance the generalization
over the original ones. To better implement AT, we then present an efficient attentive-LSTM
model as KT backbone, where the key is a proposed knowledge hidden state attention
module that adaptively aggregates information from previous knowledge hidden states
while simultaneously highlighting the importance of current knowledge hidden state
to make a more accurate prediction. Extensive experiments on four public benchmark
datasets demonstrate that our ATKT achieves new state-of-the-art performance. Code
is available at: https://github.com/xiaopengguo/ATKT.
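
The idea of perturbing interaction embeddings and training on both clean and adversarial examples can be sketched as follows. The abstract does not say how the perturbation is constructed; this sketch assumes a common choice (a step of fixed L2 norm along the loss gradient) and replaces the KT model with a tiny logistic regression, so all names and numbers here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(weights, embedding, label):
    """Binary cross-entropy of a logistic model on one embedding."""
    p = sigmoid(sum(w * e for w, e in zip(weights, embedding)))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def loss_gradient_wrt_embedding(weights, embedding, label):
    """For logistic regression, dL/de = (p - y) * w."""
    p = sigmoid(sum(w * e for w, e in zip(weights, embedding)))
    return [(p - label) * w for w in weights]

def adversarial_embedding(weights, embedding, label, epsilon):
    """Move the embedding by epsilon along the normalised loss gradient."""
    grad = loss_gradient_wrt_embedding(weights, embedding, label)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # guard against a zero gradient
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

weights = [0.5, -1.0, 2.0]   # hypothetical stand-in for the KT model
embedding = [1.0, 0.2, 0.3]  # hypothetical interaction embedding
label = 1                    # the student answered correctly

adv = adversarial_embedding(weights, embedding, label, epsilon=0.1)
clean_loss = bce_loss(weights, embedding, label)
adv_loss = bce_loss(weights, adv, label)
joint_loss = clean_loss + adv_loss  # assumed joint objective: sum of both losses

print(adv_loss > clean_loss)  # True
```

The perturbed embedding is harder for the model by construction, so minimising the joint loss pushes the model to stay accurate within a small neighbourhood of each training embedding.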

Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA

Gangyan Zeng
Yuan Zhang
Yu Zhou
Xiaomeng Yang

Text-based visual question answering (TextVQA) requires analyzing both the visual
contents and texts in an image to answer a question, which is more practical than
general visual question answering (VQA). Existing efforts tend to regard optical character
recognition (OCR) as a pre-processing step and then combine it with a VQA framework. This
makes the performance of multimodal reasoning and question answering highly dependent
on the accuracy of OCR. In this work, we address this issue from two perspectives.
First, we take advantage of multimodal cues to complete the semantic information
of texts. A visually enhanced text embedding is proposed to enable understanding of
texts without accurately recognizing them. Second, we further leverage rich contextual
information to modify the answer texts even if the OCR module does not correctly recognize
them. In addition, the visual objects are endowed with semantic representations to
place objects in the same semantic space as OCR tokens. Equipped with these techniques,
the cumulative error propagation caused by poor OCR performance is effectively suppressed.
Extensive experiments on TextVQA and ST-VQA datasets demonstrate that our approach
achieves state-of-the-art performance in terms of accuracy and robustness.

SESSION: Poster Session 1

JPGNet: Joint Predictive Filtering and Generative Network for Image Inpainting

Qing Guo
Xiaoguang Li
Felix Juefei-Xu
Hongkai Yu
Yang Liu
Song Wang

Image inpainting aims to restore the missing regions of corrupted images and make
the recovery result identical to the originally complete image, which is different
from the common generative task emphasizing the naturalness or realism of generated
images. Nevertheless, existing works usually regard it as a pure generation problem
and employ cutting-edge deep generative techniques to address it. The generative networks
can fill the main missing parts with realistic contents but usually distort the local
structures or introduce obvious artifacts. In this paper, for the first time, we formulate
image inpainting as a mix of two problems, i.e., predictive filtering and deep generation.
Predictive filtering is good at preserving local structures and removing artifacts
but falls short in completing large missing regions. The deep generative network
can fill the numerous missing pixels based on the understanding of the whole scene
but hardly restores the details identical to the original ones. To make use of their
respective advantages, we propose the joint predictive filtering and generative network
(JPGNet) that contains three branches: predictive filtering & uncertainty network
(PFUNet), deep generative network, and uncertainty-aware fusion network (UAFNet).
The PFUNet can adaptively predict pixel-wise kernels for filtering-based inpainting
according to the input image and output an uncertainty map. This map indicates whether
pixels should be processed by the filtering or generative networks, and is further fed
to the UAFNet for a smart combination of filtering and generative results. Note
that our method, as a novel framework for the image inpainting problem, can benefit
any existing generation-based method. We validate our method on three public datasets,
i.e., Dunhuang, Places2, and CelebA, and demonstrate that our method can enhance three
state-of-the-art generative methods (i.e., StructFlow, EdgeConnect, and RFRNet) significantly
with slight extra time cost. We have released the code at https://github.com/tsingqguo/jpgnet.
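
To make the role of the uncertainty map concrete, here is a deliberately simplified sketch in which a fixed per-pixel convex combination stands in for the learned UAFNet. The function name `uncertainty_fusion` and the convention that a value near 1 means "trust the generative branch" are assumptions, not details from the paper.

```python
import numpy as np

def uncertainty_fusion(filtered, generated, uncertainty):
    """Blend two (H, W, C) images with an (H, W) uncertainty map in [0, 1].

    Assumed convention: high uncertainty in the filtering result means the
    generative result is preferred at that pixel.
    """
    u = np.clip(uncertainty, 0.0, 1.0)[..., None]  # broadcast over channels
    return (1.0 - u) * filtered + u * generated

filtered = np.zeros((2, 2, 3))    # placeholder filtering-branch output
generated = np.ones((2, 2, 3))    # placeholder generative-branch output
uncertainty = np.array([[0.0, 1.0],
                        [0.25, 0.5]])

fused = uncertainty_fusion(filtered, generated, uncertainty)
```

A learned fusion network can model far richer interactions than this blend; the sketch only shows how a per-pixel map can arbitrate between the two branches.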

AdvFilter: Predictive Perturbation-aware Filtering against Adversarial Attack via
Multi-domain Learning

Yihao Huang
Qing Guo
Felix Juefei-Xu
Lei Ma
Weikai Miao
Yang Liu
Geguang Pu

High-level representation-guided pixel denoising and adversarial training are independent
solutions to enhance the robustness of CNNs against adversarial attacks by pre-processing
input data and re-training models, respectively. Most recently, adversarial training
techniques have been widely studied and improved while pixel denoising-based methods
have attracted less attention. However, it is still questionable whether there exists
a more advanced pixel denoising-based method and whether the combination of the two
solutions benefits each other. To this end, we first comprehensively investigate two
kinds of pixel denoising methods for adversarial robustness enhancement (i.e., existing
additive-based and unexplored filtering-based methods) under the loss functions of
image-level and semantic-level, respectively, showing that pixel-wise filtering can
obtain much higher image quality (e.g., higher PSNR) as well as higher robustness
(e.g., higher accuracy on adversarial examples) than existing pixel-wise additive-based
methods. However, we also observe that the robustness results of the filtering-based
method rely on the perturbation amplitude of adversarial examples used for training.
To address this problem, we propose predictive perturbation-aware & pixel-wise filtering,
where dual-perturbation filtering and an uncertainty-aware fusion module are designed
and employed to automatically perceive the perturbation amplitude during the training
and testing process. The method is termed AdvFilter. Moreover, we combine adversarial
pixel denoising methods with three adversarial training-based methods, hinting that
considering data and models jointly is able to achieve more robust CNNs. Experiments
conducted on the NeurIPS-2017DEV, SVHN and CIFAR10 datasets show advantages in enhancing
CNNs' robustness and high generalization to different models and noise levels.

Pixel-level Intra-domain Adaptation for Semantic Segmentation

Zizheng Yan
Xianggang Yu
Yipeng Qin
Yushuang Wu
Xiaoguang Han
Shuguang Cui

Recent advances in unsupervised domain adaptation have achieved remarkable performance
on semantic segmentation tasks. Despite such progress, existing works mainly focus
on bridging the inter-domain gaps between the source and target domains, while only
a few of them have noticed the intra-domain gaps within the target data. In this work, we
propose a pixel-level intra-domain adaptation approach to reduce the intra-domain
gaps within the target data. Compared with image-level methods, ours treats each pixel
as an instance, which adapts the segmentation model at a more fine-grained level.
Specifically, we first conduct the inter-domain adaptation between the source and
target domains; then, we separate the pixels in target images into the easy and hard
subdomains; finally, we propose a pixel-level adversarial training strategy to adapt
a segmentation network from the easy to the hard subdomain. Moreover, we show that
the segmentation accuracy can be further improved by incorporating a continuous indexing
technique in the adversarial training. Experimental results show the effectiveness
of our method against existing state-of-the-art approaches.
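
One plausible way to separate target pixels into easy and hard subdomains is to rank them by prediction entropy. The abstract does not state the criterion used, so the entropy ranking, the function name `split_easy_hard`, and the `easy_ratio` parameter below are all assumptions for illustration.

```python
import numpy as np

def split_easy_hard(probs, easy_ratio=0.5):
    """Return a boolean (H, W) mask that is True for 'easy' pixels.

    probs: (H, W, C) per-pixel class probabilities from a segmentation model.
    Pixels whose entropy falls at or below the easy_ratio quantile are easy.
    """
    eps = 1e-12  # avoids log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    threshold = np.quantile(entropy, easy_ratio)
    return entropy <= threshold

# Toy 2x2 prediction with two classes: confident pixels sit in the left column.
probs = np.array([[[0.99, 0.01], [0.5, 0.5]],
                  [[0.90, 0.10], [0.6, 0.4]]])
easy = split_easy_hard(probs)
hard = ~easy
```

The resulting masks could then label pixels for a pixel-level domain discriminator, with easy pixels playing the role of the source and hard pixels the role of the target.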

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

Xugong Qin
Yu Zhou
Youhui Guo
Dayan Wu
Zhihong Tian
Ning Jiang
Hongbin Wang
Weiping Wang

Owing to its great success in object detection and instance segmentation, Mask R-CNN
has attracted great attention and is widely adopted as a strong baseline for arbitrary-shaped
scene text detection and spotting. However, two issues remain to be settled. The first
is the dense text case, which is easily neglected but quite practical. There may exist
multiple instances in one proposal, which makes it difficult for the mask head to
distinguish different instances and degrades the performance. In this work, we argue
that the performance degradation results from the learning confusion issue in the
mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in
the mask head, which alleviates the issue and promotes robustness significantly. We
also propose instance-aware mask learning in which the mask head learns to predict the
shape of the whole instance rather than classify each pixel as text or non-text. With
instance-aware mask learning, the mask branch can learn separated and compact masks.
The second is that due to large variations in scale and aspect ratio, RPN needs complicated
anchor settings, making it hard to maintain and transfer across different datasets.
To settle this issue, we propose an adaptive label assignment in which all instances,
especially those with extreme aspect ratios, are guaranteed to be associated with enough
anchors. Equipped with these components, the proposed method named MAYOR achieves
state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015,
CTW1500, and Total-Text.

Windowing Decomposition Convolutional Neural Network for Image Enhancement

Chuanjun Zheng
Daming Shi
Yukun Liu

Image enhancement aims to improve the aesthetic quality of images. Most enhancement
methods are based on image decomposition techniques. For example, an entire image
can be decomposed into a smooth base layer and a residual detail layer. Applying appropriate
algorithms to different layers can solve most enhancement problems. Besides decomposing
the entire image, the local decomposition approach in the local Laplacian filter can also
achieve satisfactory enhancement results. As a standard convolution is also a local operator
whose output values are determined by neighboring pixels, we observe that the standard
convolution can be improved by integrating the local decomposition method for better
solving image enhancement problems. Based on this analysis, we propose Windowing Decomposition
Convolution (WDC) that decomposes the content of each convolution window by a windowing
basic value before applying the convolution operation. Using different windowing basic
values, the WDC can gather global information and locally separate the processing
of different components of images. Moreover, combined with WDC, a new Windowing Decomposition
Convolutional Neural Network (WDCNN) is presented. The experimental results show that
our WDCNN achieves superior enhancement performance on the MIT-Adobe FiveK and sRGB-SID
datasets for noise-free image retouching and low-light noisy image enhancement compared
with state-of-the-art techniques.
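
The idea of decomposing each convolution window before filtering can be sketched as follows. The abstract does not define the "windowing basic value", so this sketch assumes it is either the window's center pixel or the window mean; the function name `windowing_decomposition_conv` and the `basic` argument are likewise hypothetical.

```python
import numpy as np

def windowing_decomposition_conv(image, kernel, basic="center"):
    """Single-channel, 'valid' sliding-window filter with per-window decomposition.

    Each k x k window has a basic value subtracted before it is multiplied by
    the kernel, so the filter responds to local detail rather than the base level.
    """
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + k, j:j + k]
            # Assumed choices for the basic value: center pixel or window mean.
            base = window[k // 2, k // 2] if basic == "center" else window.mean()
            out[i, j] = np.sum((window - base) * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
detail = windowing_decomposition_conv(image, kernel)
shifted = windowing_decomposition_conv(image + 10.0, kernel)
```

Because the basic value is removed inside every window, adding a constant brightness offset to the input leaves the output unchanged, which is the local-decomposition behavior the sketch is meant to show.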

Joint Optimization in Edge-Cloud Continuum for Federated Unsupervised Person Re-identification

Weiming Zhuang
Yonggang Wen
Shuai Zhang

Person re-identification (ReID) aims to re-identify a person from non-overlapping
camera views. Since person ReID data contains sensitive personal information, researchers
have adopted federated learning, an emerging distributed training method, to mitigate
the privacy leakage risks. However, existing studies rely on data labels that are
laborious and time-consuming to obtain. We present FedUReID, a federated unsupervised
person ReID system to learn person ReID models without any labels while preserving
privacy. FedUReID enables in-situ model training on edges with unlabeled data. A cloud
server aggregates models from edges instead of centralizing raw data to preserve data
privacy. Moreover, to tackle the problem that edges vary in data volumes and distributions,
we personalize training in edges with joint optimization of cloud and edge. Specifically,
we propose personalized epoch to reassign computation throughout training, personalized
clustering to iteratively predict suitable labels for unlabeled data, and personalized
update to adapt the server aggregated model to each edge. Extensive experiments on
eight person ReID datasets demonstrate that FedUReID not only achieves higher accuracy
but also reduces computation cost by 29%. Our FedUReID system with the joint optimization
will shed light on applying federated learning to more multimedia tasks without
data labels.
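
Server-side aggregation of edge models can be sketched with a sample-weighted parameter average in the style of federated averaging. Weighting by data volume is an assumption here, since the abstract only says the cloud server aggregates models; the function name `aggregate` and the dictionary layout are also illustrative.

```python
import numpy as np

def aggregate(edge_models, num_samples):
    """Average parameter dictionaries from edges, weighted by local data size."""
    total = float(sum(num_samples))
    aggregated = {}
    for name in edge_models[0]:
        aggregated[name] = sum(
            (n / total) * model[name]
            for model, n in zip(edge_models, num_samples)
        )
    return aggregated

# Two hypothetical edges holding 100 and 300 unlabeled images.
edge_a = {"w": np.array([1.0, 1.0])}
edge_b = {"w": np.array([3.0, 3.0])}
global_model = aggregate([edge_a, edge_b], [100, 300])
```

Only model parameters cross the network in this scheme; the raw images stay on each edge, which is the privacy property the abstract relies on.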

Multi-view 3D Smooth Human Pose Estimation based on Heatmap Filtering and Spatio-temporal
Information

Zehai Niu
Ke Lu
Jian Xue
Haifeng Ma
Runchen Wei

The estimation of 3D human poses from time-synchronized, calibrated multi-view video
usually consists of two steps: (1) a 2D detector to locate the 2D coordinate point
position of the joint via heatmaps for each frame and (2) a post-processing method
such as the recursive pictorial structure model or robust triangulation to obtain
3D coordinate points. However, most existing methods are based on a single frame only.
They do not take advantage of the temporal characteristics of the video sequence itself,
and must rely on post-processing algorithms. They are also susceptible to human self-occlusion,
and the generated sequences suffer from jitter. Therefore, we propose a network model
incorporating spatial and temporal features. Using a coarse-to-fine approach, the
proposed heatmap temporal network (HTN) generates temporal heatmap information, with
an occlusion heatmap filter used to filter low-quality heatmaps before they are sent
to the HTN. The heatmap fusion and the triangulation weights are dynamically adjusted,
and intermediate supervision is employed to enable better integration of temporal
and spatial information. Our network is also end-to-end differentiable. This overcomes
the long-standing problem of skeleton jitter being generated and ensures that the
sequence is smooth and stable.
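
For readers unfamiliar with the triangulation step, the following sketch shows weighted linear (DLT) triangulation of a single joint from calibrated views. It is a generic textbook formulation, not the paper's implementation; the per-view `weights` are fixed inputs here, whereas the abstract describes them as dynamically adjusted, and the function names are assumptions.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(points_2d, projections, weights=None):
    """Weighted DLT: solve A X = 0 for the homogeneous 3D point X."""
    n = len(points_2d)
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    rows = []
    for (u, v), P, w in zip(points_2d, projections, weights):
        # Each view contributes two linear constraints, scaled by its weight.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]              # right singular vector of the smallest singular value
    return X[:3] / X[3]

# Two hypothetical cameras one unit apart along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

observations = [project(P1, X_true), project(P2, X_true)]
estimate = triangulate(observations, [P1, P2], weights=[1.0, 0.8])
```

Lowering a view's weight reduces its influence on the solution, which is how an occluded or low-confidence view can be discounted.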

Imitative Learning for Multi-Person Action Forecasting

Yu-Ke Li
Pin Wang
Mang Ye
Ching-Yao Chan

Multi-person action forecasting is an emerging task and a pivotal step towards video
understanding. The major challenge lies in estimating a distribution characterizing
the upcoming actions of all individuals in the scene. The state-of-the-art solutions
attempt to solve this problem via a step-by-step prediction procedure. However, they
are not adequate to address some particular limitations, such as the compounding errors,
the innate uncertainty of the future and the spatio-temporal contexts. To handle the
multi-person action forecasting challenges, we put forth a novel imitative learning
framework upon the basis of inverse reinforcement learning. Specifically, we aim to
learn a policy to model the aforementioned distribution up to a coming horizon through
an objective that naturally solves the compounding errors. Such a policy is able to
explore multiple plausible futures via extrapolating a series of latent variables
and taking them into account to generate predictions. The impacts of these latent
variables are further investigated by optimizing the directed information. Moreover,
we reason about the spatial context along with the temporal cue in a single pass using
graph-structured data. The experimental outcomes on two large-scale datasets
reveal that our approach yields considerable improvements in terms of both diversity
and quality with respect to recent leading studies.

Stereo Video Super-Resolution via Exploiting View-Temporal Correlations

Ruikang Xu
Zeyu Xiao
Mingde Yao
Yueyi Zhang
Zhiwei Xiong

Stereo Video Super-Resolution (StereoVSR) aims to generate high-resolution video streams
from two low-resolution videos under stereo settings. Existing video super-resolution
and stereo image super-resolution techniques can be extended to tackle the StereoVSR
task, yet they cannot make full use of the multi-view and temporal information to
achieve satisfactory performance. In this paper, we propose a novel Stereo Video Super-Resolution
Network (SVSRNet) to fulfill the StereoVSR task via exploiting view-temporal correlations.
First, we devise a view-temporal attention module (VTAM) to integrate the information
of cross-time-cross-view for constructing high-resolution stereo videos. Second, we
propose a spatial-temporal fusion module (STFM), which aggregates the information
across time in intra-view to emphasize important features for subsequent restoration.
In addition, we design a view-temporal consistency loss function to enforce consistency
constraints on super-resolved stereo videos. Comprehensive experimental results demonstrate
that our method generates superior results.

M3TR: Multi-modal Multi-label Recognition with Transformer

Jiawei Zhao
Yifan Zhao
Jia Li

Multi-label image recognition aims to recognize multiple objects simultaneously in
one image. Recent ideas to solve this problem have focused on learning dependencies
of label co-occurrences to enhance the high-level semantic representations. However,
these methods usually neglect the important relations of intrinsic visual structures
and face difficulties in understanding contextual relationships. To build the global
scope of visual context as well as interactions between visual modality and linguistic
modality, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with
the ternary relationship learning for inter- and intra-modalities. For the intra-modal
relationship, we make insightful conjunction of CNNs and Transformers, which embeds
visual structures into the high-level features by learning the semantic cross-attention.
For constructing the interactions between the visual and linguistic modalities, we
propose a linguistic cross-attention to embed the class-wise linguistic information
into the visual structure learning, and finally present a linguistic guided enhancement
module to enhance the representation of high-level semantics. Experimental evidence
reveals that with the collaborative learning of ternary relationship, our proposed
M3TR achieves new state-of-the-art on two public multi-label recognition benchmarks.
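
The cross-attention mentioned above can be illustrated with plain scaled dot-product attention, where visual tokens act as queries and class-label embeddings act as keys and values. This is a generic sketch, not the M3TR module: the assignment of queries and keys, the absence of learned projections, and all names and shapes are assumptions.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns attended features and weights."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(5, 8))     # 5 hypothetical visual tokens, dim 8
label_embeddings = rng.normal(size=(3, 8))  # 3 hypothetical class embeddings

attended, attn = cross_attention(visual_tokens, label_embeddings, label_embeddings)
```

Each visual token receives a mixture of label embeddings, with mixing weights that sum to one, which is one simple way linguistic information can be injected into visual features.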

TACR-Net: Editing on Deep Video and Voice Portraits

Luchuan Song
Bin Liu
Guojun Yin
Xiaoyi Dong
Yufei Zhang
Jia-Xuan Bai

Utilizing an arbitrary speech clip to edit the mouth of the portrait in the target
video is a novel yet challenging task. Although impressive results have been achieved,
there are still three limitations in the existing methods: 1) since the acoustic features
are not completely decoupled from person identity, there is no global speech to facial
features (i.e., landmarks, expression blendshape) mapping method. 2) the audio-driven
talking face sequences generated by a simple cascade structure usually lack temporal
consistency and spatial correlation, which leads to defects in the consistency of
changes in details. 3) the operation of forgery is always at the video level, without
considering the forgery of the voice, especially the synchronization of the converted
voice and the mouth. To address these distortion problems, we propose a novel deep
learning framework, named Temporal-Refinement Autoregressive-Cascade Rendering Network
(TACR-Net) for audio-driven dynamic talking face editing. The proposed TACR-Net encodes
facial expression blendshape based on the given acoustic features without separately
training for a specific video. TACR-Net also involves a novel autoregressive cascade
structure generator for video re-rendering. Finally, we transform the in-the-wild
speech to the target portrait and obtain a photo-realistic and audio-realistic video.

Annotation-Efficient Untrimmed Video Action Recognition

Yixiong Zou
Shanghang Zhang
Guangyao Chen
Yonghong Tian
Kurt Keutzer
José M. F. Moura

Deep learning has achieved great success in recognizing video actions, but the collection
and annotation of training data are still quite laborious, which mainly lies in two
aspects: (1) the amount of required annotated data is large; (2) temporally annotating
the location of each action is time-consuming. Works such as few-shot learning or
untrimmed video recognition have been proposed to handle either one aspect or the
other. However, very few existing works can handle both issues simultaneously. In
this paper, we target a new problem, Annotation-Efficient Video Recognition, to reduce
the requirement of annotations for both the large amount of samples and the action location.
Such a problem is challenging due to two aspects: (1) the untrimmed videos only have
weak supervision; (2) video segments not relevant to current actions of interest
(background, BG) could contain actions of interest (foreground, FG) in novel classes,
which is a widely existing phenomenon but has rarely been studied in few-shot untrimmed
video recognition. To achieve this goal, by analyzing the property of BG, we categorize
BG into informative BG (IBG) and non-informative BG (NBG), and we propose (1) an open-set
detection based method to find the NBG and FG, (2) a contrastive learning method to
learn IBG and distinguish NBG in a self-supervised way, and (3) a self-weighting mechanism
for the better distinguishing of IBG and FG. Extensive experiments on ActivityNet
v1.2 and ActivityNet v1.3 verify the rationale and effectiveness of the proposed methods.

Face-based Voice Conversion: Learning the Voice behind a Face

Hsiao-Han Lu
Shao-En Weng
Ya-Fan Yen
Hong-Han Shuai
Wen-Huang Cheng

Zero-shot voice conversion (VC) trained by non-parallel data has gained a lot of attention
in recent years. Previous methods usually extract speaker embeddings from audios and
use them for converting the voices into different voice styles. Since there is a strong
relationship between human faces and voices, a promising approach would be to synthesize
various voice characteristics from face representation. Therefore, we introduce a
novel idea of generating different voice styles from different human face photos,
which can facilitate new applications, e.g., personalized voice assistants. However,
the audio-visual relationship is implicit. Moreover, the existing VCs are trained
on laboratory-collected datasets without speaker photos, while the datasets with both
photos and audios are in-the-wild datasets. Directly replacing the target audio with
the target photo and training on the in-the-wild dataset leads to noisy results. To
address these issues, we propose a novel many-to-many voice conversion network, namely
Face-based Voice Conversion (FaceVC), with a 3-stage training strategy. Quantitative
and qualitative experiments on the LRS3-Ted dataset show that the proposed FaceVC
successfully performs voice conversion according to the target face photos. Audio
samples can be found on the demo website at https://facevc.github.io/.

A Large-Scale Benchmark for Food Image Segmentation

Xiongwei Wu
Xin Fu
Ying Liu
Ee-Peng Lim
Steven C.H. Hoi
Qianru Sun

Food image segmentation is a critical and indispensable task for developing health-related
applications such as estimating food calories and nutrients. Existing food image segmentation
models are underperforming due to two reasons: (1) there is a lack of high-quality
food image datasets with fine-grained ingredient labels and pixel-wise location masks, as the
existing datasets either carry coarse ingredient labels or are small in size; and
(2) the complex appearance of food makes it difficult to localize and recognize ingredients
in food images, e.g., the ingredients may overlap one another in the same image, and
the identical ingredient may appear distinctly in different food images.

In this work, we build a new food image dataset FoodSeg103 (and its extension FoodSeg154)
containing 9,490 images. We annotate these images with 154 ingredient classes and
each image has an average of 6 ingredient labels and pixel-wise masks. In addition,
we propose a multi-modality pre-training approach called ReLeM that explicitly equips
a segmentation model with rich and semantic food knowledge. In experiments, we use
three popular semantic segmentation methods (i.e., Dilated Convolution based [20],
Feature Pyramid based [25], and Vision Transformer based [60]) as baselines, and evaluate
them as well as ReLeM on our new datasets. We believe that the FoodSeg103 (and its
extension FoodSeg154) and the pre-trained models using ReLeM can serve as a benchmark
to facilitate future works on fine-grained food image understanding. We make all these
datasets and methods public at https://xiongweiwu.github.io/foodseg103.html.

HAT: Hierarchical Aggregation Transformers for Person Re-identification

Guowen Zhang
Pingping Zhang
Jinqing Qi
Huchuan Lu

Recently, with the advance of deep Convolutional Neural Networks (CNNs), person Re-Identification
(Re-ID) has witnessed great success in various applications. However, with limited
receptive fields of CNNs, it is still challenging to extract discriminative representations
in a global view for persons under non-overlapped cameras. Meanwhile, Transformers
demonstrate strong abilities of modeling long-range dependencies for spatial and sequential
data. In this work, we take advantage of both CNNs and Transformers, and propose a
novel learning framework named Hierarchical Aggregation Transformer (HAT) for image-based
person Re-ID with high performance. To achieve this goal, we first propose a Deeply
Supervised Aggregation (DSA) to recurrently aggregate hierarchical features from CNN
backbones. With multi-granularity supervision, the DSA can enhance multi-scale features
for person retrieval, which is very different from previous methods. Then, we introduce
a Transformer-based Feature Calibration (TFC) to integrate low-level detail information
as the global prior for high-level semantic information. The proposed TFC is inserted
into each level of hierarchical features, resulting in great performance improvements. To
the best of our knowledge, this work is the first to take advantage of both CNNs and Transformers
for image-based person Re-ID. Comprehensive experiments on four large-scale Re-ID benchmarks
demonstrate that our method shows better results than several state-of-the-art methods. The
code is released at https://github.com/AI-Zhpp/HAT.

Long-Range Feature Propagating for Natural Image Matting

Qinglin Liu
Haozhe Xie
Shengping Zhang
Bineng Zhong
Rongrong Ji

Natural image matting estimates the alpha values of unknown regions in the trimap.
Recently, deep learning based methods propagate the alpha values from the known regions
to unknown regions according to the similarity between them. However, we find that
more than 50% of pixels in the unknown regions cannot be correlated to pixels in known
regions due to the limitation of small effective receptive fields of common convolutional
neural networks, which leads to inaccurate estimation when the pixels in the unknown
regions cannot be inferred only with pixels in the receptive fields. To solve this
problem, we propose Long-Range Feature Propagating Network (LFPNet), which learns
the long-range context features outside the receptive fields for alpha matte estimation.
Specifically, we first design the propagating module which extracts the context features
from the downsampled image. Then, we present Center-Surround Pyramid Pooling (CSPP)
that explicitly propagates the context features from the surrounding context image
patch to the inner center image patch. Finally, we use the matting module which takes
the image, trimap and context features to estimate the alpha matte. Experimental results
demonstrate that the proposed method performs favorably against the state-of-the-art
methods on the AlphaMatting and Adobe Image Matting datasets.

Towards Controllable and Photorealistic Region-wise Image Manipulation

Ansheng You
Chenglin Zhou
Qixuan Zhang
Lan Xu

Adaptive and flexible image editing is a desirable function of modern generative models.
In this work, we present a generative model with auto-encoder architecture for per-region
style manipulation. We apply a code consistency loss to enforce an explicit disentanglement
between content and style latent representations, making the content and style of
generated samples consistent with their corresponding content and style references.
The model is also constrained by a content alignment loss to ensure the foreground
editing will not interfere with background contents. As a result, given region-of-interest
masks provided by users, our model supports foreground region-wise style transfer.
Notably, our model receives no extra annotations such as semantic labels except
for self-supervision. Extensive experiments show the effectiveness of the proposed
method and exhibit the flexibility of the proposed model for various applications,
including region-wise style editing, latent space interpolation, and cross-domain style
transfer.

Information-Growth Attention Network for Image Super-Resolution

Zhuangzi Li
Ge Li
Thomas Li
Shan Liu
Wei Gao

It is generally known that a high-resolution (HR) image contains more productive information
compared with its low-resolution (LR) versions, so image super-resolution (SR) satisfies
an information-growth process. Considering the property, we attempt to exploit the
growing information via a particular attention mechanism. In this paper, we propose
a concise but effective Information-Growth Attention Network (IGAN) that shows the
incremental information is beneficial for SR. Specifically, a novel information-growth
attention is proposed. It aims to pay attention to features involving large information-growth
capacity by assimilating the difference from current features to the former features
within a network. We also illustrate its effectiveness in contrast to widely used self-attention
using entropy and generalization analysis. Furthermore, existing channel-wise attention
generation modules (CAGMs) have large informational attenuation due to directly calculating
global mean for feature maps. Therefore, we present an innovative CAGM that progressively
decreases feature maps' sizes, leading to more adequate feature exploitation. Extensive
experiments also demonstrate that IGAN outperforms state-of-the-art attention-aware SR
approaches.

Anchor-free 3D Single Stage Detector with Mask-Guided Attention for Point Cloud

Jiale Li
Hang Dai
Ling Shao
Yong Ding

Most of the existing single-stage and two-stage 3D object detectors are anchor-based
methods, while the efficient but challenging anchor-free single-stage 3D object detection
is not well investigated. Recent studies on 2D object detection show that the anchor-free
methods also have great potential. However, the unordered and sparse properties
of point clouds prevent us from directly leveraging the advanced 2D methods on 3D
point clouds. We overcome this by converting the voxel-based sparse 3D feature volumes
into the sparse 2D feature maps. We propose an attentive module to densify the sparse
feature maps, mostly on the object regions, through the deformable convolution
tower and the supervised mask-guided attention. By directly regressing the 3D bounding
box from the enhanced and dense feature maps, we construct a novel single-stage 3D
detector for point clouds in an anchor-free manner. We propose an IoU-based detection
confidence re-calibration scheme to improve the correlation between the detection
confidence score and the accuracy of the bounding box regression. Our code is publicly
available at https://github.com/jialeli1/MGAF-3DSSD.
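
One common way to tie detection confidence to localization quality is to scale the classification score by a power of the predicted IoU. The abstract does not give the exact formula, so the multiplicative form, the exponent `beta`, and the function name `recalibrate` below are assumptions.

```python
def recalibrate(cls_score, pred_iou, beta=4.0):
    """Down-weight a classification score by the predicted box IoU.

    pred_iou is clamped to [0, 1]; a larger beta penalizes poorly
    localized boxes more strongly.
    """
    iou = min(max(pred_iou, 0.0), 1.0)
    return cls_score * iou ** beta

# A confident but poorly localized box versus a less confident, well-localized one.
loose_box = recalibrate(cls_score=0.9, pred_iou=0.5, beta=2.0)
tight_box = recalibrate(cls_score=0.7, pred_iou=0.95, beta=2.0)
```

After recalibration the well-localized box ranks above the poorly localized one, so ranking and non-maximum suppression favor boxes whose regression is more accurate.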

Shape Controllable Virtual Try-on for Underwear Models

Xin Gao
Zhenjiang Liu
Zunlei Feng
Chengji Shen
Kairi Ou
Haihong Tang
Mingli Song

The image virtual try-on task has abundant applications and has become a hot research
topic recently. Existing 2D image-based virtual try-on methods aim to transfer a target
clothing image onto a reference person, which has two main disadvantages: the size and
length of the clothing cannot be controlled precisely, and the user's figure cannot be
accurately estimated when users wear thick clothing, resulting in an inaccurate dressing effect.
In this paper, we put forward a related task that aims to dress clothing for underwear
models. To solve the above drawbacks, we propose a Shape Controllable Virtual Try-On
Network (SC-VTON), where a graph attention network integrates the information of model
and clothing to generate the warped clothing image. In addition, the control points
are incorporated into SC-VTON for the desired clothing shape. Furthermore, by adding
a Splitting Network and a Synthesis Network, we can use in-shop clothing/model pair
data to help optimize the deformation module and generalize the task to the typical
virtual try-on task. Extensive experiments show that the proposed method can achieve
accurate shape control. Meanwhile, compared with other methods, our method can generate
high-resolution results with detailed textures, which can be applied in real applications.

E2Net: Excitative-Expansile Learning for Weakly Supervised Object Localization

Zhiwei Chen
Liujuan Cao
Yunhang Shen
Feihong Lian
Yongjian Wu
Rongrong Ji

Weakly supervised object localization (WSOL), which seeks to train localizers with only
image-level labels, has recently gained popularity. However, because they rely heavily
on the classification objective for training, prevailing WSOL methods localize only the
discriminative parts of an object, ignoring other useful information, such as the wings of a bird, and
suffer from severe rotation variations. Moreover, learning object localization forces
CNNs to attend to non-salient regions under weak supervision, which may negatively influence
image classification results. To address these challenges, this paper proposes a novel
end-to-end Excitation-Expansion network, coined E$^2$Net, to localize entire objects
with only image-level labels, which serves as the basis of most multimedia tasks. The
proposed E$^2$Net consists of two key components: Maxout-Attention Excitation (MAE)
and Orientation-Sensitive Expansion (OSE). First, the MAE module aims to activate non-discriminative
localization features while simultaneously recovering discriminative classification
cues. To this end, we couple an erasing strategy with maxout learning efficiently to
facilitate entire-object localization without hurting classification accuracy. Second,
to address rotation variations, the proposed OSE module expands less salient object
parts along all possible orientations. In particular, the OSE module dynamically combines
selective attention banks from variously oriented expansions of the receptive field, which
introduces additional multi-parallel localization heads. Extensive experiments on
ILSVRC 2012 and CUB-200-2011 demonstrate that the proposed E$^2$Net outperforms the
previous state-of-the-art WSOL methods and also significantly improves classification
performance.

Few-shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive
Meta-Learning

Jiahao Wang
Yunhong Wang
Sheng Liu
Annan Li

Fine-grained action recognition is attracting increasing attention due to the emerging
demand of specific action understanding in real-world applications, whereas the data
of rare fine-grained categories is very limited. Therefore, we propose the few-shot
fine-grained action recognition problem, aiming to recognize novel fine-grained actions
with only a few samples given for each class. Although progress has been made on coarse-grained
actions, existing few-shot recognition methods encounter two issues when handling fine-grained
actions: the inability to capture subtle action details and the inadequacy in learning
from data with low inter-class variance. To tackle the first issue, a human vision
inspired bidirectional attention module (BAM) is proposed. Combining top-down task-driven
signals with bottom-up salient stimuli, BAM captures subtle action details by accurately
highlighting informative spatio-temporal regions. To address the second issue, we
introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based
method, CML generates more discriminative video representations for low inter-class
variance data, since it makes full use of potential contrastive pairs in each training
episode. Furthermore, to fairly compare different models, we establish specific benchmark
protocols on two large-scale fine-grained action recognition datasets. Extensive experiments
show that our method consistently achieves state-of-the-art performance across evaluated
tasks.
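
For context, the ProtoNet-style baseline that the abstract compares against classifies a query by its distance to per-class mean embeddings (prototypes). The following is a minimal sketch of that baseline idea only, assuming toy 2-D embeddings and hypothetical function names. It does not implement BAM or CML.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_by_prototype(support, query):
    """Return the label whose prototype (mean support embedding) is nearest.

    support: dict mapping class label -> list of embedding vectors.
    """
    prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: euclidean(prototypes[label], query))


# Toy 2-way 2-shot episode with made-up embeddings.
support_set = {
    "class_a": [[0.0, 0.0], [0.2, 0.0]],
    "class_b": [[1.0, 1.0], [1.0, 0.8]],
}
print(classify_by_prototype(support_set, [0.9, 0.9]))  # prints "class_b"
```

Because each class is summarized by a single mean vector, classes with low inter-class variance can end up with nearly identical prototypes, which is the weakness the abstract attributes to this family of methods.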

Selective Dependency Aggregation for Action Classification

Yi Tan
Yanbin Hao
Xiangnan He
Yinwei Wei
Xun Yang

Video data differ from images in their extra temporal dimension, which results
in more content dependencies from various perspectives. This increases the difficulty
of learning representations for various video actions. Existing methods mainly focus
on the dependency under a specific perspective, which cannot facilitate the categorization
of complex video actions. This paper proposes a novel selective dependency aggregation
(SDA) module, which adaptively exploits multiple types of video dependencies to refine
the features. Specifically, we empirically investigate various long-range and short-range
dependencies achieved by the multi-direction multi-scale feature squeeze and the dependency
excitation. Query structured attention is then adopted to fuse them selectively, fully
considering the diversity of videos' dependency preferences. Moreover, a channel
reduction mechanism is used in SDA to keep the additional computation
cost low. Finally, we show that the SDA module can be easily plugged
into different backbones to form SDA-Nets and demonstrate its effectiveness, efficiency
and robustness by conducting extensive experiments on several video benchmarks for
action classification. The code and models will be available at https://github.com/ty-97/SDA.

Conditional Directed Graph Convolution for 3D Human Pose Estimation

Wenbo Hu
Changgong Zhang
Fangneng Zhan
Lei Zhang
Tien-Tsin Wong

Graph convolutional networks have significantly improved 3D human pose estimation
by representing the human skeleton as an undirected graph. However, this representation
fails to reflect the articulated characteristic of human skeletons as the hierarchical
orders among the joints are not explicitly presented. In this paper, we propose to
represent the human skeleton as a directed graph with the joints as nodes and bones
as edges that are directed from parent joints to child joints. By so doing, the directions
of edges can explicitly reflect the hierarchical relationships among the nodes. Based
on this representation, we further propose a spatial-temporal conditional directed
graph convolution to leverage varying non-local dependence for different poses by
conditioning the graph topology on input poses. Altogether, we form a U-shaped network,
named U-shaped Conditional Directed Graph Convolutional Network, for 3D human pose
estimation from monocular videos. To evaluate the effectiveness of our method, we
conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M
and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves
top performance. Also, ablation studies show that directed graphs can better exploit
the hierarchy of articulated human skeletons than undirected graphs, and the conditional
connections can yield adaptive graph topologies for different poses.
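
The directed skeleton representation described above can be illustrated with a small adjacency matrix. This is a minimal sketch, assuming a hypothetical five-joint skeleton given as a parent-index array. The joint layout and function name are illustrative and are not taken from the paper.

```python
def directed_adjacency(parents):
    """Build a directed adjacency matrix with edges from parent to child.

    parents[i] is the parent joint index of joint i, or -1 for the root.
    """
    num_joints = len(parents)
    adjacency = [[0] * num_joints for _ in range(num_joints)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adjacency[parent][child] = 1  # bone directed parent -> child
    return adjacency


# Hypothetical joints: 0 pelvis (root), 1 spine, 2 head, 3 left hip, 4 left knee.
parent_of = [-1, 0, 1, 0, 3]
adjacency = directed_adjacency(parent_of)
for row in adjacency:
    print(row)
```

Unlike the symmetric matrix of an undirected graph, this matrix is asymmetric, so the parent-child order of the joints is encoded directly in the graph structure.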

Cross Chest Graph for Disease Diagnosis with Structural Relational Reasoning

Gangming Zhao

Locating lesions is important in the computer-aided diagnosis of X-ray images. However,
box-level annotation is time-consuming and laborious. How to locate lesions accurately
with few, or even no, careful annotations is an urgent problem. Although several
works have approached this problem with weakly-supervised methods, the performance
needs to be improved. One obstacle is that general weakly-supervised methods have
failed to consider the characteristics of X-ray images, such as the highly-structural
attribute. We therefore propose the Cross-chest Graph (CCG), which improves the performance
of automatic lesion detection by imitating a doctor's training and decision-making process.
CCG models the intra-image relationship between different anatomical areas by leveraging
the structural information to simulate the doctor's habit of observing different areas.
Meanwhile, the relationship between any pair of images is modeled by a knowledge-reasoning
module to simulate the doctor's habit of comparing multiple images. We integrate intra-image
and inter-image information into a unified end-to-end framework. Experimental results
on the NIH Chest-14 database (112,120 frontal-view X-ray images with 14 diseases)
demonstrate that the proposed method achieves state-of-the-art performance in weakly-supervised
localization of lesions by absorbing professional knowledge in the medical field.

ZiGAN: Fine-grained Chinese Calligraphy Font Generation via a Few-shot Style Transfer
Approach

Qi Wen
Shuang Li
Bingfeng Han
Yi Yuan

Chinese character style transfer is a very challenging problem because of the complexity
of the glyph shapes or underlying structures and the large number of existing characters,
compared with English letters. Moreover, the handwriting of calligraphy masters
has more irregular strokes and is difficult to obtain in real-world scenarios. Recently,
several GAN-based methods have been proposed for font synthesis, but some of them
require numerous reference data and others have cumbersome preprocessing
steps that divide the character into different parts to be learned and transferred separately.
In this paper, we propose a simple but powerful end-to-end Chinese calligraphy font
generation framework ZiGAN, which does not require any manual operation or redundant
preprocessing to generate fine-grained target style characters with few-shot references.
To be specific, a few paired samples from different character styles are leveraged
to attain fine-grained correlation between structures underlying different glyphs.
To capture valuable style knowledge in the target and strengthen the coarse-grained understanding
of character content, we utilize multiple unpaired samples to align the feature distributions
belonging to different character styles. By doing so, only a few target Chinese calligraphy
characters are needed to generate the expected style-transferred characters. Experiments
demonstrate that our method has a state-of-the-art generalization ability in few-shot
Chinese character style transfer.

Cycle-Consistent Inverse GAN for Text-to-Image Synthesis

Hao Wang
Guosheng Lin
Steven C. H. Hoi
Chunyan Miao

This paper investigates an open research task of text-to-image synthesis for automatically
generating or manipulating images from text descriptions. Prevailing methods mainly
take the textual descriptions as the conditional input for the GAN generation, and
need to train different models for the text-guided image generation and manipulation
tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse
GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation
tasks. Specifically, we first train a GAN model without text input, aiming to generate
images with high diversity and quality. Then we learn a GAN inversion model to convert
the images back to the GAN latent space and obtain the inverted latent codes for each
image, where we introduce the cycle-consistency training to learn more robust and
consistent inverted latent codes. We further uncover the semantics of the latent space
of the trained GAN model, by learning a similarity model between text representations
and the latent codes. In the text-guided optimization module, we can generate images
with the desired semantic attributes through optimization on the inverted latent codes.
Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our
proposed framework.

Fully Quantized Image Super-Resolution Networks

Hu Wang
Peng Chen
Bohan Zhuang
Chunhua Shen

With the rising popularity of intelligent mobile devices, it is of great practical
significance to develop accurate, real-time and energy-efficient image Super-Resolution
(SR) methods. A prevailing method for improving inference efficiency is model quantization,
which allows for replacing the expensive floating-point operations with efficient
bitwise arithmetic. To date, it is still challenging for quantized SR frameworks to
deliver a feasible accuracy-efficiency trade-off. Here, we propose a Fully Quantized
image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
In particular, we target end-to-end quantized models for all layers, including
skip connections, which has rarely been addressed in the SR quantization literature.
We further identify obstacles faced by low-bit SR networks and propose a novel method
to counteract them. The difficulties arise because 1) in SR networks, owing
to the skip connections, high-resolution feature maps occupy a
huge amount of memory; 2) activation and weight distributions differ vastly
across layers; and 3) the quantization approximation is inaccurate.
We apply our quantization scheme on multiple mainstream super-resolution architectures,
including SRResNet, SRGAN and EDSR. Experimental results show that our FQSR with low-bit
quantization achieves performance on par with the full-precision
counterparts on five benchmark datasets and surpasses state-of-the-art quantized
SR methods with significantly reduced computational cost and memory consumption. Code
is available at https://git.io/JWxPp.
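
To make the idea of low-bit quantization concrete, the following is a minimal sketch of generic uniform (min-max) quantization of a list of values to a given bit width. It is an illustrative assumption, not the FQSR scheme. The function name and the min-max range choice are hypothetical.

```python
def quantize_uniform(values, num_bits):
    """Map floats to integer codes in [0, 2**num_bits - 1] and back.

    Assumption: a simple min-max range; practical schemes often learn or
    calibrate the range per layer instead.
    """
    low, high = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((v - low) / scale) for v in values]
    dequantized = [low + code * scale for code in codes]
    return codes, dequantized


activations = [0.0, 0.4, 1.0]
codes, approx = quantize_uniform(activations, num_bits=2)
print(codes)
print(approx)
```

The gap between `activations` and `approx` is the approximation error that the abstract lists as its third difficulty. With fewer bits the step size `scale` grows, and so does this error.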

AKECP: Adaptive Knowledge Extraction from Feature Maps for Fast and Efficient Channel
Pruning

Haonan Zhang
Longjun Liu
Hengyi Zhou
Wenxuan Hou
Hongbin Sun
Nanning Zheng

Pruning can remove redundant parameters and structures of Deep Neural Networks (DNNs)
to reduce inference time and memory overhead. As an important component of neural
networks, the feature map (FM) has started to be adopted for network pruning. However,
the majority of FM-based pruning methods do not fully investigate effective knowledge
in the FM for pruning. In addition, it is challenging to design a robust pruning criterion
with a small number of images and achieve parallel pruning due to the variability
of FMs. In this paper, we propose Adaptive Knowledge Extraction for Channel Pruning
(AKECP), which can compress the network fast and efficiently. In AKECP, we first investigate
the characteristics of FMs and extract effective knowledge with an adaptive scheme.
Secondly, we formulate the effective knowledge of FMs to measure the importance of
corresponding network channels. Thirdly, thanks to the effective knowledge extraction,
AKECP can efficiently and simultaneously prune all the layers with extremely few images, or
even a single image. Experimental results show that our method can compress various networks
on different datasets without introducing additional constraints, and it advances
the state of the art. Notably, for ResNet-110 on CIFAR-10, AKECP reduces parameters by 59.9%
and FLOPs by 59.8% with negligible accuracy loss. For ResNet-50
on ImageNet, AKECP reduces the memory footprint by 40.5% and FLOPs by 44.1% with
only a 0.32% drop in Top-1 accuracy.
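
Channel pruning of this kind ultimately reduces to ranking channels by an importance score and keeping the highest-ranked ones. The following is a minimal sketch of that selection step, assuming the scores are already available. How AKECP derives its scores from feature maps is not reproduced here, and the function name and example scores are hypothetical.

```python
def channels_to_keep(importance, keep_ratio):
    """Return the sorted indices of the most important channels.

    importance: one score per channel (higher means more important).
    keep_ratio: fraction of channels to retain, in (0, 1].
    """
    num_keep = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:num_keep])


scores = [0.1, 0.9, 0.5, 0.3]  # made-up per-channel importance scores
print(channels_to_keep(scores, keep_ratio=0.5))  # prints [1, 2]
```

Since the selection for one layer depends only on that layer's scores, all layers can in principle be processed independently, which is consistent with the simultaneous pruning the abstract describes.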

Dynamic Momentum Adaptation for Zero-Shot Cross-Domain Crowd Counting

Qiangqiang Wu
Jia Wan
Antoni B. Chan

Zero-shot cross-domain crowd counting is a challenging task where a crowd counting
model is trained on a source domain (i.e., training dataset) and no additional labeled
or unlabeled data is available for fine-tuning the model when testing on an unseen
target domain (i.e., a different testing dataset). The generalisation performance
of existing crowd counting methods is typically limited due to the large gap between
source and target domains. Here, we propose a novel Crowd Counting framework built
upon an external Momentum Template, termed C2MoT, which enables the encoding of domain
specific information via an external template representation. Specifically, the Momentum
Template (MoT) is learned via momentum updates during offline training, and
is then dynamically updated for each test image in online cross-dataset evaluation.
Thanks to the dynamically updated MoT, our C2MoT effectively generates dense target
correspondences that explicitly account for head regions, and then effectively predicts
the density map based on the normalized correspondence map. Experiments on large scale
datasets show that our proposed C2MoT achieves leading zero-shot cross-domain crowd
counting performance without model fine-tuning, while also outperforming domain adaptation
methods that use fine-tuning on target domain data. Moreover, C2MoT also obtains state-of-the-art
counting performance on the source domain.
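
The abstract does not state the update rule for the Momentum Template. A common form of momentum updating is an exponential moving average, sketched below under that assumption. The function name, the momentum value, and the vector representation of the template are all hypothetical.

```python
def momentum_update(template, observation, momentum=0.9):
    """Blend a stored template with a new observation.

    Assumption: an exponential moving average, where a higher momentum
    keeps more of the old template.
    """
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(template, observation)]


template = [1.0, 0.0]
for feature in ([0.0, 1.0], [0.0, 1.0]):  # made-up features from two images
    template = momentum_update(template, feature, momentum=0.5)
print(template)  # prints [0.25, 0.75]
```

With an update of this form, the template drifts gradually toward the statistics of the images it sees, which is one way a template could adapt at test time without any fine-tuning of model weights.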

Auto-MSFNet: Search Multi-scale Fusion Network for Salient Object Detection

Miao Zhang
Tingwei Liu
Yongri Piao
Shunyu Yao
Huchuan Lu

Multi-scale feature fusion plays a critical role in salient object detection. Most
existing methods have achieved remarkable performance by exploiting various multi-scale
feature fusion strategies. However, an elegant fusion framework requires expert knowledge
and experience, heavily relying on laborious trial and error. In this paper, we propose
a multi-scale feature fusion framework based on Neural Architecture Search (NAS),
named Auto-MSFNet. First, we design a novel search cell, named FusionCell, to automatically
decide multi-scale feature aggregation. Rather than searching for a single repeatable cell
to be stacked, we allow different FusionCells to flexibly integrate multi-level features.
Simultaneously, considering features generated from CNNs are naturally spatial and
channel-wise, we propose a new search space for efficiently focusing on the most relevant
information. The search space mitigates incomplete object structures or over-predicted
foreground regions caused by progressive fusion. Second, we propose a progressive
polishing loss to further obtain exquisite boundaries by penalizing misalignment of
salient object boundaries. Extensive experiments on five benchmark datasets demonstrate
the effectiveness of the proposed method, which achieves state-of-the-art performance
on four evaluation metrics. The code and results of our method are available at https://github.com/OIPLab-DUT/Auto-MSFNet.

Few-shot Unsupervised Domain Adaptation with Image-to-Class Sparse Similarity Encoding

Shengqi Huang
Wanqi Yang
Lei Wang
Luping Zhou
Ming Yang

This paper investigates a valuable setting called few-shot unsupervised domain adaptation
(FS-UDA), which has not been sufficiently studied in the literature. In this setting,
the source domain data are labelled, but with only a few samples per category, while the target
domain data are unlabelled. To address the FS-UDA setting, we develop a general UDA
model to solve the following two key issues: the few-shot labeled data per category
and the domain adaptation between support and query sets. Our model is general in
that once trained it will be able to be applied to various FS-UDA tasks from the same
source and target domains. Inspired by the recent local descriptor based few-shot
learning (FSL), our general UDA model is fully built upon local descriptors (LDs)
for image classification and domain adaptation. By proposing a novel concept called
similarity patterns (SPs), our model not only effectively considers the spatial relationship
of LDs that was ignored in previous FSL methods, but also makes the learned image
similarity better serve the required domain alignment. Specifically, we propose a
novel IMage-to-class sparse Similarity Encoding (IMSE) method. It learns SPs to extract
the local discriminative information for classification and meanwhile aligns the covariance
matrix of the SPs for domain adaptation. Also, domain adversarial training and multi-scale
local feature matching are performed upon LDs. Extensive experiments conducted on
the multi-domain benchmark dataset DomainNet demonstrate the state-of-the-art performance
of our IMSE for the novel setting of FS-UDA. In addition, for FSL, our IMSE also
shows better performance than most recent FSL methods on miniImageNet.

Semantic-aware Transfer with Instance-adaptive Parsing for Crowded Scenes Pose Estimation

Xuanhan Wang
Lianli Gao
Yan Dai
Yixuan Zhou
Jingkuan Song

Human pose estimation in crowded scenes remains challenging, as it requires joint comprehension
of multiple persons and their keypoints in a highly complex scenario. The top-down mechanism,
which is a detect-then-estimate pipeline, has become the mainstream solution for general
pose estimation and has made impressive progress. However, simply applying this mechanism
to crowded-scene pose estimation results in unsatisfactory performance due to several
issues, in particular missing keypoints in crowds and ambiguous labeling
during training. To tackle these two issues, we introduce a novel method named Semantic-aware
Transfer with Instance-adaptive Parsing (STIP). Specifically, our STIP first enhances
the discriminative power of pixel-level representations with a semantic-aware mechanism,
where it smartly decides which pixels to enhance and what semantic embeddings to add.
In this way, the problem of missed keypoint detections can be alleviated. Second, instead of
adopting a standard regressor with fixed parameters, we propose a new instance-adaptive
parsing method, which dynamically generates instance-specific parameters to reduce
the adverse effects caused by ambiguous labeling. Notably, STIP is designed in a plug-in
fashion and can be integrated into any top-down model, such as HRNet. Extensive
experiments on two challenging benchmarks, i.e., CrowdPose and MS-COCO, demonstrate
the superiority and generalizability of our approach.

Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding

Haoyu Zhang
Meng Liu
Zan Gao
Xiaoqiang Lei
Yinglong Wang
Liqiang Nie

Multimodal dialog systems have attracted increasing attention from both academia and
industry over recent years. Although existing methods have achieved some progress,
they are still confronted with challenges in question understanding
(i.e., user intention comprehension). In this paper, we present a relational graph-based
context-aware question understanding scheme, which enhances the user intention comprehension
from local to global. Specifically, we first utilize multiple attribute matrices as
the guidance information to fully exploit the product-related keywords from each textual
sentence, strengthening the local representation of user intentions. Afterwards, we
design a sparse graph attention network to adaptively aggregate effective context
information for each utterance, completely understanding the user intentions from
a global perspective. Moreover, extensive experiments over a benchmark dataset show
the superiority of our model compared with several state-of-the-art baselines.

Shadow Detection via Predicting the Confidence Maps of Shadow Detection Methods

Jingwei Liao
Yanli Liu
Guanyu Xing
Housheng Wei
Jueyu Chen
Songhua Xu

Today's mainstream shadow detection methods are manually designed via a case-by-case
approach. Accordingly, these methods may only be able to detect shadows for specific
scenes. Given the complex and diverse shadow scenes in reality, none of the existing
methods can provide a one-size-fits-all solution with satisfactory performance. To
address this problem, this paper introduces a new concept, named shadow detection
confidence, which can be used to evaluate the effect of any shadow detection method
for any given scene. The best detection effect for a scene is achieved by combining
prediction results by multiple methods. To measure the shadow detection confidence
characteristics of an image, a novel relative confidence map prediction network (RCMPNet)
is proposed. Experimental results show that the proposed method outperforms multiple
state-of-the-art shadow detection methods on four shadow detection benchmark datasets.

Motion Prediction via Joint Dependency Modeling in Phase Space

Pengxiang Su
Zhenguang Liu
Shuang Wu
Lei Zhu
Yifang Yin
Xuanjing Shen

Motion prediction is a classic problem in computer vision, which aims at forecasting
future motion given the observed pose sequence. Various deep learning models have
been proposed, achieving state-of-the-art performance on motion prediction. However,
existing methods typically focus on modeling temporal dynamics in the pose space.
Unfortunately, the complicated and high-dimensional nature of human motion brings
inherent challenges for dynamic context capturing. Therefore, we move away from the
conventional pose based representation and present a novel approach employing a phase
space trajectory representation of individual joints. Moreover, current methods tend
to only consider the dependencies between physically connected joints. In this paper,
we introduce a novel convolutional neural model to effectively leverage explicit prior
knowledge of motion anatomy, and simultaneously capture both spatial and temporal
information of joint trajectory dynamics. We then propose a global optimization module
that learns the implicit relationships between individual joint features. Empirically,
our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M,
CMU MoCap). These results demonstrate that our method sets the new state-of-the-art
on the benchmark datasets. Our code is released at https://github.com/Pose-Group/TEID.

Q-Art Code: Generating Scanning-robust Art-style QR Codes by Deformable Convolution

Hao Su
Jianwei Niu
Xuefeng Liu
Qingfeng Li
Ji Wan
Mingliang Xu

The Quick Response (QR) code is a popular form of matrix barcode that is widely used
to tag online links on print media (e.g., posters, leaflets, and books). However,
standard QR codes typically appear as noise-like black/white squares (named modules)
which seriously disrupt the attractiveness of their carriers. In this paper, we propose
StyleCode-Net, a method to generate novel art-style QR codes which can better match
the entire style of their carriers to improve the visual quality. For endowing QR
codes with artistic elements, a big challenge is that the scanning-robustness must
be preserved after transforming colors and textures. To address this issue, we propose
a module-based deformable convolutional mechanism (MDCM) and a dynamic target mechanism
(DTM) in StyleCode-Net. MDCM can extract the features of black and white modules of
QR codes respectively. Then, the extracted features are fed to DTM to balance the
scanning-robustness and the style representation. Extensive subjective and objective
experiments show that our art-style QR codes have reached the state-of-the-art level
in both visual quality and scanning-robustness, and these codes have the potential
to replace standard QR codes in real-world applications.

Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection

Wenbo Zhang
Ge-Peng Ji
Zhuo Wang
Keren Fu
Qijun Zhao

RGB-D salient object detection (SOD) has recently attracted increasing research interest
by benefiting conventional RGB SOD with extra depth information. However, existing
RGB-D SOD models often fail to perform well in terms of both efficiency and accuracy,
which hinders their potential applications on mobile devices and real-world problems.
An underlying challenge is that the model accuracy usually degrades when the model
is simplified to have few parameters. To tackle this dilemma and also inspired by
the fact that depth quality is a key factor influencing the accuracy, we propose a
novel depth quality-inspired feature manipulation (DQFM) process, which is efficient
itself and can serve as a gating mechanism for filtering depth features to greatly
boost the accuracy. DQFM resorts to the alignment of low-level RGB and depth features,
as well as holistic attention of the depth stream to explicitly control and enhance
cross-modal fusion. We embed DQFM to obtain an efficient light-weight model called
DFM-Net, where we also design a tailored depth backbone and a two-stage decoder for
further efficiency consideration. Extensive experimental results demonstrate that
our DFM-Net achieves state-of-the-art accuracy compared with existing non-efficient
models, while running at 140ms on CPU (2.2x faster than the prior fastest efficient
model) with only ~8.5Mb model size (14.9% of the prior lightest). Our code will be
available at https://github.com/zwbx/DFM-Net.

Revisiting Mid-Level Patterns for Cross-Domain Few-Shot Recognition

Yixiong Zou
Shanghang Zhang
Jianpeng Yu
Yonghong Tian
José M. F. Moura

Existing few-shot learning (FSL) methods usually assume base classes and novel classes
are from the same domain (in-domain setting). However, in practice, it may be infeasible
to collect sufficient training samples for some special domains to construct base
classes. To solve this problem, cross-domain FSL (CDFSL) was proposed very recently
to transfer knowledge from general-domain base classes to special-domain novel classes.
Existing CDFSL works mostly focus on transferring between near domains and rarely
consider transferring between distant domains, which is needed in practice, as any novel
classes could appear in real-world applications, and is even more challenging. In
this paper, we study a challenging subset of CDFSL where the novel classes are in
distant domains from base classes, by revisiting the mid-level features, which are
more transferable yet under-explored in mainstream FSL work. To boost the discriminability
of mid-level features, we propose a residual-prediction task to encourage mid-level
features to learn discriminative information of each sample. Notably, such a mechanism
also benefits the in-domain FSL and CDFSL in near domains. Therefore, we provide two
types of features for both cross- and in-domain FSL respectively, under the same training
framework. Experiments under both settings on six public datasets, including two challenging
medical datasets, validate our rationale and demonstrate state-of-the-art performance.
Code will be released.

Space-Angle Super-Resolution for Multi-View Images

Yuqi Sun
Ri Cheng
Bo Yan
Shili Zhou

The limited spatial and angular resolutions in multi-view multimedia applications
restrict their visual experience in practical use. In this paper, we first raise the
space-angle super-resolution (SASR) problem for irregularly arranged multi-view images.
It aims to increase the spatial resolution of source views and synthesize arbitrary
virtual high resolution (HR) views between them jointly. One feasible solution is
to perform super-resolution (SR) and view synthesis (VS) methods separately. However,
it cannot fully exploit the intra-relationship between SR and VS tasks. Intuitively,
multi-view images can provide more angular references, and higher resolution can provide
more high-frequency details. Therefore, we propose a one-stage space-angle super-resolution
network called SASRnet, which simultaneously synthesizes real and virtual HR views.
Extensive experiments on several benchmarks demonstrate that our proposed method outperforms
two-stage methods and show that SR and VS can promote each other. To our knowledge,
this work is the first to address the SASR problem for unstructured multi-view images
in an end-to-end learning-based manner.

Weakly-Supervised Video Object Grounding via Stable Context Learning

Wei Wang
Junyu Gao
Changsheng Xu

We investigate the problem of weakly-supervised video object grounding (WSVOG), where
only the video-sentence annotations are provided for training. It aims at localizing
the queried objects described in the sentence to visual regions in the video. Despite
the recent progress, existing approaches have not fully exploited the potential of
the description sentences for cross-modal alignment in two aspects: (1) Most of them
extract objects from the description sentences and represent them with fixed textual
representations. While achieving promising results, they do not make full use of the
contextual information in the sentence. (2) A few works have attempted to utilize
contextual information to learn object representations, but found a significant decrease
in performance due to the unstable training in cross-modal alignment. To address the
above issues, in this paper, we propose a Stable Context Learning (SCL) framework
for WSVOG which jointly enjoys the merits of stable learning and rich contextual information.
Specifically, we design two modules, the Context-Aware Object Stabilizer module and the
Cross-Modal Alignment Knowledge Transfer module, which cooperate to
inject contextual information into stable object concepts in the text modality and transfer
contextualized knowledge in cross-modal alignment. Our approach is finally optimized
under a frame-level MIL paradigm. Extensive experiments on three popular benchmarks
demonstrate its significant effectiveness.
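
As an illustration of what a frame-level MIL (multiple instance learning) objective can look like, the sketch below treats the regions of each frame as a bag and scores a video-sentence pair by the best-matching region per queried object. The function names, the averaging scheme, and the margin ranking loss are assumptions made for this example; the abstract does not specify the exact formulation.

```python
def frame_level_mil_score(similarity):
    """Score one video-sentence pair.

    similarity[t][q][r] is an (assumed, precomputed) similarity between
    queried object q and region proposal r in frame t.
    """
    frame_scores = []
    for frame in similarity:
        # MIL step: for each queried object keep only its best region in this frame.
        best_per_query = [max(region_scores) for region_scores in frame]
        frame_scores.append(sum(best_per_query) / len(best_per_query))
    # Average the per-frame bag scores into one video-level matching score.
    return sum(frame_scores) / len(frame_scores)


def ranking_loss(pos_score, neg_score, margin=0.2):
    """Hinge loss that pushes a matched pair above a mismatched pair (illustrative choice)."""
    return max(0.0, margin - pos_score + neg_score)


# Two frames, two queried objects, two regions per frame (toy numbers).
toy_similarity = [
    [[0.1, 0.9], [0.4, 0.2]],
    [[0.5, 0.3], [0.8, 0.6]],
]
toy_score = frame_level_mil_score(toy_similarity)
```

Only video-sentence pairs are needed to compute this loss, which is what makes the setting weakly supervised: no region-level labels enter the objective.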

Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning

Yukun Su
Guosheng Lin
Ruizhou Sun
Yun Hao
Qingyao Wu

Self-supervised learning (SSL) has proven very effective in learning representations
from unlabeled data in language and vision domains. Yet, very few instrumental self-supervised
approaches exist for 3D skeleton action understanding, and directly applying the existing
SSL methods from other domains for skeleton action learning may suffer from misalignment
of representations and some limitations. In this paper, we consider that a good representation
learning encoder can distinguish the underlying features of different actions, which
can pull similar motions closer while pushing dissimilar motions away. There
exist, however, some uncertainties in the skeleton actions due to the inherent ambiguity
of 3D skeleton pose in different viewpoints or the sampling algorithm in contrastive
learning; thus, it is ill-posed to differentiate the action features in the deterministic
embedding space. To address these issues, we rethink the distance between action features
and propose to model each action representation into the probabilistic embedding space
to alleviate the uncertainties upon encountering the ambiguous 3D skeleton inputs.
To validate the effectiveness of the proposed method, extensive experiments are conducted
on Kinetics, NTU60, NTU120, and PKUMMD datasets with several alternative network architectures.
Experimental evaluations demonstrate the superiority of our approach, through which
we gain significant performance improvements without using extra labeled data.
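
To make the idea of a probabilistic embedding concrete, the sketch below represents each action as a diagonal Gaussian (a mean vector plus a per-dimension standard deviation) and compares two actions with the squared 2-Wasserstein distance, which has a simple closed form for diagonal Gaussians. The choice of this particular distance is an assumption for illustration; the abstract does not state which distance the authors use.

```python
def wasserstein2_sq(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances the closed form is
    ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term


# Two embeddings with the same mean but different uncertainty are still
# separated, which a deterministic (mean-only) embedding cannot express.
same_mean_distance = wasserstein2_sq([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 3.0])
```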

D³Net: Dual-Branch Disturbance Disentangling Network for Facial Expression Recognition

Rongyun Mo
Yan Yan
Jing-Hao Xue
Si Chen
Hanzi Wang

One of the main challenges in facial expression recognition (FER) is to address the
disturbance caused by various disturbing factors, including common ones (such as identity,
pose, and illumination) and potential ones (such as hairstyle, accessory, and occlusion).
Recently, a number of FER methods have been developed to explicitly or implicitly
alleviate the disturbance involved in facial images. However, these methods either
consider only a few common disturbing factors or neglect the prior information of
these disturbing factors, thus resulting in inferior recognition performance. In this
paper, we propose a novel Dual-branch Disturbance Disentangling Network (D3Net), mainly
consisting of an expression branch and a disturbance branch, to perform effective
FER. In the disturbance branch, a label-aware sub-branch (LAS) and a label-free sub-branch
(LFS) are elaborately designed to cope with different types of disturbing factors.
On the one hand, LAS explicitly captures the disturbance due to some common disturbing
factors by transfer learning on a pretrained model. On the other hand, LFS implicitly
encodes the information of potential disturbing factors in an unsupervised manner.
In particular, we introduce an Indian buffet process (IBP) prior to model the distribution
of potential disturbing factors in LFS. Moreover, we leverage adversarial training
to increase the differences between disturbance features and expression features,
thereby enhancing the disentanglement of disturbing factors. By disentangling the
disturbance from facial images, we are able to extract discriminative expression features.
Extensive experiments demonstrate that our proposed method performs favorably against
several state-of-the-art FER methods on both in-the-lab and in-the-wild databases.

Towards a Unified Middle Modality Learning for Visible-Infrared Person Re-Identification

Yukang Zhang
Yan Yan
Yang Lu
Hanzi Wang

Visible-infrared person re-identification (VI-ReID) aims to search identities of pedestrians
across different spectra. In this task, one of the major challenges is the modality
discrepancy between the visible (VIS) and infrared (IR) images. Some state-of-the-art
methods try to design complex networks or generative methods to mitigate the modality
discrepancy while ignoring the highly non-linear relationship between the two modalities
of VIS and IR. In this paper, we propose a non-linear middle modality generator (MMG),
which helps to reduce the modality discrepancy. Our MMG can effectively project VIS
and IR images into a unified middle modality image (UMMI) space to generate middle-modality
(M-modality) images. The generated M-modality images and the original images are fed
into the backbone network to reduce the modality discrepancy. Furthermore, in order
to pull together the two types of M-modality images generated from the VIS and IR
images in the UMMI space, we propose a distribution consistency loss (DCL) to make
the modality distribution of the generated M-modality images as consistent as possible.
Finally, we propose a middle modality network (MMN) to further enhance the discrimination
and richness of features in an explicit manner. Extensive experiments have been conducted
to validate the superiority of MMN for VI-ReID over some state-of-the-art methods
on two challenging datasets. The gain of MMN is more than 11.1% and 8.4% in terms
of Rank-1 and mAP, respectively, even compared with the latest state-of-the-art methods
on the SYSU-MM01 dataset.

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal
Knowledge Integration

Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun Yu

Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful attempts have been proposed,
learning fine-grained semantic alignments between image-text pairs plays a key role
in their approaches. Nevertheless, most existing VLP approaches have not fully utilized
the intrinsic knowledge within the image-text pairs, which limits the effectiveness
of the learned alignments and further restricts the performance of their models. To
this end, we introduce a new VLP method called ROSITA, which integrates the cross-
and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
Specifically, we introduce a novel structural knowledge masking (SKM) strategy to
use the scene graph structure as a prior to perform masked language (region) modeling,
which enhances the semantic alignments by eliminating the interference information
within and across modalities. Extensive ablation studies and comprehensive analysis
verify the effectiveness of ROSITA in semantic alignments. Pretrained with both
in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art
VLP methods on three typical vision-and-language tasks over six benchmark datasets.

Object Point Cloud Classification via Poly-Convolutional Architecture Search

Xuanxiang Lin
Ke Chen
Kui Jia

Existing point cloud classifiers focus on handling irregular data structures to
discover a global and discriminative configuration of local geometries. These classification
methods design a number of effective permutation-invariant feature encoding kernels,
but still suffer from the intrinsic challenge of large geometric feature variations
caused by inconsistent point distributions along the object surface. In this paper, we address
point cloud classification via deep graph representation learning that aggregates
multiple convolutional feature kernels (namely, a poly-convolutional operation) anchored
on each point with its local neighbours. Inspired by the recent success of neural architecture
search, we introduce a novel concept of poly-convolutional architecture search (PolyConv
search for short) to model local geometric patterns in a more flexible manner.

To this end, we adopt the Monte Carlo Tree Search (MCTS) method, with the search formulated
as a Markov Decision Process problem in which decisions are made to dependently select
layer-wise aggregation kernels. Experiments on the popular ModelNet40 benchmark have
verified that superior performance can be achieved by constructing networks via the
MCTS method, with aggregation kernels in our PolyConv search space.

Semantic-Guided Relation Propagation Network for Few-shot Action Recognition

Xiao Wang
Weirong Ye
Zhongang Qi
Xun Zhao
Guangge Wang
Ying Shan
Hanzi Wang

Few-shot action recognition has drawn growing attention as it can recognize novel
action classes by using only a few labeled samples. In this paper, we propose a novel
semantic-guided relation propagation network (SRPN), which leverages semantic information
together with visual information for few-shot action recognition. Different from most
previous works that neglect semantic information in the labeled data, our SRPN directly
utilizes the semantic label as an additional supervisory signal to improve the generalization
ability of the network. Besides, we treat the relation of each visual-semantic pair
as a relational node, and we use a graph convolutional network to model and propagate
such sample relations across visual-semantic pairs, including both intra-class commonality
and inter-class uniqueness, to guide the relation propagation in the graph. Moreover,
since videos contain crucial sequential and ordering information, we propose a novel
spatial-temporal difference module, which helps the network enhance its
visual feature learning ability at both feature level and granular level for videos.
Extensive experiments conducted on several challenging benchmarks demonstrate that
our SRPN outperforms several state-of-the-art methods by a significant margin.

Anti-Distillation Backdoor Attacks: Backdoors Can Really Survive in Knowledge Distillation

Yunjie Ge
Qian Wang
Baolin Zheng
Xinlu Zhuang
Qi Li
Chao Shen
Cong Wang

Motivated by resource-limited scenarios, knowledge distillation (KD) has received
growing attention, effectively and quickly producing lightweight yet high-performance
student models by transferring the dark knowledge from large teacher models. However,
many pre-trained teacher models are downloaded from public platforms that lack necessary
vetting, posing a possible threat to knowledge distillation tasks. Unfortunately,
thus far, there has been little research to consider the backdoor attack from the
teacher model into student models in KD, which may pose a severe threat to its wide
use. In this paper, we, for the first time, propose a novel Anti-Distillation Backdoor
Attack (ADBA), in which the backdoor embedded in the public teacher model can survive
the knowledge distillation process and thus be transferred to secret distilled student
models. We first introduce a shadow model to imitate the distillation process and adopt
an optimizable trigger to transfer information to help craft the desired teacher model.
Our attack is powerful and effective, achieving 95.92%, 94.79%, and 90.19% average
success rates of attacks (SRoAs) against student models with several different structures
on MNIST, CIFAR-10, and GTSRB, respectively. Our ADBA also performs robustly under
different user distillation environments with 91.72% and 92.37% average SRoAs on MNIST
and CIFAR-10, respectively. Finally, we show that the ADBA has a low overhead in the
injecting process, which converges in 50 and 70 epochs on CIFAR-10 and GTSRB, respectively,
while the normal training epochs of these datasets are almost 200.
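
For readers unfamiliar with the distillation process that the backdoor must survive, the sketch below shows the standard temperature-softened knowledge distillation loss: the KL divergence between teacher and student softmax outputs, scaled by the squared temperature. This is the generic KD objective and not the ADBA attack itself; the temperature value and function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


# The loss is zero when the student already reproduces the teacher's logits.
matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

Because the student only ever sees the teacher's soft outputs on clean data, a backdoor that does not shape those outputs would normally be lost; this is the obstacle the abstract refers to.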

One-stage Context and Identity Hallucination Network

Yinglu Liu
Mingcan Xiang
Hailin Shi
Tao Mei

Face swapping aims to synthesize a face image, in which the facial identity is well
transplanted from the source image and the context (e.g., hairstyle, head posture,
facial expression, lighting, and background) remains consistent with the reference image.
The prior work mainly accomplishes the task in two stages, i.e., generating the inner
face with the source identity, and then stitching the generation with the complementary
part of the reference image by image blending techniques. The blending mask, which
is usually obtained by an additional face segmentation model, is a common practice
towards photo-realistic face swapping. However, artifacts usually appear at the blending
boundary, especially in areas occluded by the hair, eyeglasses, accessories, etc.
To address this problem, rather than struggling with the blending mask in the two-stage
routine, we develop a novel one-stage context and identity hallucination network,
which learns a series of hallucination maps to softly divide the context areas and
identity areas. For context areas, the features are fully utilized by a multi-level
context encoder. For identity areas, we design a novel two-cascading AdaIN to transfer
the identity while retaining the context. Besides, with the help of hallucination
maps, we introduce an effectively improved reconstruction loss to utilize unlimited
unpaired face images for training. Our network performs well on both context areas
and identity areas without any dependency on post-processing. Extensive qualitative
and quantitative experiments demonstrate the superiority of our network.
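
The identity transfer above builds on adaptive instance normalization (AdaIN). As a minimal sketch, a single plain AdaIN step on one feature channel normalizes the content statistics and then applies a target mean and standard deviation; in a face-swapping setting those targets would typically be predicted from an identity embedding, which is an assumption here. The two-cascading variant proposed in the paper is not reproduced.

```python
import math


def adain(content, style_mean, style_std, eps=1e-5):
    """Plain AdaIN on one channel: whiten the content, then re-color with target stats."""
    n = len(content)
    mean = sum(content) / n
    var = sum((v - mean) ** 2 for v in content) / n
    std = math.sqrt(var + eps)  # eps guards against division by zero
    return [style_std * (v - mean) / std + style_mean for v in content]


# Flattened toy channel re-styled to mean 10 and standard deviation 2.
styled = adain([1.0, 2.0, 3.0, 4.0], style_mean=10.0, style_std=2.0)
```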

Mitigating Generation Shifts for Generalized Zero-Shot Learning

Zhi Chen
Yadan Luo
Sen Wang
Ruihong Qiu
Jingjing Li
Zi Huang

Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information
to recognize seen and unseen samples, where unseen classes are not observable during
training. It is natural to derive generative models and hallucinate training samples
for unseen classes based on the knowledge learned from the seen samples. However,
most of these models suffer from generation shifts, where the synthesized samples
may drift from the real distribution of unseen data. In this paper, we propose a novel
generative flow framework that consists of multiple conditional affine coupling layers
for learning unseen data generation. In particular, we identify three potential problems
that trigger the generation shifts, i.e., semantic inconsistency, variance collapse,
and structure disorder, and address each in turn. First, to reinforce the correlations
between the generated samples and their corresponding attributes, we explicitly embed
the semantic information into the transformations in each coupling layer. Second,
to recover the intrinsic variance of the real unseen features, we introduce a visual
perturbation strategy to diversify the generated data and thereby help adjust the decision
boundary of the classifiers. Third, a relative positioning strategy is proposed to
revise the attribute embeddings, guiding them to fully preserve the inter-class geometric
structure and further avoid structure disorder in the semantic space. Experimental
results demonstrate that the proposed framework, GSMFlow, achieves state-of-the-art performance on GZSL.
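
A conditional affine coupling layer, the building block named above, can be sketched as follows: half of the input passes through unchanged and, together with the condition (for example, class attributes), determines a scale and shift applied to the other half, so the transform is exactly invertible. The `toy_scale_shift` function is a hypothetical stand-in for a learned network, and an even input length is assumed for simplicity.

```python
import math


def coupling_forward(x, cond, scale_shift_fn):
    """y1 = x1; y2 = x2 * exp(log_s) + t, with (log_s, t) computed from (x1, cond)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = scale_shift_fn(x1, cond)
    y2 = [v * math.exp(ls) + b for v, ls, b in zip(x2, log_s, t)]
    log_det = sum(log_s)  # log |det Jacobian| of the transform
    return x1 + y2, log_det


def coupling_inverse(y, cond, scale_shift_fn):
    """Exact inverse: recompute (log_s, t) from the untouched half and undo the affine map."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = scale_shift_fn(y1, cond)
    x2 = [(v - b) * math.exp(-ls) for v, ls, b in zip(y2, log_s, t)]
    return y1 + x2


def toy_scale_shift(x1, cond):
    """Hypothetical conditioner; a real model would use a small neural network."""
    context = sum(x1) + sum(cond)
    return [0.1 * context] * len(x1), [context] * len(x1)


features = [1.0, 2.0, 3.0, 4.0]
attributes = [0.5]
encoded, log_det = coupling_forward(features, attributes, toy_scale_shift)
decoded = coupling_inverse(encoded, attributes, toy_scale_shift)
```

Feeding the semantic condition into every coupling layer is one way to read the abstract's statement that semantic information is embedded into the transformations.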

Weakly-Supervised Temporal Action Localization via Cross-Stream Collaborative Learning

Yuan Ji
Xu Jia
Huchuan Lu
Xiang Ruan

Weakly supervised temporal action localization (WTAL) is a challenging task as only
video-level category labels are available during the training stage. Without precise temporal
annotations, most approaches rely on complementary RGB and optical flow features to
predict the start and end frame of each action category in a video. However, existing
approaches simply resort to either concatenation or weighted sum to learn how to take
advantage of these two modalities for accurate action localization, which ignores
the substantial variance between the two modalities. In this paper, we present Cross-Stream
Collaborative Learning (CSCL) to address these issues. The proposed CSCL introduces
a cross-stream weighting module to identify which modality is more robust during training
and takes advantage of the robust modality to guide the weaker one. Furthermore, we
suppress the snippets that have high action-ness scores in both modalities to further
exploit the complementary property between the two modalities. In addition, we bring
the concept of co-training to WTAL and take both modalities into account for pseudo
label generation to help train a stronger model. Extensive experiments conducted
on the THUMOS14 and ActivityNet datasets demonstrate that CSCL achieves favorable performance
against state-of-the-art methods.

Deep Interactive Video Inpainting: An Invisibility Cloak for Harry Potter

Cheng Chen
Jiayin Cai
Yao Hu
Xu Tang
Xinggang Wang
Chun Yuan
Xiang Bai
Song Bai

In this paper, we propose a new task of deep interactive video inpainting and an application
for users to interact with machines. To the best of our knowledge, this is the first deep
learning-based interactive video inpainting framework that only uses a free form of
user input as guidance (i.e., scribbles) instead of mask annotations, which has academic,
entertainment, and commercial value.

With users' scribbles on a certain frame, it simultaneously performs interactive video
object segmentation and video inpainting throughout the whole video. To achieve this,
we utilize a shared spatial-temporal memory module, which combines both segmentation
and inpainting into an end-to-end pipeline. In our framework, the past frames with
object masks (either the users' scribbles or the predicted masks) constitute an external
memory, and the current frame as the query is segmented and inpainted by reading the
visual cues stored in that memory. Furthermore, our method allows users to iteratively
refine the segmentation results, which effectively improves the inpainting performance
on frames where the segmentation results are inferior. Hence, one could obtain
high-quality video inpainting results even with challenging video sequences. Qualitative
and quantitative experimental results demonstrate the superiority of our approach.

Searching Motion Graphs for Human Motion Synthesis

Chenchen Liu
Yadong Mu

This work proposes a graph search based method for human motion sequence synthesis,
complementing the modern generative model (e.g., variational auto-encoder or Gaussian
process) based solutions that currently dominate this task and showing strong advantages
in several aspects. The cornerstone of our method is a novel representation which
we dub the motion graph. Each motion graph is scaffolded by a set of realistic human
motion sequences (e.g., all training data in the Human3.6M benchmark). We devise a
scheme that adds transition edges across different motion sequences, enabling
longer and more diverse routes in the motion graph. Crucially, the proposed motion graph
bridges the problem of human motion synthesis with graph-oriented combinatorial optimization,
by naturally treating pre-specified starting or ending pose in human pose synthesis
as end-points of the retrieved graph path. Based on a jump-sensitive graph path search
algorithm proposed in this paper, our model can efficiently solve human motion completion
over the motion graphs. In contrast, existing methods are mainly effective for human
motion prediction and inadequate to impute missing sequences while jointly satisfying
the two constraints of pre-specified starting / ending poses. For the case of only
specifying the starting pose (i.e., human motion prediction), a forward graph walking
from the starting node is first performed to sample a diverse set of ending nodes
on the motion graph, each of which defines a motion completion problem. We conduct
comprehensive experiments on two large-scale benchmarks (Human3.6M and HumanEva-I).
The proposed method clearly proves to be superior in terms of several metrics, including
the diversity of generated human motion sequences, affinity to real poses, and cross-scenario
generalization.
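
To illustrate how motion completion reduces to a graph path search once start and end poses are nodes of a motion graph, the sketch below runs Dijkstra's algorithm with an extra penalty on transition ("jump") edges that cross between recorded sequences. This is a generic stand-in, not the jump-sensitive algorithm proposed in the paper; the graph encoding, node names, and the `jump_penalty` parameter are assumptions.

```python
import heapq


def shortest_motion_path(graph, start, goal, jump_penalty=1.0):
    """graph[node] -> list of (neighbor, cost, is_jump); returns (path, total_cost)."""
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            path = []
            while node is not None:  # walk parents back to the start pose
                path.append(node)
                node = parent[node]
            return path[::-1], cost
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, edge_cost, is_jump in graph.get(node, []):
            # Jumps between sequences cost extra so smooth in-sequence motion is preferred.
            new_cost = cost + edge_cost + (jump_penalty if is_jump else 0.0)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return None, float("inf")


# Sequence A: a0 -> a1 -> a2, plus a jump a0 -> b1 into sequence B that rejoins at a2.
toy_graph = {
    "a0": [("a1", 1.0, False), ("b1", 0.5, True)],
    "a1": [("a2", 1.0, False)],
    "b1": [("a2", 0.2, False)],
}
path_cheap_jump, cost_cheap_jump = shortest_motion_path(toy_graph, "a0", "a2", jump_penalty=1.0)
path_costly_jump, cost_costly_jump = shortest_motion_path(toy_graph, "a0", "a2", jump_penalty=2.0)
```

Raising the penalty switches the retrieved path from the route that jumps through sequence B to the one that stays inside sequence A.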

When Video Classification Meets Incremental Classes

Hanbin Zhao
Xin Qin
Shihao Su
Yongjian Fu
Zibo Lin
Xi Li

With the rapid development of social media, a tremendous number of videos with new classes are
generated daily, which raises an urgent demand for video classification methods that
can be continuously updated with new classes while maintaining the knowledge of old videos
with limited storage and computing resources. In this paper, we summarize this task
as Class-Incremental Video Classification (CIVC) and propose a novel framework to
address it. As a subarea of incremental learning tasks, the challenge of catastrophic
forgetting is unavoidable in CIVC. To better alleviate it, we utilize some characteristics
of videos. First, we decompose the spatio-temporal knowledge before distillation rather
than treating it as a whole in the knowledge transfer process; trajectory is also
used to refine the decomposition. Second, we propose a dual granularity exemplar selection
method to select and store representative video instances of old classes and key-frames
inside videos under a tight storage budget. We benchmark our method and previous SOTA
class-incremental learning methods on Something-Something V2 and Kinetics datasets,
and our method outperforms previous methods significantly.

Fast and Accurate Lane Detection via Frequency Domain Learning

Yulin He
Wei Chen
Zhengfa Liang
Dan Chen
Yusong Tan
Xin Luo
Chen Li
Yulan Guo

It is desirable to maintain both high accuracy and runtime efficiency in lane detection.
State-of-the-art methods mainly address the efficiency problem by direct compression
of high-dimensional features. These methods usually suffer from information loss and
cannot achieve satisfactory accuracy. To ensure the diversity of features
and subsequently retain as much information as possible, we introduce multi-frequency
analysis into lane detection. Specifically, we propose a multi-spectral feature compressor
(MSFC) based on two-dimensional (2D) discrete cosine transform (DCT) to compress features
while preserving diversity information. We group features and associate each group
with an individual frequency component, which incurs only 1/7 of the overhead of a one-dimensional
convolution operation but preserves more information. Moreover, to further enhance
the discriminability of features, we design a multi-spectral lane feature aggregator
(MSFA) based on one-dimensional (1D) DCT to aggregate features from each lane according
to their corresponding frequency components. The proposed method outperforms the state-of-the-art
methods (including LaneATT and UFLD) on TuSimple, CULane, and LLAMAS benchmarks. For
example, our method achieves 76.32% F1 at 237 FPS and 76.98% F1 at 164 FPS on CULane,
which are 1.23% and 0.30% higher than LaneATT. Our code and models are available at
https://github.com/harrylin-hyl/MSLD.
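
The idea of compressing grouped features with individual 2D DCT frequency components can be sketched as follows: each channel's spatial map is reduced to a single number by projecting it onto one (unnormalized) DCT-II basis function, and channels in different groups use different frequencies. The even split into groups and the frequency assignment are assumptions for illustration; the released code linked above is the authoritative reference.

```python
import math


def dct2_coefficient(feature_map, u, v):
    """Project an H x W map onto the (u, v) 2D DCT-II basis function (no normalization)."""
    h, w = len(feature_map), len(feature_map[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            basis = math.cos(math.pi * (i + 0.5) * u / h) * math.cos(math.pi * (j + 0.5) * v / w)
            total += feature_map[i][j] * basis
    return total


def compress_channels(features, frequencies):
    """Reduce each channel to one scalar; channels are split evenly into len(frequencies) groups."""
    group_size = max(1, len(features) // len(frequencies))
    compressed = []
    for c, fmap in enumerate(features):
        u, v = frequencies[min(c // group_size, len(frequencies) - 1)]
        compressed.append(dct2_coefficient(fmap, u, v))
    return compressed


# With frequency (0, 0) the projection equals a plain sum (global pooling up to scale);
# higher frequencies respond to spatial variation that pooling would discard.
flat = [[1.0, 1.0], [1.0, 1.0]]
ramp = [[0.0, 1.0], [0.0, 1.0]]
codes = compress_channels([flat, flat, ramp, ramp], [(0, 0), (0, 1)])
```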

Learning Multi-context Aware Location Representations from Large-scale Geotagged Images

Yifang Yin
Ying Zhang
Zhenguang Liu
Yuxuan Liang
Sheng Wang
Rajiv Ratn Shah
Roger Zimmermann

With the ubiquity of sensor-equipped smartphones, it is common to have multimedia
documents uploaded to the Internet that have GPS coordinates associated with them.
Utilizing such geotags as an additional feature is intuitively appealing for improving
the performance of location-aware applications. However, raw GPS coordinates are fine-grained
location indicators without any semantic information. Existing methods on geotag semantic
encoding mostly extract hand-crafted, application-specific location representations
that heavily depend on large-scale supplementary data and thus cannot perform efficiently
on mobile devices. In this paper, we present a machine learning based approach, termed
GPS2Vec+, which learns rich location representations by capitalizing on worldwide
geotagged images. Once trained, the model no longer depends on the auxiliary data,
so it encodes geotags highly efficiently at inference time. We extract visual and
semantic knowledge from image content and user-generated tags, and transfer the information
into locations by using geotagged images as a bridge. To adapt to different application
domains, we further present an attention-based fusion framework that estimates the
importance of the learnt location representations under different contexts for effective
feature fusion. Our location representations yield significant performance improvements
over the state-of-the-art geotag encoding methods on image classification and venue
annotation.

MV-TON: Memory-based Video Virtual Try-on network

Xiaojing Zhong
Zhonghua Wu
Taizhe Tan
Guosheng Lin
Qingyao Wu

With the development of Generative Adversarial Networks, image-based virtual try-on
methods have made great progress. However, limited work has explored the task of video-based
virtual try-on, although it is important in real-world applications. Most existing video-based
virtual try-on methods require clothing templates and can only generate
blurred and low-resolution results. To address these challenges, we propose a Memory-based
Video virtual Try-On Network (MV-TON), which seamlessly transfers desired clothes
to a target person without using any clothing templates and generates high-resolution
realistic videos. Specifically, MV-TON consists of two modules: 1) a try-on module
that transfers the desired clothes from model images to frame images by pose alignment
and region-wise replacing of pixels; 2) a memory refinement module that learns to
embed the existing generated frames into the latent space as external memory for the
following frame generation. Experimental results show the effectiveness of our method
in the video virtual try-on task and its superiority over other existing methods.

Token Shift Transformer for Video Classification

Hao Zhang
Yanbin Hao
Chong-Wah Ngo

Transformers achieve remarkable success in understanding 1- and 2-dimensional signals
(e.g., NLP and Image Content Understanding). As a potential alternative to convolutional
neural networks, it shares merits of strong interpretability, high discriminative
power on hyper-scale data, and flexibility in processing varying length inputs. However,
its encoders naturally contain computationally intensive operations such as pair-wise
self-attention, incurring a heavy computational burden when applied to complex
3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift),
a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within
each transformer encoder. Specifically, TokShift merely temporally shifts part of the
[Class] token features back and forth across adjacent frames. Then, we densely plug
the module into each encoder of a plain 2D vision transformer for learning 3D video
representation. It is worth noting that our TokShift transformer is a purely convolution-free
video transformer pilot with computational efficiency for video understanding. Experiments
on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly,
with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision:
79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets,
comparable to or better than existing SOTA convolutional counterparts. Our code is open-sourced
at: https://github.com/VideoNetworks/TokShift-Transformer.
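
The zero-parameter, zero-FLOPs shift can be sketched on per-frame [Class] token vectors: one slice of channels is taken from the next frame, another slice from the previous frame, and the rest stay in place, with zeros at the clip boundaries. The shifted fraction (`fold_div`), the direction convention, and the zero padding are assumptions borrowed from common temporal-shift practice; the linked repository is the authoritative implementation.

```python
def token_shift(cls_tokens, fold_div=4):
    """cls_tokens[t][c]: [Class] token of frame t. Shifts C // fold_div channels each way in time."""
    num_frames = len(cls_tokens)
    num_channels = len(cls_tokens[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first slice: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second slice: read from the previous frame
            else:
                src = t      # remaining channels are left untouched
            if 0 <= src < num_frames:
                shifted[t][c] = cls_tokens[src][c]  # out-of-range sources stay zero
    return shifted


# Three frames, four channels; frame t carries the constant value t + 1.
toy_tokens = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
toy_shifted = token_shift(toy_tokens, fold_div=4)
```

The operation only moves existing values, so it adds no learnable parameters and no multiply-accumulate cost, yet each frame's token now carries information from its temporal neighbors.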

Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation

Rui Wang
Jian Chen
Gang Yu
Li Sun
Changqian Yu
Changxin Gao
Nong Sang

Image manipulation with StyleGAN has attracted increasing attention in recent years. Recent
works have achieved tremendous success in analyzing several semantic latent spaces
to edit the attributes of the generated images. However, due to the limited semantic
and spatial manipulation precision in these latent spaces, existing approaches
fall short in fine-grained StyleGAN image manipulation, i.e., local attribute translation.
To address this issue, we discover attribute-specific control units, which consist
of multiple channels of feature maps and modulation styles. Specifically, we collaboratively
manipulate the modulation style channels and feature maps in control units rather
than individual ones to obtain the semantic and spatial disentangled controls. Furthermore,
we propose a simple yet effective method to detect the attribute-specific control
units. We move the modulation style along a specific sparse direction vector and replace
the filter-wise styles used to compute the feature maps to manipulate these control
units. We evaluate our proposed method in various face attribute manipulation tasks.
Extensive qualitative and quantitative results demonstrate that our proposed method
performs favorably against the state-of-the-art methods. The manipulation results
of real images further show the effectiveness of our method.

Attention-driven Graph Clustering Network

Zhihao Peng
Hui Liu
Yuheng Jia
Junhui Hou

The combination of the traditional convolutional network (i.e., an auto-encoder) and
the graph convolutional network has attracted much attention in clustering, in which
the auto-encoder extracts the node attribute feature and the graph convolutional network
captures the topological graph feature. However, the existing works (i) lack a flexible
combination mechanism to adaptively fuse those two kinds of features for learning
the discriminative representation and (ii) overlook the multi-scale information embedded
at different layers for subsequent cluster assignment, leading to inferior clustering
results. To this end, we propose a novel deep clustering method named Attention-driven
Graph Clustering Network (AGCN). Specifically, AGCN exploits a heterogeneity-wise
fusion module to dynamically fuse the node attribute feature and the topological graph
feature. Moreover, AGCN develops a scale-wise fusion module to adaptively aggregate
the multi-scale features embedded at different layers. Based on a unified optimization
framework, AGCN can jointly perform feature learning and cluster assignment in an
unsupervised fashion. Compared with the existing deep clustering methods, our method
is more flexible and effective since it comprehensively considers the abundant and
discriminative information embedded in the network and directly produces the clustering
results. Extensive quantitative and qualitative results on commonly used benchmark
datasets validate that our AGCN consistently outperforms state-of-the-art methods.

Lifting the Veil of Frequency in Joint Segmentation and Depth Estimation

Tianhao Fu
Yingying Li
Xiaoqing Ye
Xiao Tan
Hao Sun
Fumin Shen
Errui Ding

Joint learning of scene parsing and depth estimation remains a challenging task due
to the rivalry between the two tasks. In this paper, we revisit the mutual enhancement
for joint semantic segmentation and depth estimation. Inspired by the observation
that the competition and cooperation could be reflected in the feature frequency components
of different tasks, we propose a Frequency Aware Feature Enhancement (FAFE) network
that can effectively enhance the reciprocal relationship while avoiding the competition.
In FAFE, a frequency disentanglement module is proposed to fetch the favorable frequency
component sets for each task and resolve the discordance between the two tasks. For
task cooperation, we introduce a re-calibration unit to aggregate features of the
two tasks, so as to complement task information with each other. Accordingly, the
learning of each task can be boosted by the complementary task appropriately. Besides,
a novel local-aware consistency loss function is imposed on the predicted
segmentation and depth so as to strengthen the cooperation. With the FAFE network
and new local-aware consistency loss encapsulated into the multi-task learning network,
the proposed approach achieves superior performance over previous state-of-the-art
methods. Extensive experiments and ablation studies on multi-task datasets demonstrate
the effectiveness of our proposed approach.

SESSION: Panel 1

The Next Generation Multimodal Conversational Search and Recommendation

Joao Magalhaes
Tat-Seng Chua
Tao Mei
Alan Smeaton

The world has become multimodal. In addition to text, we have been sharing a huge
amount of multimedia information in the form of images and videos on the Internet.
The widespread use of smart mobile devices has also changed the way we interact with
the Internet. It is now natural for us to capture images and videos freely and use
them as part of a query, in addition to traditional text and voice. These, along with
the rapid advancements in multimedia, natural language processing, information retrieval,
and conversation technologies, mean that it is time for us to explore multimodal conversation
and its roles in search and recommendation. Multimodal conversation has the potential
to help us to uncover and digest the huge amount of multimedia information and knowledge
hidden within many systems. It also enables natural two-way interaction between humans
and machines, with mutual benefits in enriching their respective knowledge. Finally,
it opens up the possibilities of disrupting many existing applications and launching
new innovative applications. This panel is timely and aims to explore this emerging
trend, and discuss its potential benefits and pitfalls to society. The panel will
also explore the limitations of current technologies and highlight future research
directions towards developing a multimedia conversational system.

SESSION: Session 8: Emerging Multimedia Applications-IV

VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds

Guanze Liu
Yu Rong
Lu Sheng

3D human mesh recovery from point clouds is essential for various tasks, including
AR/VR and human behavior understanding. Previous works in this field either require
high-quality 3D human scans or sequential point clouds, which cannot be easily applied
to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we
make the first attempt to reconstruct reliable 3D human shapes from single-frame partial
point clouds. To achieve this, we propose an end-to-end learnable method, named VoteHMR.
The core of VoteHMR is a novel occlusion-aware voting network that can first reliably
produce visible joint-level features from the input partial point clouds, and then
complete the joint-level features through the kinematic tree of the human skeleton.
Compared with holistic features used by previous works, the joint-level features can
not only effectively encode the human geometry information but also be robust to noisy
inputs with self-occlusions and missing areas. By exploiting the rich complementary
clues from the joint-level features and global features from the input point clouds,
the proposed method encourages reliable and disentangled parameter predictions for
statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art
performance on two large-scale datasets, namely SURREAL and DFAUST. Furthermore,
VoteHMR also demonstrates superior generalization ability on real-world datasets,
such as Berkeley MHAD.

MageAdd: Real-Time Interaction Simulation for Scene Synthesis

Shao-Kui Zhang
Yi-Xiao Li
Yu He
Yong-Liang Yang
Song-Hai Zhang

While recent research on computational 3D scene synthesis has achieved impressive
results, automatically synthesized scenes do not guarantee the satisfaction of end users.
On the other hand, manual scene modelling can always ensure high quality, but requires
a cumbersome trial-and-error process. In this paper, we bridge the above gap by presenting
a data-driven 3D scene synthesis framework that can intelligently infer objects to
add to the scene by incorporating and simulating user preferences with minimum input. While
the cursor is moved and clicked in the scene, our framework automatically selects
and transforms suitable objects into scenes in real time. This is based on priors
learnt from the dataset for placing different types of objects, and updated according
to the current scene context. Through extensive experiments we demonstrate that our
framework outperforms the state-of-the-art on result aesthetics, and enables effective
and efficient user interactions.

Cross-View Exocentric to Egocentric Video Synthesis

Gaowen Liu
Hao Tang
Hugo M. Latapie
Jason J. Corso
Yan Yan

The cross-view video synthesis task seeks to generate video sequences of one view from
another dramatically different view. In this paper, we investigate the exocentric
(third-person) view to egocentric (first-person) view video generation task. This
is challenging because the egocentric view is sometimes remarkably different from the
exocentric view. Thus, transforming the appearances across the two different views
is a non-trivial task. In particular, we propose a novel Bi-directional Spatial Temporal
Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and
temporal information to generate egocentric video sequences from the exocentric view.
The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and
attention fusion. First, the temporal and spatial branches generate a sequence of
fake frames and their corresponding features. The fake frames are generated in both
downstream and upstream directions for both temporal and spatial branches. Next, the
generated four different fake frames and their corresponding features (spatial and
temporal branches in two directions) are fed into a novel multi-generation attention
fusion module to produce the final video sequence. Meanwhile, we also propose a novel
temporal and spatial dual-discriminator for more robust network optimization. Extensive
experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly
outperforms the existing methods.

EVRNet: Efficient Video Restoration on Edge Devices

Sachin Mehta
Amit Kumar
Fitsum Reda
Varun Nasery
Vikram Mulukutla
Rakesh Ranjan
Vikas Chandra

In video transmission applications, video signals are transmitted over lossy channels,
resulting in low-quality received signals. To restore videos on recipient edge devices
in real-time, we introduce an efficient video restoration network, EVRNet. EVRNet
efficiently allocates parameters inside the network using alignment, differential,
and fusion modules. With extensive experiments on different video restoration tasks
(deblocking, denoising, and super-resolution), we demonstrate that EVRNet delivers
competitive performance to existing methods with significantly fewer parameters and
MACs. For example, EVRNet has 260× fewer parameters and 958× fewer MACs than the enhanced
deformable convolution-based video restoration network (EDVR) for 4× video super-resolution,
while its SSIM score is 0.018 lower than that of EDVR. We also evaluated the performance of
EVRNet under multiple distortions on an unseen dataset to demonstrate its ability in
modeling variable-length sequences under both camera and object motion.

Multimodal Entity Linking: A New Dataset and A Baseline

Jingru Gan
Jinchang Luo
Haiwei Wang
Shuhui Wang
Wei He
Qingming Huang

In this paper, we introduce a new Multimodal Entity Linking (MEL) task on multimodal
data. The MEL task discovers entities in multiple modalities and various forms within
large-scale multimodal data and maps multimodal mentions in a document to entities
in a structured knowledge base such as Wikipedia. Different from the conventional
Neural Entity Linking (NEL) task that focuses on textual information solely, MEL aims
at achieving human-level disambiguation among entities in images, texts, and knowledge
bases. Due to the lack of sufficient labeled data for the MEL task, we release a large-scale
multimodal entity linking dataset M3EL (short for MultiModal Movie Entity Linking).
Specifically, we collect reviews and images of 1,100 movies, extract textual and visual
mentions, and label them with entities registered in Wikipedia. In addition, we construct
a new baseline method to solve the MEL problem, which models the alignment of textual
and visual mentions as a bipartite graph matching problem and solves it with an optimal-transportation-based
linking method. Extensive experiments on the M3EL dataset verify the quality of the
dataset and the effectiveness of the proposed method. We envision this work to be
helpful for soliciting more research effort and applications regarding multimodal
computing and inference in the future. We make the dataset and the baseline algorithm
publicly available at https://jingrug.github.io/research/M3EL.

AI-Lyricist: Generating Music and Vocabulary Constrained Lyrics

Xichu Ma
Ye Wang
Min-Yen Kan
Wee Sun Lee

We propose AI-Lyricist: a system to generate novel yet meaningful lyrics given a required
vocabulary and a MIDI file as inputs. This task involves multiple challenges, including
automatically identifying the melody and extracting a syllable template from multi-channel
music, generating creative lyrics that match the input music's style and syllable
alignment, and satisfying vocabulary constraints. To address these challenges, we
propose an automatic lyrics generation system consisting of four modules: (1) A music
structure analyzer to derive the musical structure and syllable template from a given
MIDI file, utilizing the concept of expected syllable number to better identify the
melody, (2) a SeqGAN-based lyrics generator optimized by multi-adversarial training
through policy gradients with twin discriminators for text quality and syllable alignment,
(3) a deep coupled music-lyrics embedding model to project music and lyrics into a
joint space to allow fair comparison of both melody and lyric constraints, and (4) a module
called Polisher, to satisfy vocabulary constraints by applying a mask to the generator
and substituting the words to be learned. We trained our model on a dataset of over
7,000 music-lyrics pairs, enhanced with manually annotated labels in terms of theme,
sentiment and genre. Both objective and subjective evaluations show AI-Lyricist's
superior performance against the state-of-the-art for the proposed tasks.

SESSION: Session 9: Emotional and Social Signals in Multimedia

CaFGraph: Context-aware Facial Multi-graph Representation for Facial Action Unit Recognition

Yingjie Chen
Diqi Chen
Yizhou Wang
Tao Wang
Yun Liang

Facial action unit (AU) recognition has attracted increasing attention due to its
indispensable role in affective computing, especially in the field of affective human-computer
interaction. Due to the subtle and transient nature of AU, it is challenging to capture
the delicate and ambiguous motions in local facial regions among consecutive frames.
Considering that context is essential to resolve ambiguity in the human visual system,
modeling context within or among facial images emerges as a promising approach for
the AU recognition task. To this end, we propose CaFGraph, a novel context-aware facial
multi-graph that can model both morphological & muscular-based region-level local
context and region-level temporal context. CaFGraph is the first work to construct
a universal facial multi-graph structure that is independent of both task settings
and dataset statistics for almost all fine-grained facial behavior analysis tasks,
including but not limited to AU recognition. To make full use of the context, we then
present CaFNet that learns context-aware facial graph representations via CaFGraph
from facial images for multi-label AU recognition. Experiments on two widely used
benchmark datasets, BP4D and DISFA, demonstrate the superiority of our CaFNet over
the state-of-the-art methods.

Self-Supervised Regional and Temporal Auxiliary Tasks for Facial Action Unit Recognition

Jingwei Yan
Jingjing Wang
Qiang Li
Chunmao Wang
Shiliang Pu

Automatic facial action unit (AU) recognition is a challenging task due to the scarcity
of manual annotations. To alleviate this problem, considerable effort has been
dedicated to exploiting various methods which leverage abundant unlabeled data. However,
many aspects with regard to some unique properties of AUs, such as the regional and
relational characteristics, are not sufficiently explored in previous works. Motivated
by this, we take the AU properties into consideration and propose two auxiliary
AU-related tasks to bridge the gap between limited annotations and the model performance
in a self-supervised manner via the unlabeled data. Specifically, to enhance the discrimination
of regional features with AU relation embedding, we design a task of RoI inpainting
to recover the randomly cropped AU patches. Meanwhile, a single-image-based optical
flow estimation task is proposed to leverage the dynamic change of facial muscles
and encode the motion information into the global feature representation. Based on
these two self-supervised auxiliary tasks, local features, mutual relation and motion
cues of AUs are better captured in the backbone network with the proposed regional
and temporal based auxiliary task learning (RTATL) framework. Extensive experiments
on BP4D and DISFA demonstrate the superiority of our method and new state-of-the-art
performance is achieved.

HetEmotionNet: Two-Stream Heterogeneous Graph Recurrent Neural Network for Multi-modal Emotion Recognition

Ziyu Jia
Youfang Lin
Jing Wang
Zhiyang Feng
Xiangheng Xie
Caijie Chen

The research on human emotion under multimedia stimulation based on physiological
signals is an emerging field and important progress has been achieved for emotion
recognition based on multi-modal signals. However, it is challenging to make full
use of the complementarity among spatial-spectral-temporal domain features for emotion
recognition, as well as model the heterogeneity and correlation among multi-modal
signals. In this paper, we propose a novel two-stream heterogeneous graph recurrent
neural network, named HetEmotionNet, fusing multi-modal physiological signals for
emotion recognition. Specifically, HetEmotionNet consists of the spatial-temporal
stream and the spatial-spectral stream, which can fuse spatial-spectral-temporal domain
features in a unified framework. Each stream is composed of the graph transformer
network for modeling the heterogeneity, the graph convolutional network for modeling
the correlation, and the gated recurrent unit for capturing the temporal domain or
spectral domain dependency. Extensive experiments on two real-world datasets demonstrate
that our proposed model achieves better performance than state-of-the-art baselines.

Simplifying Multimodal Emotion Recognition with Single Eye Movement Modality

Xu Yan
Li-Ming Zhao
Bao-Liang Lu

Multimodal emotion recognition has long been a popular topic in affective computing
since it significantly enhances the performance compared with that of a single modality.
Among all modality combinations, pairing electroencephalography (EEG) with eye movement signals
is one of the most attractive practices due to their complementarity and objectivity.
However, the high cost and inconvenience of EEG signal acquisition severely hamper
the popularization of multimodal emotion recognition in practical scenarios, while
eye movement signals are much easier to acquire. To increase the feasibility and the
generalization ability of emotion decoding without compromising the performance, we
propose a generative adversarial network-based framework. In our model, a single modality
of eye movements is used as input and it is capable of mapping the information onto
multimodal features. Experimental results on SEED series datasets with different emotion
categories demonstrate that our model with multimodal features generated by the single
eye movement modality maintains competitive accuracies compared to those with multimodality
input and drastically outperforms those single-modal emotion classifiers. This illustrates
that the model has the potential to reduce the dependence on multimodalities without
sacrificing performance, which makes emotion recognition more applicable and practical.

Learning What and When to Drop: Adaptive Multimodal and Contextual Dynamics for Emotion Recognition in Conversation

Feiyu Chen
Zhengxiao Sun
Deqiang Ouyang
Xueliang Liu
Jie Shao

Multi-sensory data has exhibited a clear advantage in expressing richer and more complex
feelings on the Emotion Recognition in Conversation (ERC) task. Yet, current methods
for multimodal dynamics that aggregate modalities or employ additional modality-specific
and modality-shared networks are still inadequate in balancing between the sufficiency
of multimodal processing and the scalability to incremental multi-sensory data type
additions. This creates a bottleneck for improving ERC performance. To this end,
we present MetaDrop, a differentiable and end-to-end approach for the ERC task that
learns module-wise decisions across modalities and conversation flows simultaneously,
which supports adaptive information-sharing patterns and dynamic fusion paths. Our
framework mitigates the problem of modelling complex multimodal relations while ensuring
it enjoys good scalability to the number of modalities. Experiments on two popular
multimodal ERC datasets show that MetaDrop achieves new state-of-the-art results.

Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network

Fan Qi
Xiaoshan Yang
Changsheng Xu

Recognizing human emotions from videos has attracted significant attention in numerous
computer vision and multimedia applications, such as human-computer interaction and
health care. It aims to understand the emotional response of humans, where candidate
emotion categories are generally defined by specific psychological theories. However,
with the development of psychological theories, emotion categories become increasingly
diverse and fine-grained, and samples are also increasingly difficult to collect. In this
paper, we investigate a new task of zero-shot video emotion recognition, which aims
to recognize rare unseen emotions. Specifically, we propose a novel multimodal protagonist-aware
transformer network, which is composed of two branches: one is equipped with a novel
dynamic emotional attention mechanism and a visual transformer to learn better visual
representations; the other is an acoustic transformer for learning discriminative
acoustic representations. We manage to align the visual and acoustic representations
with semantic embeddings of fine-grained emotion labels through jointly mapping them
into a common space under a noise contrastive estimation objective. Extensive experimental
results on three datasets demonstrate the effectiveness of the proposed method.
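
The abstract above aligns visual and acoustic representations with label embeddings in a common space under a noise contrastive estimation objective. One widely used contrastive objective of this family is InfoNCE, sketched below for a single anchor. Choosing InfoNCE, the dot-product similarity, the temperature value, and the function name `info_nce` are assumptions for illustration; the paper's exact objective may differ.

```python
import math

def dot(x, y):
    """Plain dot product used as the similarity score (assumption)."""
    return sum(a * b for a, b in zip(x, y))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding.

    anchor:    e.g. a video representation mapped into the common space
    positive:  embedding of its true emotion label
    negatives: embeddings of other labels acting as noise samples
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]

# A well-aligned pair yields a much smaller loss than a misaligned one.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)  # True
```

Minimising this loss pulls each representation toward its own label embedding and pushes it away from the others, which is what lets unseen labels be scored by similarity at test time.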

SESSION: Session 10: Industrial Track

Show, Read and Reason: Table Structure Recognition with Flexible Context Aggregator

Hao Liu
Xin Li
Bing Liu
Deqiang Jiang
Yinsong Liu
Bo Ren
Rongrong Ji

We investigate the challenging problem of table structure recognition in this work.
Many recent methods adopt graph-based context aggregators with strong inductive bias
to reason about sparse contextual relationships of table elements. However, the strong constraints
may be too restrictive to represent the complicated table relationships. In order
to learn more appropriate inductive bias from data, we introduce a Transformer
as the context aggregator in this work. Nevertheless, a Transformer taking dense context
as input requires larger-scale data and may suffer from an unstable training procedure
due to the weakening of inductive bias. To overcome the above limitations, we
design FLAG (FLexible context AGgregator), which marries the Transformer with
a graph-based context aggregator in an adaptive way. Based on FLAG, an end-to-end framework
requiring no extra meta-data or OCR information, termed FLAG-Net, is proposed to flexibly
modulate the aggregation of dense and sparse context for the relational reasoning
of table elements. We investigate the modulation pattern in FLAG and show which contextual
information it focuses on, which is vital for recognizing table structure. Extensive
experimental results on benchmarks demonstrate that our proposed FLAG-Net
surpasses other compared methods by a large margin.

TransFusion: Multi-Modal Fusion for Video Tag Inference via Translation-based Knowledge Embedding

Di Jin
Zhongang Qi
Yingmin Luo
Ying Shan

Tag inference is an important task in the business of video platforms with wide applications
such as recommendation, interpretation, and more. Existing works are mainly based
on extracting video information from multiple modalities such as frames or music,
and then inferring tags through classification or object detection. This, however, does
not apply to inferring generic tags or taxonomy that are less relevant to video contents,
such as video originality or its broader category, which are important in practice.
In this paper, we claim that these generic tags can be modeled through the semantic
relations between videos and tags, and can be utilized simultaneously with the multi-modal
features to achieve better video tagging. We propose TransFusion, an end-to-end supervised
learning framework that fuses multi-modal embeddings (e.g., vision, audio, texts,
etc.) with the knowledge embedding to derive the video representation. To infer the
diverse tags following heterogeneous relations, TransFusion adopts a dual attentive
approach to learn both the modality importance in fusion and relation importance in
inference. Besides, it is general enough and can be used with the existing translation-based
knowledge embedding approaches. Extensive experiments show that TransFusion outperforms
the baseline methods with lowered mean rank and at least 9.59% improvement in HITS@10
on the real-world video knowledge graph.
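
The abstract above builds on translation-based knowledge embedding and reports mean rank and HITS@10. The sketch below shows how these pieces fit together, using TransE-style scoring as one representative translation-based model: a triple (head, relation, tail) is plausible when head + relation lies close to tail. The use of TransE, the L1 distance, the toy vectors, and all function names are assumptions for illustration and not the TransFusion implementation.

```python
def transe_score(head, relation, tail):
    """L1 distance ||h + r - t||; a lower score means a more plausible triple."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))

def rank_of_true_tail(head, relation, candidates, true_index):
    """Rank (1 = best) of the true tail among all candidate tails."""
    scores = [transe_score(head, relation, t) for t in candidates]
    # Rank is one plus the number of candidates scoring strictly better.
    return 1 + sum(s < scores[true_index] for s in scores)

def mean_rank(ranks):
    """Average rank of the true answers; lower is better."""
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Fraction of queries whose true answer is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Toy query: a video embedding, a "has tag" relation, three candidate tags.
video = [0.0, 0.0]
has_tag = [1.0, 0.0]
tags = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
print(rank_of_true_tail(video, has_tag, tags, true_index=0))  # 1
```

Both metrics are computed from the same list of ranks, which is why a method can be reported as lowering mean rank and raising HITS@10 at once.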

RecycleNet: An Overlapped Text Instance Recovery Approach

Yiqing Hu
Yan Zheng
Xinghua Jiang
Hao Liu
Deqiang Jiang
Yinsong Liu
Bo Ren
Rongrong Ji

Text recognition is the key pillar for many real-world multimedia applications. Existing
text recognition approaches focus on recognizing isolated instances, whose text fields
are visually separated and have no interference with each other. Moreover, these approaches
cannot handle overlapped instances that often appear in sheets like invoices, receipts
and math exercises, where printed templates are generated beforehand and extra contents
are added afterward on existing texts. In this paper, we aim to tackle this problem
by proposing RecycleNet, which automatically extracts and reconstructs overlapped
instances by fully recycling the intersecting pixels that used to be obstacles for
recognition. RecycleNet runs in parallel with existing recognition systems and serves as a
plug-and-play module to boost recognition performance with zero effort. We also release
the OverlapText-500 dataset, which helps to boost the design of better overlapped text
recovery and recognition solutions.

ARShoe: Real-Time Augmented Reality Shoe Try-on System on Smartphones

Shan An
Guangfu Che
Jinghao Guo
Haogang Zhu
Junjie Ye
Fangru Zhou
Zhaoqi Zhu
Dong Wei
Aishan Liu
Wei Zhang

Virtual try-on technology enables users to try various fashion items using augmented
reality and provides a convenient online shopping experience. However, most previous
works focus on the virtual try-on for clothes while neglecting that for shoes, which
is also a promising task. To address this, this work proposes a real-time augmented
reality virtual shoe try-on system for smartphones, namely ARShoe. Specifically, ARShoe
adopts a novel multi-branch network to realize pose estimation and segmentation simultaneously.
A solution to generate realistic 3D shoe model occlusion during the try-on process
is presented. To achieve a smooth and stable try-on effect, this work further develops
a novel stabilization method. Moreover, for training and evaluation, we construct
the very first large-scale foot benchmark with multiple virtual shoe try-on task-related
labels annotated. Exhaustive experiments on our newly constructed benchmark demonstrate
the satisfactory performance of ARShoe. Practical tests on common smartphones validate
the real-time performance and stabilization of the proposed approach.

Inferring the Importance of Product Appearance with Semi-supervised Multi-modal Enhancement: A Step Towards the Screenless Retailing

Yongshun Gong
Jinfeng Yi
Dong-Dong Chen
Jian Zhang
Jiayu Zhou
Zhihua Zhou

Nowadays, almost all online orders are placed through screened devices such as
mobile phones, tablets, and computers. With the rapid development of the Internet
of Things (IoT) and smart appliances, more and more screenless smart devices, e.g.,
smart speakers and smart refrigerators, appear in our daily lives. They open up new
means of interaction and may provide an excellent opportunity to reach new customers
and increase sales. However, not all the items are suitable for screenless shopping,
since some items' appearance plays an important role in consumer decision making. Typical
examples include clothes, dolls, bags, and shoes. In this paper, we aim to infer the
significance of every item's appearance in consumer decision making and identify the
group of items that are suitable for screenless shopping. Specifically, we formulate
the problem as a classification task that predicts if an item's appearance has a significant
impact on people's purchase behavior. To solve this problem, we extract multi-modal
features from three different views, and collect a set of necessary labels via crowdsourcing.
We then propose an iterative semi-supervised learning framework with a carefully designed
multi-modal enhancement module. Experimental results verify the effectiveness of the
proposed method.

AsyNCE: Disentangling False-Positives for Weakly-Supervised Video Grounding

Cheng Da
Yanhao Zhang
Yun Zheng
Pan Pan
Yinghui Xu
Chunhong Pan

Weakly-supervised video grounding has been investigated to ground textual phrases in
video content with only video-sentence pairs provided during training, since
bounding box annotations are prohibitively costly. Existing methods cast this task
into a frame-level multiple instance learning (MIL) problem with the ranking loss.
However, an object might appear only sparsely across multiple frames, causing uncertain false-positive
frames. Thus, directly computing the average loss of all frames is inadequate in the video
domain. Moreover, the positive and negative pairs are equally coupled in the ranking
loss, so that it is impossible to handle false-positive frames individually. Additionally,
the naive inner product is suboptimal as a cross-domain similarity measure.
To solve these issues, we propose a novel AsyNCE loss to flexibly disentangle the
positive pairs from negative ones in frame-level MIL, which allows for mitigating
the uncertainty of false-positive frames effectively. Besides, a cross-modal transformer
block is introduced to purify the text feature by image frame context, generating
a visual-guided text feature for a better similarity measure. Extensive experiments
on YouCook2, RoboWatch and WAB datasets demonstrate the superiority and robustness
of our method over state-of-the-art methods.

Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu

We study the joint learning of image-to-text and text-to-image generations, which
are naturally bi-directional tasks. Typical existing works design a separate task-specific
model for each task, which imposes expensive design efforts. In this work, we propose
a unified image-and-text generative framework based on a single multimodal model to
jointly study the bi-directional tasks. We adopt Transformer as our unified architecture
for its strong performance and task-agnostic design. Specifically, we formulate both
tasks as sequence generation tasks, where we represent images and text as unified
sequences of tokens, and the Transformer learns multimodal interactions to generate
sequences. We further propose two-level granularity feature representations and sequence-level
training to improve the Transformer-based unified framework. Experiments show that
our approach significantly improves the FID of the previous Transformer-based model X-LXMERT
from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves CIDEr-D
score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO
dataset. Our code is available online.

Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba

Lianghua Huang
Yu Liu
Xiangzeng Zhou
Ansheng You
Ming Li
Bin Wang
Yingya Zhang
Pan Pan
Xu Yinghui

Video has grown to be one of the largest media on the Internet. E-commerce platforms
like Alibaba need to process millions of videos across multiple modalities (e.g., visual,
audio, image, and text) and on a variety of tasks (e.g., retrieval, tagging, and summary)
every day. In this work, we aim to develop a once and for all pretraining technique
for diverse modalities and downstream tasks. To achieve this, we make the following
contributions: (1) We propose a self-supervised multi-modal co-training framework.
It takes cross-modal pseudo-label consistency as the supervision and can jointly learn
representations of multiple modalities. (2) We introduce several novel techniques
(e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal
convolution and parallel data transmission and processing) to optimize the training
process, making billion-scale stable training feasible. (3) We construct a large-scale
multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework
on it. The training takes only 4.6 days on an in-house 256-GPU cluster, and it simultaneously
produces pretrained video, audio, image, motion, and text networks. (4) Finetuning
from our pretrained models, we obtain significant performance gains and faster convergence
on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned
representation on public datasets. Despite the domain gap between our commodity-centric
pretraining and the action-centric evaluation data, we show superior results against
state-of-the-art methods.

L2RS: A Learning-to-Rescore Mechanism for Hybrid Speech Recognition

Yuanfeng Song
Di Jiang
Xuefang Zhao
Qian Xu
Raymond Chi-Wing Wong
Lixin Fan
Qiang Yang

This paper aims to advance the performance of industrial ASR systems by exploring
a more effective method for N-best rescoring, a critical step that greatly affects
the final recognition accuracy. Existing rescoring approaches suffer from the following
issues: (i) limited performance since they optimize an unnecessarily harder problem,
namely predicting accurate grammatical legitimacy scores of the N-best hypotheses
rather than directly predicting their partial orders regarding a specific acoustic
input; (ii) difficulty incorporating various information from advanced natural language processing
(NLP) models such as BERT to achieve a comprehensive evaluation of each N-best candidate.
To relieve the above drawbacks, we propose a simple yet effective mechanism, Learning-to-Rescore
(L2RS), to empower ASR systems with state-of-the-art information retrieval (IR) techniques.
Specifically, L2RS utilizes a wide range of textual information from the state-of-the-art
NLP models and automatically decides their weights to directly learn the ranking
order of each N-best hypothesis with respect to a specific acoustic input. We incorporate
various features including BERT sentence embeddings, the topic vectors, and perplexity
scores produced by an n-gram language model (LM), topic modeling LM, BERT, and RNNLM
to train the rescoring model. Experimental results on a public dataset show that L2RS
outperforms not only traditional rescoring methods but also its deep neural network
counterparts by a substantial margin of 20.85% in terms of NDCG@10. The L2RS toolkit
has been successfully deployed for many online commercial services in WeBank Co.,
Ltd, China's leading digital bank. The efficacy and applicability of L2RS are validated
by real-life online customer datasets.
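
The abstract above treats N-best rescoring as a ranking problem and reports gains in NDCG@10. As a reminder of what that metric measures, the sketch below computes NDCG@k for one re-ranked hypothesis list. The linear-gain form of DCG, the graded relevance labels (for instance, derived from how close each hypothesis is to the reference transcript), and the function names are assumptions for illustration; the paper's exact relevance definition is not given here.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given order divided by the DCG of the ideal order.

    relevances[i] is the hypothetical relevance grade of the hypothesis
    that the rescoring model placed at position i.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A rescoring that puts the best hypothesis first scores 1.0;
# the reversed order is penalised.
print(ndcg_at_k([3, 2, 1]))  # 1.0
print(ndcg_at_k([1, 2, 3]) < 1.0)  # True
```

Because the discount grows with position, the metric rewards placing the most accurate hypothesis at the top, which matches the goal of learning a ranking rather than absolute scores.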

Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding

Avijit Shah
Topojoy Biswas
Sathish Ramadoss
Deven Santosh Shah

Comprehensive understanding of key players and actions in multiplayer sports broadcast
videos is a challenging problem. Unlike in news or finance videos, sports videos have
limited text. While both action recognition for multiplayer sports and detection of
players have seen robust research, understanding contextual text in video frames
remains one of the most impactful avenues of sports video understanding. In this work
we study extremely accurate semantic text detection and recognition in sports clocks,
and the challenges therein. We observe unique properties of sports clocks, which make
it hard to utilize general-purpose pre-trained detectors and recognizers so that
text can be accurately understood to the degree of being used to align to external
knowledge. We propose a novel distant supervision technique to automatically build
sports clock datasets. Along with suitable data augmentations, combined with any state-of-the-art
text detection and recognition model architectures, we extract extremely accurate
semantic text. Finally, we share our computational architecture pipeline to scale
this system in an industrial setting and propose a robust dataset for the same to validate
our results.

Focusing on Persons: Colorizing Old Images Learning from Modern Historical Movies

Xin Jin
Zhonglan Li
Ke Liu
Dongqing Zou
Xiaodong Li
Xingfan Zhu
Ziyin Zhou
Qilong Sun
Qingyu Liu

In industry, there exist plenty of scenarios where old gray photos need to be automatically
colored, such as video sites and archives. In this paper, we present HistoryNet,
which focuses on diverse, high-fidelity clothing colorization of historical persons based
on fine-grained semantic understanding and priors. Colorization of historical persons
is realistic and practical; however, existing methods do not perform well in this regard.
HistoryNet consists of three parts, namely classification, fine-grained
semantic parsing, and colorization. The classification sub-module classifies
images according to era, nationality, and garment type; the parsing
sub-network supplies the semantics of person contours, clothing, and background in
the image to achieve more accurate colorization of clothes and persons and prevent
color overflow. In the training process, we integrate classification and semantic
parsing features into the coloring generation network to improve colorization. Through
the design of the classification and parsing subnetworks, the accuracy of image colorization
can be improved and the boundary of each part of the image becomes clearer. Moreover,
we also propose a novel Modern Historical Movies Dataset (MHMD) containing 1,353,166
images and 42 labels of eras, nationalities, and garment types for automatic colorization
from 147 historical movies or TV series made in modern times. Various quantitative
and qualitative comparisons demonstrate that our method outperforms the state-of-the-art
colorization methods, especially on military uniforms, which are rendered in colors consistent
with the historical literature.

Personalized Multi-modal Video Retrieval on Mobile Devices

Haotian Zhang
Allan D. Jepson
Iqbal Mohomed
Konstantinos G. Derpanis
Ran Zhang
Afsaneh Fazly

Current video retrieval systems on mobile devices cannot process complex natural language
queries, especially if they contain personalized concepts, such as proper names. To
address these shortcomings, we propose an efficient and privacy-preserving video retrieval
system that works well with personalized queries containing proper names, without
re-training using personalized labelled data from users. Our system first computes
an initial ranking of a video collection by using a generic attention-based video-text
matching model (i.e., a model designed for non-personalized queries), and then uses
a face detector to conduct personalized adjustments to these initial rankings. These
adjustments are done by reasoning over the face information from the detector and
the attention information provided by the generic model. We show that our system significantly
outperforms existing keyword-based retrieval systems, and achieves comparable performance
to the generic matching model fine-tuned on plenty of labelled data. Our results suggest
that the proposed system can effectively capture both semantic context and personalized
information in queries.

Boosting End-to-end Multi-Object Tracking and Person Search via Knowledge Distillation

Wei Zhang
Lingxiao He
Peng Chen
Xingyu Liao
Wu Liu
Qi Li
Zhenan Sun

Multi-Object Tracking (MOT) and Person Search both require localizing and identifying
specific targets from raw image frames. Existing methods can be classified into two
categories, namely two-step strategies and end-to-end strategies. Two-step approaches
have high accuracy but suffer from costly computations, while end-to-end methods show
greater efficiency with limited performance. In this paper, we dissect the gap between
the two-step and end-to-end strategies and propose a simple yet effective end-to-end framework
with knowledge distillation. Our proposed framework is simple in concept and easy
to benefit from external datasets. Experimental results demonstrate that our model
performs competitively with other sophisticated two-step and end-to-end methods in
multi-object tracking and person search.

A Virtual Character Generation and Animation System for E-Commerce Live Streaming

Li Hu
Bang Zhang
Peng Zhang
Jinwei Qi
Jian Cao
Daiheng Gao
Haiming Zhao
Xiaoduan Feng
Qi Wang
Lian Zhuo
Pan Pan
Yinghui Xu

Virtual characters have been widely adopted in many areas, such as virtual assistants,
virtual customer service, and robotics. In this paper, we focus on their application
in e-commerce live streaming. Particularly, we propose a virtual character generation
and animation system that supports e-commerce live streaming with virtual characters
as anchors. The system offers a virtual character face generation tool based on a
weakly supervised 3D face reconstruction method. The method takes a single photo as
input and generates a 3D face model with both similarity and aesthetics considered.
It does not require 3D face annotation data thanks to a differentiable neural
rendering technique that seamlessly integrates rendering into a deep learning based
3D face reconstruction framework. Moreover, the system provides two animation approaches
that support two different modes of live streaming. The first approach is
based on real-time motion capture. An actor's performance is captured in real-time
via a monocular camera, and then utilized for animating a virtual anchor. The second
approach is text driven animation, in which the human-like animation is automatically
generated based on a text script. The relationship between text script and animation
is learned based on the training data which can be accumulated via the motion capture
based animation. To the best of our knowledge, the presented work is the first sophisticated
virtual character generation and animation system that is designed for e-commerce
live streaming and actually deployed on an online shopping platform with millions
of daily viewers.

Improving Fake News Detection by Using an Entity-enhanced Framework to Fuse Diverse
Multimodal Clues

Peng Qi
Juan Cao
Xirong Li
Huan Liu
Qiang Sheng
Xiaoyue Mi
Qin He
Yongbiao Lv
Chenyang Guo
Yingchao Yu

Recently, fake news with text and images has achieved more effective diffusion than
text-only fake news, raising the pressing issue of multimodal fake news detection. Current
studies on this issue have made significant contributions to developing multimodal
models, but they fall short of modeling the multimodal content sufficiently. Most
of them only preliminarily model the basic semantics of the images as a supplement
to the text, which limits their detection performance. In this paper, we find three
valuable text-image correlations in multimodal fake news: entity inconsistency, mutual
enhancement, and text complementation. To effectively capture these multimodal clues,
we innovatively extract visual entities (such as celebrities and landmarks) to understand
the news-related high-level semantics of images, and then model the multimodal entity
inconsistency and mutual enhancement with the help of visual entities. Moreover, we
extract the embedded text in images as the complementation of the original text. All
things considered, we propose a novel entity-enhanced multimodal fusion framework,
which simultaneously models three cross-modal correlations to detect diverse multimodal
fake news. Extensive experiments demonstrate the superiority of our model compared
to the state of the art.

SESSION: Session 11: Multimedia HCI and Quality of Experience

Fast Video Visual Quality and Resolution Improvement using SR-UNet

Federico Vaccaro
Marco Bertini
Tiberio Uricchio
Alberto Del Bimbo

In this paper, we address the problem of real-time video quality enhancement, considering
both frame super-resolution and compression artifact-removal. The first operation
increases the sampling resolution of video frames, the second removes visual artifacts
such as blurriness, noise, aliasing, or blockiness introduced by lossy compression
techniques, such as JPEG encoding for single-images, or H.264/H.265 for video data.

We propose to use SR-UNet, a novel network architecture based on UNet, that has been
specialized for fast visual quality improvement (i.e. capable of operating in less
than 40ms, to be able to operate on videos at 25FPS). We show how this network can
be used in a streaming context where the content is generated live, e.g. in video
calls, and how it can be optimized when the videos to be streamed are prepared in advance.
The network can be used as a final post-processing step, to optimize the visual appearance
of a frame before showing it to the end-user in a video player. Thus, it can be applied
without any change to existing video coding and transmission pipelines.
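
The real-time constraint mentioned above follows from simple arithmetic: at 25 FPS each frame must be processed within 1000 / 25 = 40 ms. The sketch below only illustrates that budget check; the function names and latency values are hypothetical and are not measurements from the paper.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def is_real_time(per_frame_latency_ms, fps):
    """True if a model with the given per-frame latency keeps up with the stream."""
    return per_frame_latency_ms <= frame_budget_ms(fps)

budget = frame_budget_ms(25)      # 40.0 ms per frame at 25 FPS
fast_enough = is_real_time(35.0, 25)   # hypothetical 35 ms latency fits the budget
too_slow = is_real_time(55.0, 25)      # hypothetical 55 ms latency does not
```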

Experiments carried out on standard video datasets, also considering H.265 compression,
show that the proposed approach is able to either improve visual quality metrics given
a fixed bandwidth budget, or video distortion given a fixed quality goal.

MS-GraphSIM: Inferring Point Cloud Quality via Multiscale Graph Similarity

Yujie Zhang
Qi Yang
Yiling Xu

To address the point cloud quality assessment (PCQA) problem, GraphSIM was proposed
via jointly considering geometrical and color features, which shows compelling performance
in multiple distortion detection. However, GraphSIM does not take into account the
multiscale characteristics of human perception. In this paper, we propose a multiscale
PCQA model, called Multiscale Graph Similarity (MS-GraphSIM), that can better predict
human subjective perception. First, exploring the multiscale processing method used
in image processing, we introduce a multiscale representation of point clouds based
on graph signal processing. Second, we extend GraphSIM into a multiscale version based
on the proposed multiscale representation. Specifically, MS-GraphSIM constructs a
multiscale representation for each local patch extracted from the reference point
cloud or the distorted point cloud, and then fuses GraphSIM at different scales to
obtain an overall quality score. Experimental results demonstrate that the proposed
MS-GraphSIM outperforms the state-of-the-art PCQA metrics over two fairly large and
independent databases. Ablation studies further prove the proposed MS-GraphSIM is
robust to different model hyperparameter settings. The code is available at https://github.com/zyj1318053/MS_GraphSIM.

I Know Your Keyboard Input: A Robust Keystroke Eavesdropper Based-on Acoustic Signals

Jia-Xuan Bai
Bin Liu
Luchuan Song

Recently, smart devices equipped with microphones have become increasingly popular
in people's lives. However, when users type on a keyboard near devices with microphones,
the acoustic signals generated by different keystrokes may leak the user's privacy.
This paper proposes a robust side-channel attack scheme to infer keystrokes on the
surrounding keyboard, leveraging the smart devices' microphones. To address the challenge
of non-cooperative attacking environments, we propose an efficient scheme to estimate
the relative position between the microphones and the keyboard, and extract two robust
features from the acoustic signals to alleviate the impact of various victims and
keyboards. As a result, we can realize the side-channel attack through acoustic signals,
regardless of the exact location of microphones, the victims, and the type of keyboards.
We implement the proposed scheme on a commercial smartphone and conduct extensive
experiments to evaluate its performance. Experimental results show that the proposed
scheme could achieve good performance in predicting keyboard input under various conditions.
Overall, we can correctly identify 91.2% of keystrokes with 10-fold cross-validation.
When predicting keystrokes from unknown victims, the attack can obtain a Top-5 accuracy
of 91.52%. Furthermore, the Top-5 accuracy of predicting keystrokes can reach 72.25%
when the victims and keyboards are both unknown. When predicting meaningful contents,
we can obtain a Top-5 accuracy of 96.67% for the words entered by the victim.
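
For readers unfamiliar with the Top-5 accuracy figures above, a prediction counts as correct if the true key appears among the five highest-ranked candidates. The sketch below shows that computation on made-up data; the candidate lists and the function name are hypothetical and unrelated to the paper's experiments.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of samples whose true label is among the first k ranked candidates."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Hypothetical ranked key candidates for three keystrokes (best guess first).
ranked = [
    ["a", "s", "q", "w", "z", "x"],
    ["k", "l", "j", "m", "o", "i"],
    ["e", "r", "d", "w", "s", "f"],
]
truth = ["q", "i", "e"]
top5 = top_k_accuracy(ranked, truth, k=5)   # "i" is ranked sixth, so 2 of 3 hit
top1 = top_k_accuracy(ranked, truth, k=1)   # only "e" is the first guess
```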

Perceptual Quality Assessment of Internet Videos

Jiahua Xu
Jing Li
Xingguang Zhou
Wei Zhou
Baichao Wang
Zhibo Chen

With the fast proliferation of online video sites and social media platforms, user-,
professionally-, and occupationally-generated content (UGC, PGC, OGC) videos are streamed
and explosively shared over the Internet. Consequently, it is urgent to monitor the
content quality of these Internet videos to guarantee the user experience. However,
most existing modern video quality assessment (VQA) databases only include UGC videos
and cannot meet the demands for other kinds of Internet videos with real-world distortions.
To this end, we collect 1,072 videos from Youku, a leading Chinese video hosting service
platform, to establish the Internet video quality assessment database (Youku-V1K).
A special sampling method based on several quality indicators is adopted to maximize
the content and distortion diversities within a limited database, and a probabilistic
graphical model is applied to recover reliable labels from noisy crowdsourcing annotations.
Based on the properties of Internet videos originating from Youku, we propose a spatio-temporal
distortion-aware model (STDAM). First, the model works blindly, which means the pristine
video is unnecessary. Second, the model is familiar with diverse contents by pre-training
on the large-scale image quality assessment databases. Third, to measure spatial and
temporal distortions, we introduce the graph convolution and attention module to extract
and enhance the features of the input video. Besides, we leverage the motion information
and integrate the frame-level features into video-level features via a bi-directional
long short-term memory network. Experimental results on the self-built database and
the public VQA databases demonstrate that our model outperforms the state-of-the-art
methods and exhibits promising generalization ability.

Using Interaction Data to Predict Engagement with Interactive Media

Jonathan Carlton
Andy Brown
Caroline Jay
John Keane

Media is evolving from traditional linear narratives to personalised experiences,
where control over information (or how it is presented) is given to individual audience
members. Measuring and understanding audience engagement with this media is important
in at least two ways: (1) a post-hoc understanding of how engaged audiences are with
the content will help production teams learn from experience and improve future productions;
(2) this type of media has potential for real-time measures of engagement to be used
to enhance the user experience by adapting content on-the-fly. Engagement is typically
measured by asking samples of users to self-report, which is time consuming and expensive.
In some domains, however, interaction data have been used to infer engagement. Fortuitously,
the nature of interactive media facilitates a much richer set of interaction data
than traditional media; our research aims to understand if these data can be used
to infer audience engagement. In this paper, we report a study using data captured
from audience interactions with an interactive TV show to model and predict engagement.
We find that temporal metrics, including overall time spent on the experience and
the interval between events, are predictive of engagement. The results demonstrate
that interaction data can be used to infer users' engagement during and after an experience,
and the proposed techniques are relevant to better understand audience preference
and responses.
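
The two temporal metrics named above, overall time spent and the interval between events, can be derived from a list of interaction timestamps. The following sketch assumes timestamps in seconds and a hypothetical function name; it does not reproduce the study's feature set or its predictive model.

```python
def temporal_metrics(timestamps):
    """Summarise one session's interaction timestamps (in seconds)."""
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return {
        "num_events": len(ordered),
        "total_time": ordered[-1] - ordered[0] if ordered else 0.0,
        "mean_interval": sum(intervals) / len(intervals) if intervals else 0.0,
    }

# Hypothetical session: four interaction events logged out of order.
session = [12.0, 0.0, 30.0, 18.0]
metrics = temporal_metrics(session)
```

Features like these could then be fed to any standard classifier or regressor; which model the authors used is not stated in the abstract.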

Air-Text: Air-Writing and Recognition System

Sun-Kyung Lee
Jong-Hwan Kim

Text entry plays an important role in effectively delivering the intention of users
to computers, where physical and soft keyboards have been widely used. However, with
the recent trends of developing technologies like augmented reality and increasing
contactless services due to COVID-19, a more advanced type of text entry is required.
To tackle this issue, we propose Air-Text, an intuitive system for writing in
the air using fingertips as a pen. Unlike previously suggested air-writing systems,
Air-Text provides various functionalities by the seamless integration of air-writing
and text-recognition modules. Specifically, the air-writing module takes a sequence
of RGB images as input and tracks both the location of fingertips (5.33 pixel error
in 640x480 image) and current hand gesture class (98.29% classification accuracy)
frame by frame. Users can easily perform writing operations such as writing or deleting
text by changing hand gestures, and tracked fingertip locations can be stored as
a binary image. Then the text-recognition module, which is compatible with any pre-trained
recognition model, predicts the written text in the binary image. In this paper, examples
of single digit recognition with an MNIST classifier (96.0% accuracy) and word-level
recognition with a text recognition model (79.36% character recognition rate) are provided.
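
To make the step of storing tracked fingertip locations as a binary image concrete, the sketch below rasterizes a sequence of (x, y) points onto a grid by linearly interpolating between consecutive points. The function name, canvas size, and interpolation scheme are assumptions for illustration, not the authors' implementation.

```python
def rasterize_trajectory(points, width, height):
    """Draw a polyline through integer (x, y) points onto a 0/1 canvas."""
    canvas = [[0] * width for _ in range(height)]

    def mark(x, y):
        # Ignore points that fall outside the canvas.
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = 1

    if len(points) == 1:
        mark(*points[0])
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            # Linear interpolation between consecutive fingertip positions.
            mark(round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
    return canvas

# Hypothetical stroke: right along one row, then down one column.
stroke = [(1, 1), (4, 1), (4, 3)]
image = rasterize_trajectory(stroke, width=6, height=5)
```

The resulting grid could then be resized and passed to a pre-trained recognizer, as the abstract describes.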

SESSION: Session 12: Multimodal Analysis and Description-I

How to Learn a Domain-Adaptive Event Simulator?

Daxin Gu
Jia Li
Yu Zhang
Yonghong Tian

The low-latency streams captured by event cameras have shown impressive potential
in addressing vision tasks such as video reconstruction and optical flow estimation.
However, these tasks often require massive training event streams, which are expensive
to collect and largely bypassed by recently proposed event camera simulators. To align
the statistics of synthetic events with those of target event cameras, existing simulators
often need to be heuristically tuned with elaborate manual effort and thus cannot
automatically adapt to various domains. To address this issue, this
work proposes one of the first learning-based, domain-adaptive event simulators. Given
a specific domain, the proposed simulator learns pixel-wise distributions of event
contrast thresholds that, after stochastic sampling and parallel rendering, can
generate event representations well aligned with those from real
event cameras. To achieve such domain-specific alignment, we design a novel divide-and-conquer
discrimination scheme that adaptively evaluates the synthetic-to-real consistency
of event representations according to the local statistics of images and events. Trained
with the data synthesized by the proposed simulator, the performance of state-of-the-art
event-based video reconstruction and optical flow estimation approaches is boosted
by up to 22.9% and 2.8%, respectively. In addition, we show significantly improved domain
adaptation capability over existing event simulators and tuning strategies, consistently
on three real event datasets.
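
The role of per-pixel contrast thresholds can be illustrated with the commonly used event generation model, in which a pixel emits an event whenever its log intensity changes by at least its threshold. The sketch below samples thresholds from a per-pixel normal distribution with hand-picked parameters; the learned distributions, the discriminator, and the rendering pipeline of the paper are not reproduced, and all names and values are hypothetical.

```python
import math
import random

def sample_thresholds(means, stds, rng, floor=0.01):
    """Draw one contrast threshold per pixel from a per-pixel normal distribution."""
    return [
        [max(floor, rng.gauss(m, s)) for m, s in zip(mean_row, std_row)]
        for mean_row, std_row in zip(means, stds)
    ]

def simulate_events(frames, thresholds, eps=1e-3):
    """Emit (frame_index, y, x, polarity) each time a pixel's log-intensity
    change since its last event reaches that pixel's contrast threshold."""
    reference = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                c = thresholds[y][x]
                diff = math.log(value + eps) - reference[y][x]
                while abs(diff) >= c:
                    polarity = 1 if diff > 0 else -1
                    events.append((t, y, x, polarity))
                    reference[y][x] += polarity * c
                    diff = math.log(value + eps) - reference[y][x]
    return events

# Hypothetical 1x2 video: the left pixel brightens, the right pixel is static.
frames = [[[0.1, 0.5]], [[0.8, 0.5]]]
rng = random.Random(0)
thresholds = sample_thresholds([[0.5, 0.5]], [[0.05, 0.05]], rng)
events = simulate_events(frames, thresholds)
```

Lower thresholds yield more events for the same brightness change, which is why the threshold distribution strongly shapes the statistics of the synthetic stream.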

A Stepwise Matching Method for Multi-modal Image based on Cascaded Network

Jinming Mu
Shuiping Gou
Shasha Mao
Shankui Zheng

Template matching of multi-modal images has been a challenge in image matching, and
it is difficult to balance speed and accuracy, especially for images with
large sizes. To address this, we propose a stepwise image matching method that achieves
a precise location through coarse-to-fine image matching by utilizing cascaded networks.
In the proposed method, a coarse-grained matching network is first constructed to
locate a rough matching position based on cross-correlating features of optical and
SAR images. Specifically, to enhance the credible matching position, a suppression network
is designed to evaluate the obtained cross-correlation feature and is added into
the coarse-grained network as feedback. Second, a fine-grained matching network
is constructed based on the obtained rough matching result to gain a more precise
match. In this part, ternary groups are utilized to construct the training samples.
Interestingly, we use regions offset by a few pixels as the negative class,
which effectively distinguishes similar neighbourhoods of the rough matching position.
Moreover, a modified Siamese network is used to extract features of SAR and optical
images, respectively. Finally, experimental results illustrate that the proposed method
obtains more precise matching compared with the state-of-the-art methods.

SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis

Naili Xing
Sai Ho Yeung
Cheng-Hao Cai
Teck Khim Ng
Wei Wang
Kaiyuan Yang
Nan Yang
Meihui Zhang
Gang Chen
Beng Chin Ooi

Deep learning has achieved great success in a wide spectrum of multimedia applications
such as image classification, natural language processing and multimodal data analysis.
Recent years have seen the development of many deep learning frameworks that provide
a high-level programming interface for users to design models, conduct training and
deploy inference. However, it remains challenging to build an efficient end-to-end
multimedia application with most existing frameworks. Specifically, in terms of usability,
it is demanding for non-experts to implement deep learning models, obtain the right
settings for the entire machine learning pipeline, manage models and datasets, and
exploit external data sources all together. Further, in terms of adaptability, elastic
computation solutions are much needed as the actual serving workload fluctuates constantly,
and scaling the hardware resources to handle the fluctuating workload is typically
infeasible. To address these challenges, we introduce SINGA-Easy, a new deep learning
framework that provides distributed hyper-parameter tuning at the training stage,
dynamic computational cost control at the inference stage, and intuitive user interactions
with multimedia contents facilitated by model explanation. Our experiments on the
training and deployment of multi-modality data analysis applications show that the
framework is both usable and adaptable to dynamic inference loads. We implement SINGA-Easy
on top of Apache SINGA and demonstrate our system with the entire machine learning
life cycle.

Informative Class-Conditioned Feature Alignment for Unsupervised Domain Adaptation

Wanxia Deng
Yawen Cui
Zhen Liu
Gangyao Kuang
Dewen Hu
Matti Pietikäinen
Li Liu

The goal of unsupervised domain adaptation is to learn a task classifier that performs
well for the unlabeled target domain by borrowing rich knowledge from a well-labeled
source domain. Although remarkable breakthroughs have been achieved in learning transferable
representation across domains, two bottlenecks remain to be further explored. First,
many existing approaches focus primarily on the adaptation of the entire image, ignoring
the limitation that not all features are transferable and informative for the object
classification task. Second, the features of the two domains are typically aligned
without considering the class labels; this can lead the resulting representations
to be domain-invariant but non-discriminative to the category. To overcome the two
issues, we present a novel Informative Class-Conditioned Feature Alignment (IC2FA)
approach for UDA, which utilizes a twofold method: informative feature disentanglement
and class-conditioned feature alignment, designed to address the above two challenges,
respectively. More specifically, to surmount the first drawback, we cooperatively
disentangle the two domains to obtain informative transferable features; here, Variational
Information Bottleneck (VIB) is employed to encourage the learning of task-related
semantic representations and suppress task-unrelated information. With regard to the
second bottleneck, we optimize a new metric, termed Conditional Sliced Wasserstein
Distance (CSWD), which explicitly estimates the intra-class discrepancy and the inter-class
margin. The intra-class and inter-class CSWDs are minimized and maximized, respectively,
to yield the domain-invariant discriminative features. IC2FA equips class-conditioned
feature alignment with informative feature disentanglement and causes the two procedures
to work cooperatively, which facilitates informative discriminative features adaptation.
Extensive experimental results on three domain adaptation datasets confirm the superiority
of IC2FA.
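
As background for the Conditional Sliced Wasserstein Distance mentioned above, the sketch below estimates the plain squared sliced 2-Wasserstein distance between two equally sized sets of feature vectors: project onto random unit directions, sort the projections, and average the squared differences. It is a generic illustration under those assumptions; the class-conditioned formulation and the margin terms of CSWD are defined in the paper and are not reproduced here. All names and data are hypothetical.

```python
import math
import random

def sliced_wasserstein(xs, ys, num_projections=64, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two equally sized lists of feature vectors."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("both sets must be non-empty and of equal size")
    rng = rng or random.Random(0)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # In one dimension, optimal transport matches sorted projections.
        proj_x = sorted(sum(a * d for a, d in zip(p, direction)) for p in xs)
        proj_y = sorted(sum(a * d for a, d in zip(p, direction)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(proj_x, proj_y)) / len(proj_x)
    return total / num_projections

# Hypothetical 2-D features of one class in a source and a target domain.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = [(x + 3.0, y) for x, y in source]
same_domain = sliced_wasserstein(source, source)
shifted = sliced_wasserstein(source, target)
```

Applying such a distance per class, to source and target features that share a label, is one way to read "class-conditioned" alignment, though the exact estimator used by IC2FA may differ.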

Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer

Zhaoquan Yuan
Xiao Peng
Xiao Wu
Changsheng Xu

Diagram question answering (DQA) is an effective way to evaluate the reasoning ability
for diagram semantic understanding, which is a very challenging task and largely understudied
compared with natural images. Existing separate two-stage methods for DQA are limited
by ineffective feedback mechanisms. To address this problem, in this paper, we propose
a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model
for diagram question answering based on a multi-modal transformer framework. In the
proposed paradigm of multi-task learning, the two tasks of diagram structural parsing
and question answering are at different semantic levels and equipped with different
transformer blocks, which constitutes a hierarchical architecture. The structural
parsing module encodes the information of constituents and their relationships in
diagrams, while the diagram question answering module decodes the structural signals
and combines question-answers to infer correct answers. Visual diagrams and textual
question-answers interact in the multi-modal transformer, which achieves cross-modal
semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D
and FOODWEBS datasets demonstrate the effectiveness of our proposed HMTL over other
state-of-the-art methods.

Differentiated Learning for Multi-Modal Domain Adaptation

Jianming Lv
Kaijie Liu
Shengfeng He

Directly deploying a trained multi-modal classifier to a new environment usually leads
to poor performance due to the well-known domain shift problem. Existing multi-modal
domain adaptation methods treat each modality equally and optimize the sub-models
of different modalities synchronously. However, as observed in this paper, the degrees
of domain shift in different modalities are usually diverse. We propose a novel Differentiated
Learning framework to make use of the diversity between multiple modalities for more
effective domain adaptation. Specifically, we model the classifiers of different modalities
as a group of teacher/student sub-models, and a novel Prototype based Reliability
Measurement is presented to estimate the reliability of the recognition results made
by each sub-model on the target domain. More reliable results are then selected as
teaching materials for all sub-models in the group. Considering the diversity of different
modalities, each sub-model performs Asynchronous Curriculum Learning by choosing
the teaching materials from easy to hard, as measured by itself. Furthermore, a reliability-aware
fusion scheme is proposed to combine all optimized sub-models to support the final decision.
Comprehensive experiments based on three multi-modal datasets with different learning
tasks have been conducted, which show the superior performance of our model
compared with state-of-the-art multi-modal domain adaptation models.

SESSION: Session 13: Multimodal Analysis and Description-II

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

Yang Jiao
Zequn Jie
Weixin Luo
Jingjing Chen
Yu-Gang Jiang
Xiaolin Wei
Lin Ma

Referring Image Segmentation (RIS) aims at segmenting the target object from an image
referred to by a given natural language expression. The diverse and flexible expressions
and complex visual contents in the images place higher demands on the RIS model
to investigate fine-grained matching behaviors between words in expressions and
objects presented in images. However, such matching behaviors are hard to learn
and capture when the visual cues of referents (i.e. referred objects) are insufficient,
as referents with weak visual cues tend to be easily confused with cluttered background
at boundaries or even overwhelmed by salient objects in the image. Moreover, the insufficient
visual cues issue cannot be handled by the cross-modal fusion mechanisms used
in previous work. In this paper, we tackle this problem from a novel perspective of
enhancing the visual information for the referents by devising a Two-stage Visual
cues enhancement Network (TV-Net), where a novel Retrieval and Enrichment Scheme (RES)
and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically,
RES retrieves the most relevant image from an external data pool with regard to both
the visual and textual similarities, and then enriches the visual information of the
referent with the retrieved image for better multimodal feature learning. AMF further
enhances the visual detailed information by incorporating the high-resolution feature
maps from lower convolution layers of the image. Through the two-stage enhancement,
our proposed TV-Net achieves better performance in learning fine-grained matching behaviors
between the natural language expression and image, especially when the visual information
of the referent is inadequate, and thus produces better segmentation results. Extensive
experiments are conducted to validate the effectiveness of the proposed method on
the RIS task, with our proposed TV-Net surpassing the state-of-the-art approaches
on four benchmark datasets.

Partial Tubal Nuclear Norm Regularized Multi-view Learning

Yongyong Chen
Shuqin Wang
Chong Peng
Guangming Lu
Yicong Zhou

Multi-view clustering and multi-view dimension reduction explore ubiquitous and complementary
information between multiple features to enhance clustering and recognition performance.
However, multi-view clustering and multi-view dimension reduction are treated independently,
ignoring the underlying correlations between them. In addition, previous methods mainly
focus on using the tensor nuclear norm for low-rank representation to explore the
high correlation of multi-view features, which often causes the estimation bias of
the tensor rank. To overcome these limitations, we propose the partial tubal nuclear
norm regularized multi-view learning (PTN2ML) method, in which the partial tubal nuclear
norm, as a non-convex surrogate of the tensor tubal multi-rank, minimizes only the
partial sum of the smaller tubal singular values to preserve the low-rank property
of the self-representation tensor. PTN2ML pursues the latent representation from the
projection space rather than from the input space to reveal the structural consensus
and suppress the disturbance of noisy data. The proposed method can be efficiently
optimized by the alternating direction method of multipliers. Extensive experiments,
including multi-view clustering and multi-view dimension reduction, substantiate the
superiority of the proposed method over state-of-the-art approaches.
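
To illustrate the quantity described in this abstract, the following is a minimal sketch of a partial sum of tubal singular values for a 3-way tensor. It assumes the common t-SVD convention (FFT along the third mode, then a matrix SVD per Fourier slice) and a 1/n3 normalization; the function name, the parameter `r`, and the toy tensor are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def partial_tubal_nuclear_norm(tensor, r):
    """Sum the singular values beyond the r largest in each Fourier slice (assumed 1/n3 scaling)."""
    n3 = tensor.shape[2]
    # The t-SVD is computed slice by slice after an FFT along the third mode.
    tensor_f = np.fft.fft(tensor, axis=2)
    total = 0.0
    for k in range(n3):
        # Singular values are returned in descending order.
        s = np.linalg.svd(tensor_f[:, :, k], compute_uv=False)
        # Skip the r largest values and penalize only the smaller ones.
        total += s[r:].sum()
    return total / n3

# Toy tensor: every frontal slice is the same rank-1 matrix, so the tubal rank is 1.
u = np.array([1.0, 2.0, 2.0])
v = np.array([3.0, 4.0])
toy = np.repeat(np.outer(u, v)[:, :, None], 4, axis=2)
full_norm = partial_tubal_nuclear_norm(toy, 0)     # r = 0 gives the ordinary tubal nuclear norm
partial_norm = partial_tubal_nuclear_norm(toy, 1)  # r = 1 ignores the single dominant value
```

With `r = 0` the function reduces to the ordinary tubal nuclear norm, while `r = 1` leaves the dominant singular value of each slice unpenalized, which is the intuition behind reducing the rank-estimation bias.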

Deep Unsupervised 3D SfM Face Reconstruction Based on Massive Landmark Bundle Adjustment

Yuxing Wang
Yawen Lu
Zhihua Xie
Guoyu Lu

We address the problem of reconstructing a 3D human face from multi-view facial images
using Structure-from-Motion (SfM) based on deep neural networks. While recent learning-based
monocular view methods have shown impressive results for 3D facial reconstruction,
the single-view setting is easily affected by depth ambiguities and poor face pose
issues. In this paper, we propose a novel unsupervised 3D face reconstruction architecture
by leveraging the multi-view geometry constraints to train accurate face pose and
depth maps. Facial images from multiple perspectives of each 3D face model are input
to train the network. Multi-view geometry constraints are fused into the unsupervised
network by establishing loss constraints from spatial and spectral perspectives. To
give the trained 3D face more detail, a facial landmark detector is employed to
acquire massive facial information to constrain face pose and depth estimation. Through
minimizing massive landmark displacement distance by bundle adjustment, an accurate
3D face model can be reconstructed. Extensive experiments demonstrate the superiority
of our proposed approach over other methods.

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

Zhijie Lin
Zhou Zhao
Haoyuan Li
Jinglin Liu
Meng Zhang
Xingshan Zeng
Xiaofei He

Lip reading, aiming to recognize spoken sentences according to the given video of
lip movements without relying on the audio stream, has attracted great interest due
to its application in many scenarios. Although prior works that explore lip reading
have obtained salient achievements, they are all trained in a non-simultaneous manner
where generating predictions requires access to the full video. To break through
this constraint, we study the task of simultaneous lip reading and devise SimulLR,
a simultaneous lip reading transducer with attention-guided adaptive memory from three
aspects: (1) To address the challenge of monotonic alignments while considering the
syntactic structure of the generated sentences under simultaneous setting, we build
a transducer-based model and design several effective training strategies including
CTC pre-training, model warm-up and curriculum learning to promote the training of
the lip reading transducer. (2) To learn better spatio-temporal representations for
simultaneous encoder, we construct a truncated 3D convolution and time-restricted
self-attention layer to perform the frame-to-frame interaction within a video segment
containing a fixed number of frames. (3) The history information is always limited by
storage constraints in real-time scenarios, especially for massive video data. Therefore,
we devise a novel attention-guided adaptive memory to organize semantic information
of history segments and enhance the visual representations with acceptable computation-aware
latency. The experiments show that SimulLR achieves a translation speedup of 9.10x
compared with the state-of-the-art non-simultaneous methods, and also obtains competitive
results, which indicates the effectiveness of our proposed methods.
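
The time-restricted self-attention in point (2) can be pictured with a small mask-based sketch. It assumes non-overlapping segments of equal length and omits learned query/key/value projections; the names `segment_mask` and `time_restricted_attention` are hypothetical and not taken from the paper.

```python
import numpy as np

def segment_mask(num_frames, segment_len):
    """Boolean mask that is True where two frames fall in the same segment."""
    seg_ids = np.arange(num_frames) // segment_len
    return seg_ids[:, None] == seg_ids[None, :]

def time_restricted_attention(x, segment_len):
    """Scaled dot-product self-attention limited to frames of the same segment."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Frames outside the current segment get -inf so their softmax weight is zero.
    scores = np.where(segment_mask(n, segment_len), scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))  # 6 frame features of dimension 4
out, weights = time_restricted_attention(frames, segment_len=3)
```

Because each frame attends only within its own segment, the encoder never needs frames that have not yet arrived, which is what makes the simultaneous setting possible.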

Dense Semantic Contrast for Self-Supervised Visual Representation Learning

Xiaoni Li
Yu Zhou
Yifei Zhang
Aoting Zhang
Wei Wang
Ning Jiang
Haiying Wu
Weiping Wang

Self-supervised representation learning for visual pre-training has achieved remarkable
success with sample (instance or pixel) discrimination and semantics discovery of
instance, whereas there still exists a non-negligible gap between pre-trained model
and downstream dense prediction tasks. Concretely, these downstream tasks require
more accurate representation, in other words, the pixels from the same object must
belong to a shared semantic category, which is lacking in the previous methods. In
this work, we present Dense Semantic Contrast (DSC) for modeling semantic category
decision boundaries at a dense level to meet the requirement of these tasks. Furthermore,
we propose a dense cross-image semantic contrastive learning framework for multi-granularity
representation learning. Specifically, we explicitly explore the semantic structure of
the dataset by mining relations among pixels from different perspectives. For intra-image
relation modeling, we discover pixel neighbors from multiple views. And for inter-image
relations, we enforce pixel representation from the same semantic class to be more
similar than the representation from different classes in one mini-batch. Experimental
results show that our DSC model outperforms state-of-the-art methods when transferring
to downstream dense prediction tasks, including object detection, semantic segmentation,
and instance segmentation. Code will be made available.
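
The inter-image objective described above, pulling together pixel representations of the same semantic class within a mini-batch, can be sketched as a supervised-style contrastive loss over pixel embeddings. The sketch assumes that per-pixel class assignments (for example, pseudo-labels) are already available; the function name, the temperature value, and the toy data are illustrative assumptions rather than the DSC formulation.

```python
import numpy as np

def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean contrastive loss where same-label pixels act as positives."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    # Exclude each pixel's similarity with itself from the softmax.
    logits = np.where(not_self, logits, -np.inf)
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # Average the log-probability over the positives of each anchor pixel.
    has_pos = positives.any(axis=1)
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_sum[has_pos] / positives.sum(axis=1)[has_pos]
    return per_anchor.mean()

pixels = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
consistent = pixel_contrastive_loss(pixels, np.array([0, 0, 1, 1]))
inconsistent = pixel_contrastive_loss(pixels, np.array([0, 1, 0, 1]))
```

The loss is lower when the class assignment agrees with the geometry of the embeddings, which is the behavior the inter-image term is meant to encourage.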

Multiple Object Tracking by Trajectory Map Regression with Temporal Priors Embedding

Xingyu Wan
Sanping Zhou
Jinjun Wang
Rongye Meng

Prevailing Multiple Object Tracking (MOT) works following the Tracking-by-Detection
(TBD) paradigm pay most attention to either object detection in a first step or data
association in a second step. In this paper, we approach the MOT problem from a different
perspective by directly obtaining the embedded spatial-temporal information of trajectories
from raw video data. For this purpose, we propose a joint trajectory locating and attribute
encoding framework for real-time, online MOT. We first introduce a trajectory attribute
representation scheme designed for each tracked target (instead of object) where the
extracted Trajectory Map (TM) encodes the spatial-temporal attributes of a trajectory
across a window of consecutive video frames. Next we present a Temporal Priors Embedding
(TPE) methodology to infer these attributes with a logical reasoning strategy based
on long-term feature dynamics. The proposed MOT framework projects multiple attributes
of tracked targets, e.g., presence, enter/exit, location, scale, motion, etc. into
a continuous TM to perform one-shot regression for real-time MOT. Experimental results
show that our proposed video-based method runs at 33 FPS and is more accurate and
robust than detection-based tracking methods and a few other state-of-the-art
(SOTA) approaches on the MOT16/17/20 benchmarks.

SESSION: Session 14: Multimedia Cloud, Edge and Device Computing

DeepGame: Efficient Video Encoding for Cloud Gaming

Omar Mossad
Khaled Diab
Ihab Amer
Mohamed Hefeeda

Cloud gaming enables users to play games on virtually any device. This is achieved
by offloading the game rendering and encoding to cloud datacenters. As game resolutions
and frame rates increase, cloud gaming platforms face a major challenge to stream
high quality games due to the high bandwidth and low latency requirements. In this
paper, we propose a new video encoding pipeline, called DeepGame, for cloud gaming
platforms to reduce the bandwidth requirements with limited to no impact on the player
quality of experience. DeepGame learns the player's contextual interest in the game
and the temporal correlation of that interest using a spatio-temporal deep neural
network. Then, it encodes various areas in the video frames with different quality
levels proportional to their contextual importance. DeepGame does not change the source
code of the video encoder or the video game, and it does not require any additional
hardware or software at the client side. We implemented DeepGame in an open-source
cloud gaming platform and evaluated its performance using multiple popular games.
We also conducted a subjective study with real players to demonstrate the potential
gains achieved by DeepGame and its practicality. Our results show that DeepGame can
reduce the bandwidth requirements by up to 36% compared to the baseline encoder, while
maintaining the same level of perceived quality for players and running in real time.
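
One way to picture encoding areas at quality levels tied to their importance is a per-block quantization parameter (QP) map. The sketch below assumes an H.264/HEVC-style QP range of 0 to 51 in which a lower QP means higher quality, and it uses a simple linear mapping; the function name, default values, and the importance grid are hypothetical and do not describe how DeepGame itself drives the encoder.

```python
import numpy as np

def importance_to_qp(importance, base_qp=30, max_offset=8, qp_range=(0, 51)):
    """Map per-block importance weights to QP values (lower QP = higher quality)."""
    imp = np.asarray(importance, dtype=float)
    span = imp.max() - imp.min()
    # Normalize to [0, 1]; a flat map falls back to the base QP everywhere.
    norm = (imp - imp.min()) / span if span > 0 else np.full_like(imp, 0.5)
    # The most important block gets -max_offset, the least important +max_offset.
    offsets = np.rint((0.5 - norm) * 2 * max_offset).astype(int)
    return np.clip(base_qp + offsets, *qp_range)

importance = np.array([[0.0, 0.5], [1.0, 0.25]])  # hypothetical 2x2 grid of blocks
qp_map = importance_to_qp(importance)
```

Spending fewer bits on low-importance blocks while protecting high-importance ones is how a region-aware scheme of this kind can lower bandwidth without a visible drop in perceived quality.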

ChartPointFlow for Topology-Aware 3D Point Cloud Generation

Takumi Kimura
Takashi Matsubara
Kuniaki Uehara

A point cloud serves as a representation of the surface of a three-dimensional (3D)
shape. Deep generative models have been adapted to model their variations typically
using a map from a ball-like set of latent variables. However, previous approaches
did not pay much attention to the topological structure of a point cloud, even though
a continuous map cannot express the varying numbers of holes and intersections.
Moreover, a point cloud is often composed of multiple subparts, which are also difficult
to express. In this study, we propose ChartPointFlow, a flow-based generative model
with multiple latent labels for 3D point clouds. Each label is assigned to points
in an unsupervised manner. Then, a map conditioned on a label is assigned to a continuous
subset of a point cloud, similar to a chart of a manifold. This enables our proposed
model to preserve the topological structure with clear boundaries, whereas previous
approaches tend to generate blurry point clouds and fail to generate holes. The experimental
results demonstrate that ChartPointFlow achieves state-of-the-art performance in terms
of generation and reconstruction compared with other point cloud generators. Moreover,
ChartPointFlow divides an object into semantic subparts using charts, and it demonstrates
superior performance in unsupervised segmentation.

Co-learning: Learning from Noisy Labels with Self-supervision

Cheng Tan
Jun Xia
Lirong Wu
Stan Z. Li

Noisy labels, resulting from mistakes in manual labeling or web data collection
for supervised learning, can cause neural networks to overfit the misleading information
and degrade the generalization performance. Self-supervised learning works in the
absence of labels and thus eliminates the negative impact of noisy labels. Motivated
by co-training with both supervised learning view and self-supervised learning view,
we propose a simple yet effective method called Co-learning for learning with noisy
labels. Co-learning performs supervised learning and self-supervised learning in a
cooperative way. The constraints of intrinsic similarity with the self-supervised
module and the structural similarity with the noisily-supervised module are imposed
on a shared common feature encoder to regularize the network to maximize the agreement
between the two constraints. Co-learning is compared with peer methods on corrupted
data from benchmark datasets fairly, and extensive results are provided which demonstrate
that Co-learning is superior to many state-of-the-art approaches.

Graph Convolutional Multi-modal Hashing for Flexible Multimedia Retrieval

Xu Lu
Lei Zhu
Li Liu
Liqiang Nie
Huaxiang Zhang

Multi-modal hashing makes an important contribution to multimedia retrieval, where
a key challenge is to encode heterogeneous modalities into compact hash codes. To
solve this dilemma, graph-based multi-modal hashing methods generally define an individual
affinity matrix for each independent modality and apply a linear algorithm for heterogeneous
modality fusion and compact hash learning. Several other methods construct a graph
Laplacian matrix based on semantic information to help learn discriminative hash codes.
However, these conventional methods largely ignore the structural similarity of the training
set and the complex relations among multi-modal samples, which leads to unsatisfactory
complementarity of fused hash codes. More notably, they face two other important
problems: huge computing and storage costs caused by graph construction, and the loss of
partial modality features when an incomplete query sample arrives. In this paper, we
propose a Flexible Graph Convolutional Multi-modal Hashing (FGCMH) method that adopts
GCNs with linear complexity to preserve both the modality-individual and modality-fused
structural similarity for discriminative hash learning. As a result, accurate multimedia
retrieval can be performed on both complete and incomplete datasets with our method. Specifically,
multiple modality-individual GCNs under semantic guidance are proposed to act on each
individual modality independently for intra-modality similarity preserving, then the
output representations are fused into a fusion graph with an adaptive weighting scheme.
Hash GCN and semantic GCN, which share parameters in the first two layers, propagate
fusion information and generate hash codes under high-level label space supervision.
In the query stage, our method adaptively captures various multi-modal contents in
a flexible and robust way, even if partial modality features are lost. Experimental
results on three public datasets show the flexibility and effectiveness of our proposed
method.

Hybrid Network Compression via Meta-Learning

Jianming Ye
Shiliang Zhang
Jingdong Wang

Neural network pruning and quantization are two major lines of network compression.
This raises the natural question of whether we can find the optimal compression by
considering multiple network compression criteria in a unified framework. This paper
incorporates two criteria and seeks layer-wise compression by leveraging the meta-learning
framework. A regularization loss is applied to unify the constraint of input and output
channel numbers, bit-width of network activations and weights, so that the compressed
network can satisfy a given Bit-OPerations counts (BOPs) constraint. We further propose
an iterative compression constraint for optimizing the compression procedure, which
effectively achieves a high compression rate and maintains the original network performance.
Extensive experiments on various networks and vision tasks show that the proposed
method yields better performance and compression rates than recent methods. For instance,
our method achieves better image classification accuracy and compactness than the
recent DJPQ. It achieves performance similar to the recent DHP in image super-resolution,
while saving about 50% computation.
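
To show how channel counts and bit-widths combine into a single budget, here is a minimal sketch of a BOPs estimate for convolution layers. It assumes the simplified definition BOPs = MACs x weight bits x activation bits, which may differ from the exact formula used in the paper; the layer shapes and the budget are made-up illustration values.

```python
def conv_bops(c_in, c_out, kernel, h_out, w_out, w_bits, a_bits):
    """Bit-operations of one conv layer under the simplified MACs * b_w * b_a definition."""
    macs = c_in * c_out * kernel * kernel * h_out * w_out
    return macs * w_bits * a_bits

def network_bops(layers):
    """Total BOPs of a list of layer descriptions (dicts of conv_bops arguments)."""
    return sum(conv_bops(**layer) for layer in layers)

# Hypothetical two-layer network at 8-bit weights and activations.
baseline = network_bops([
    dict(c_in=3, c_out=16, kernel=3, h_out=32, w_out=32, w_bits=8, a_bits=8),
    dict(c_in=16, c_out=32, kernel=3, h_out=16, w_out=16, w_bits=8, a_bits=8),
])
# Same network after halving the channels (pruning) and moving to 4 bits (quantization).
compressed = network_bops([
    dict(c_in=3, c_out=8, kernel=3, h_out=32, w_out=32, w_bits=4, a_bits=4),
    dict(c_in=8, c_out=16, kernel=3, h_out=16, w_out=16, w_bits=4, a_bits=4),
])
budget = baseline // 10            # an assumed target of 10% of the baseline cost
meets_budget = compressed <= budget
```

Because pruning shrinks the MAC count and quantization shrinks the bit-width factors, both criteria act on the same product, which is why a single BOPs constraint can govern them jointly.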

Two-pronged Strategy: Lightweight Augmented Graph Network Hashing for Scalable Image Retrieval

Hui Cui
Lei Zhu
Jingjing Li
Zhiyong Cheng
Zheng Zhang

Hashing learns compact binary codes to store and retrieve massive data efficiently.
Particularly, unsupervised deep hashing is supported by powerful deep neural networks
and has the desirable advantage of label independence. It is a promising technique
for scalable image retrieval. However, deep models introduce a large number of parameters,
which are hard to optimize due to the lack of explicit semantic labels and bring considerable
training cost. As a result, the retrieval accuracy and training efficiency of existing
unsupervised deep hashing are still limited. To tackle the problems, in this paper,
we propose a simple and efficient Lightweight Augmented Graph Network Hashing (LAGNH)
method with a two-pronged strategy. For one thing, we extract the inner structure
of the image as the auxiliary semantics to enhance the semantic supervision of the
unsupervised hash learning process. For another, we design a lightweight network structure
with the assistance of the auxiliary semantics, which greatly reduces the number of
network parameters that need to be optimized and thus greatly accelerates the training
process. Specifically, we design a cross-modal attention module based on the auxiliary
semantic information to adaptively mitigate the adverse effects in the deep image
features. Besides, the hash codes are learned by multi-layer message passing within
an adversarial regularized graph convolutional network. Simultaneously, the semantic
representation capability of hash codes is further enhanced by reconstructing the
similarity graph. Experimental results show that our method achieves significant performance
improvement compared with the state-of-the-art unsupervised deep hashing methods in
terms of both retrieval accuracy and efficiency. Notably, on the MS-COCO dataset, our
method achieves more than 10% improvement on retrieval precision and 2.7x speedup
on training time compared with the second best result.
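
The efficiency of hashing-based retrieval comes from comparing compact binary codes with the Hamming distance. The sketch below assumes sign thresholding of real-valued features and a brute-force ranking; the helper names and the toy features are illustrative and unrelated to the LAGNH network itself.

```python
import numpy as np

def binarize(features):
    """Threshold real-valued features at zero to obtain 0/1 hash codes."""
    return (np.asarray(features) >= 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query, plus the distances."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

# Hypothetical 4-bit codes for three database images and one query.
db = binarize([[0.9, -0.2, 0.4, -0.7],
               [-0.5, 0.3, -0.1, 0.8],
               [0.7, -0.6, -0.3, -0.2]])
query = binarize([0.8, -0.1, 0.5, -0.9])
order, dists = hamming_rank(query, db)
```

In practice the codes would be packed into machine words and compared with XOR and popcount, but the ranking logic is the same.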

SESSION: Interactive Arts

Reconstruction: A Motion Driven Interactive Artwork Inspired by Chinese Shadow Puppet

Wenli Jiang
Chong Cao

Shadow puppet play is a representative Chinese intangible cultural heritage, which
has a history of more than two thousand years. However, with the popularity of digital
media, this traditional art form has declined. "Reconstruction" is an interactive
digital artwork inspired by the production and performance of Chinese shadow puppetry.
The scenes and characters are designed based on the art style of shadow puppetry. The
participant's motion is captured with a Kinect and used to control the motion of the
character.

Syntropic Counterpoints: Metaphysics of The Machines

Predrag K. Nikolic
Ruiyang Liu
Shengcheng Luo

In the artwork Syntropic Counterpoints: Metaphysics of The Machines, we explore the
phenomenon of AI aesthetics and challenge machine abstraction. Our approach toward the
liberation of machine creativity is through the use of words and grammar as a creative
tool humans developed to express worlds "beyond" the world, existing and non-existing
realities. We are led by Nietzsche's claim that grammar is the "Metaphysics of the
People"; as such, the grammar, content, and vision generated during the philosophical discussion
between our AI clones form a "Metaphysics of Machines" through which we can experience their
realities and start to question our own.

Kandinsky Mobile: Abstract Art-Inspired Interactive Visualization of Social Discussions on Mobile Devices

Castillo Clarence Fitzgerald Gumtang
Sourav S. Bhowmick

Kandinsky Mobile is a mobile device-based interactive artwork that generates and displays
the social discussion landscape associated with a social media anchor post using a
collection of colorful circles and concentric circles. It draws inspiration from the
famous abstract geometric art forms of Russian painter Wassily Kandinsky (1866-1944).
Intuitively, a circle and a concentric circle represent a social comment and a collection
of comments in a discussion thread, respectively. The artwork aims to facilitate user-friendly
and effective understanding and visualization of large volumes of comments associated
with an anchor post.

Sand Scope: An Interactive Installation for Revealing the Connection Between Mental Space and
Life Space in a Microcosm of the World

Lyn Chao-ling Chen

In the artwork, the topic of life space is discussed. Instead of physical space,
the mental space of human beings is considered. People usually focus on themselves to
solve various life tasks, and the scales of their mental space influence how they
perceive the world. The artwork tries to make people aware of the connection between
mental space and life space. The Sand Scope introduces a microcosm of the world, for
comparing the scale of mental space with the scale of the microcosm, from the relative
scales between the microcosm and the whole world. From the new perspective, the Sand
Scope reminds people to escape from the routine of their daily lives and rethink
the meaning of life. Multimedia input contains gray image analysis to form the stamps
with portraits of current audiences and past participants, color image subtraction
to compose the texture of the mountain drawing with clothing information on the painting,
and a buffer with timer to capture and replay ambient sounds continuously in a delayed
time, along with color images with blending effect in the period. In the interactive
installation, an improvisational painting in the form of a Chinese brush painting
with stamps from connoisseurs was exhibited. The generation of stamps from the audiences
on the painting also indicates that they are parts of the microcosm. The microcosm
was constructed from the elements of the inhabitants who live in the real world in
physical aspect, and the awareness of meaning of life implies harmony between nature
and humanity on the Zen painting in mental aspect.

Heraclitus's Forest: An Interactive Artwork for Oral History

Lin Wang
Zhonghao Lin
Wei Cai

Heraclitus's Forest is an interactive artwork that utilizes birch trees as a metaphor
for the life stories recorded in an oral history database. We design a day/night cycle
system to present the forest experience as time elapses, multiple interaction
modes to engage audiences in history exploration, and an evolving forest
to arouse people's reflection on the nature of history, which is constantly being
constructed but can never be returned to.

Affective Color Fields: Reimagining Rothkoesque Artwork as an Interactive Companion for Artistic Self-Expression

Aiden Kang
Liang Wang
Ziyu Zhou
Zhe Huang
Robert J.K. Jacob

In this art project, we create Affective Color Fields: an interactive artifact that
takes in a user's narrative of their emotional experiences and dynamically transforms
it into Rothkoesque color fields through emotion classification. Inspired by Mark
Rothko's abstract depiction of human emotions and Merleau-Ponty's phenomenological
inquiry, we wish to establish an intimate relationship between interactive art and
the subject by employing the user's own interpretation and framing of life events. Through
the performative and improvisational art-making process, users can playfully appropriate
our artifact for a rich and personal aesthetic experience.

Apercevoir: Bio Internet of Things Interactive System

You-Yang Hu
Chiao-Chi Chou
Chia-Wei Li

Apercevoir is an artwork that can perceive its environmental perturbations and convert
them into a spatial sound field with location information. It consists of multiple
plant cyborgs comprised of a Mimosa Pudica (sensitive plant) connected to a bioamplifier
and can sense human movements by analyzing the biosignals with a machine learning
model. Through sharing multiple cyborgs' biosignals, this network portrays the concept
of multiple beings transcending an individual's physical confines to form a Bio Internet
of Things (IoT) system capable of perception, feedback, and group decision-making
within a wider scope. A particular feature of this system is its interactive bone
induction headphones, where the audience can listen to a sound field including 'vibrations'
of nearby human activities detected by plant cyborgs, and even warnings among the
cyborg network responding to foreign disturbance and damage. This sound field invites
audiences to close their eyes and listen attentively to plants while the biosignals
and changes in sound reveal the presence of other entities in the space.

SESSION: Poster Session 2

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Zheng Wang
Jingjing Chen
Yu-Gang Jiang

Video moment retrieval aims to localize the most relevant video moment given the text
query. Weakly supervised approaches leverage video-text pairs only for training, without
temporal annotations. Most current methods align the proposed video moment and the
text in a joint embedding space. However, without temporal annotations, the semantic
gap between these two modalities leads most methods to focus predominantly on learning
a joint feature representation, with less emphasis on learning visual feature representation. This
paper aims to improve the visual feature representation with supervision in the visual
domain, obtaining discriminative visual features for cross-modal learning. Our approach is based on
the observation that relevant video moments (i.e., those sharing similar activities) from
different videos are commonly described by similar sentences; hence the visual features
of these relevant video moments should also be similar even though they come from
different videos. Therefore, to obtain more discriminative and robust visual features
for video moment retrieval, we propose to align the visual features of relevant video
moments from different videos that co-occurred in the same training batch. Besides,
a contrastive learning approach is introduced for learning the moment-level alignment
of these videos. Through extensive experiments, we demonstrate that the proposed visual
co-occurrence alignment learning method outperforms the cross-modal alignment learning
counterpart and achieves promising results for video moment retrieval.

Adaptive Normalized Representation Learning for Generalizable Face Anti-Spoofing

ShuBao Liu
Ke-Yue Zhang
Taiping Yao
Mingwei Bi
Shouhong Ding
Jilin Li
Feiyue Huang
Lizhuang Ma

With various face presentation attacks arising under unseen scenarios, face anti-spoofing
(FAS) based on domain generalization (DG) has drawn growing attention due to its robustness.
Most existing methods utilize DG frameworks to align the features to seek a compact
and generalized feature space. However, little attention has been paid to the feature
extraction process for the FAS task, especially the influence of normalization, which
also has a great impact on the generalization of the learned representation. To address
this issue, we propose a novel perspective of face anti-spoofing that focuses on the
normalization selection in the feature extraction process. Concretely, an Adaptive
Normalized Representation Learning (ANRL) framework is devised, which adaptively selects
feature normalization methods according to the inputs, aiming to learn domain-agnostic
and discriminative representation. Moreover, to facilitate the representation learning,
Dual Calibration Constraints are designed, including Inter-Domain Compatible loss
and Inter-Class Separable loss, which provide a better optimization direction for
generalizable representation. Extensive experiments and visualizations are presented
to demonstrate the effectiveness of our method against the SOTA competitors.

Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

Haozhe Wu
Jia Jia
Haoyu Wang
Yishun Dou
Chao Duan
Qingshan Deng

People talk with diversified styles. For one piece of speech, different talking styles
exhibit significant differences in the facial and head pose movements. For example,
the "excited" style usually talks with the mouth wide open, while the "solemn" style
is more standardized and seldom exhibits exaggerated motions. Due to such huge differences
between different styles, it is necessary to incorporate the talking style into the audio-driven
talking face synthesis framework. In this paper, we propose to inject style into the
talking face synthesis framework by imitating the arbitrary talking style of a
particular reference video. Specifically, we systematically investigate talking styles
with our collected Ted-HD dataset and construct style codes as several statistics
of 3D morphable model (3DMM) parameters. Afterwards, we devise a latent-style-fusion
(LSF) model to synthesize stylized talking faces by imitating talking styles from
the style codes. We emphasize the following novel characteristics of our framework:
(1) It doesn't require any annotation of the style; the talking style is learned in
an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary
styles from arbitrary videos, and the style codes can also be interpolated to generate
new styles. Extensive experiments demonstrate that the proposed framework has the
ability to synthesize more natural and expressive talking styles compared with baseline
methods.

Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification

Zhongxing Ma
Yifan Zhao
Jia Li

Person Re-Identification (Re-Id) in occlusion scenarios is a challenging problem because
a pedestrian can be partially occluded. The use of local information for feature extraction
and matching is still necessary. Therefore, we propose a Pose-guided inter- and intra-part
relational transformer (Pirt) for occluded person Re-Id, which builds part-aware long-term
correlations by introducing a transformer. In our framework, we first develop a pose-guided
feature extraction module with regional grouping and mask construction for robust
feature representations. The positions of a pedestrian in the image under surveillance
scenarios are relatively fixed; hence we propose intra-part and inter-part relational
transformers. The intra-part module creates local relations with mask-guided features,
while the inter-part module builds correlations with transformers to develop
cross relationships between part nodes. With collaborative learning of inter- and
intra-part relationships, experiments show that our proposed Pirt model achieves
a new state of the art on the public occluded dataset, and further extensions to standard
non-occluded person Re-Id datasets show comparable performance.

VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation
and Adaptation

Jiong Wang
Zhou Zhao
Weike Jin
Xinyu Duan
Zhen Lei
Baoxing Huai
Yiling Wu
Xiaofei He

For face presentation attack detection (PAD), most of the spoofing cues are subtle,
local image patterns (e.g., local image distortion, 3D mask edge and cut photo edges).
The representations of existing PAD works, which use a simple global pooling method, however,
lose the local feature discriminability. In this paper, the VLAD aggregation method
is adopted to quantize local features with a visual vocabulary that locally partitions
the feature space, and hence preserves the local discriminability. We further propose
the vocabulary separation and adaptation method to modify VLAD for the cross-domain PAD
task. The proposed vocabulary separation method divides the vocabulary into domain-shared
and domain-specific visual words to cope with the diversity of live and attack faces
under the cross-domain scenario. The proposed vocabulary adaptation method imitates
the maximization step of the k-means algorithm in the end-to-end training, which guarantees
that the visual words are close to the centers of their assigned local features and thus brings
robust similarity measurement. We give illustrations and extensive experiments to
demonstrate the effectiveness of VLAD with the proposed vocabulary separation and
adaptation method on standard cross-domain PAD benchmarks. The codes are available
at https://github.com/Liubinggunzu/VLAD-VSA.
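
For readers unfamiliar with VLAD, the following sketch shows classic hard-assignment VLAD aggregation together with a k-means-style center update, which is the step the vocabulary adaptation is said to imitate. It is an illustration with made-up 2D features and a two-word vocabulary; end-to-end variants typically use soft assignment, and none of the names below come from the linked repository.

```python
import numpy as np

def assign(features, vocab):
    """Index of the nearest visual word for each local feature."""
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vlad(features, vocab):
    """Sum residuals to the assigned visual word, then L2-normalize the flattened result."""
    idx = assign(features, vocab)
    desc = np.zeros_like(vocab, dtype=float)
    for j in range(len(vocab)):
        members = features[idx == j]
        if len(members):
            desc[j] = (members - vocab[j]).sum(axis=0)
    flat = desc.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

def update_vocab(features, vocab):
    """k-means maximization step: move each word to the mean of its assigned features."""
    idx = assign(features, vocab)
    new_vocab = vocab.astype(float).copy()
    for j in range(len(vocab)):
        members = features[idx == j]
        if len(members):
            new_vocab[j] = members.mean(axis=0)
    return new_vocab

local_feats = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
vocab = np.array([[1.0, 1.0], [10.0, 11.0]])
descriptor = vlad(local_feats, vocab)
adapted_vocab = update_vocab(local_feats, vocab)
```

After the update each visual word sits at the center of its assigned features, so the residuals measure deviation from a representative local pattern rather than from an arbitrary point.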

End-to-End Video Object Detection with Spatial-Temporal Transformers

Lu He
Qianyu Zhou
Xiangtai Li
Li Niu
Guangliang Cheng
Xiao Li
Wenxuan Liu
Yunhai Tong
Lizhuang Ma
Liqing Zhang

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many
hand-designed components in object detection while demonstrating performance as good
as previous complex hand-crafted detectors. However, their performance on Video Object
Detection (VOD) has not been well explored. In this paper, we present TransVOD, an
end-to-end video object detection model based on a spatial-temporal Transformer architecture.
The goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g., optical flow,
recurrent neural networks, relation networks. Besides, benefiting from the object query
design in DETR, our method does not need complicated post-processing methods such
as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular,
we present a temporal Transformer to aggregate both the spatial object queries and the
feature memories of each frame. Our temporal Transformer consists of three components:
Temporal Deformable Transformer Encoder (TDTE) to encode the multiple frame spatial
details, Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable
Transformer Decoder (TDTD) to obtain current frame detection results. These designs
boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the
ImageNet VID dataset. TransVOD yields comparable performance on the benchmark
of ImageNet VID. We hope our TransVOD can provide a new perspective for video object
detection.

Joint-teaching: Learning to Refine Knowledge for Resource-constrained Unsupervised
Cross-modal Retrieval

Peng-Fei Zhang
Jiasheng Duan
Zi Huang
Hongzhi Yin

Cross-modal retrieval has received considerable attention because it lets users
search for desired information across diverse forms of data. Existing retrieval
methods achieve good performance mainly by relying on complex deep neural networks and
high-quality supervision signals, which hinders their development and deployment in
real-world resource-constrained settings. In this paper, we propose an effective unsupervised learning
framework named JOint-teachinG (JOG) to pursue a high-performance yet light-weight
cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained
model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student")
with strong feature learning ability and predictive power. Considering that a teacher
model serving the same task as the student is not always available, we resort to a
cross-task teacher to leverage transferable knowledge to guide student learning.
To eliminate the inevitable noise in the distilled knowledge resulting from the task
discrepancy, an online knowledge-refinement strategy is designed to progressively
improve the quality of the cross-task knowledge in a joint-teaching manner, where
a peer student is engaged. In addition, the proposed JOG learns to represent the original
high-dimensional data with compact binary codes to accelerate the query processing,
further facilitating resource-limited retrieval. Through extensive experiments, we
demonstrate that in various network structures, the proposed method can yield promising
learning results on widely-used benchmarks. The proposed research is a pioneering
work for resource-constrained cross-modal retrieval, which has strong potential to
be applied to on-device deployment and is hoped to pave the way for further study.
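
The role of compact binary codes in speeding up query processing can be illustrated with a minimal sketch. It assumes sign-based binarization of real-valued features and Hamming-distance ranking, which are common hashing conventions and not necessarily the exact scheme used by JOG; the helper names `to_code`, `hamming`, and `rank` are hypothetical.

```python
def to_code(features):
    """Pack a real-valued feature vector into an integer bit code.

    Assumption: one bit per dimension, set when the value is positive.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def rank(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
```

Because each comparison is an XOR followed by a bit count, ranking stays cheap in both time and storage even for large databases, which is the property the abstract relies on for resource-limited retrieval.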

AggNet for Self-supervised Monocular Depth Estimation: Go An Aggressive Step Further

Zhi Chen
Xiaoqing Ye
Liang Du
Wei Yang
Liusheng Huang
Xiao Tan
Zhenbo Shi
Fumin Shen
Errui Ding

Without appealing to exhaustive labeled data, self-supervised monocular depth estimation
(MDE) plays a fundamental role in computer vision. Previous methods usually adopt
a one-stage MDE network, which is insufficient to achieve high performance. In this
paper, we dig deep into this task to propose an aggressive framework termed AggNet.
The framework is based on a training-only progressive two-stage module to perform
pseudo counter-surveillance as well as a simple yet effective dual-warp loss function
between image pairs. In particular, we first propose a residual module, which follows
the MDE network to learn a refined depth. The residual module takes both the initial
depth generated from MDE and the initial color image as input to generate refined
depth with residual depth learning. Then, the refined depth is leveraged to supervise
the initial depth simultaneously during the training period. For inference, only the
MDE network is retained to regress depth from a single image, which gains better performance
without introducing extra computation. In addition to self-distillation loss, a simple
yet effective dual-warp consistency loss is introduced to encourage the MDE network
to keep depth consistency between stereo image pairs. Extensive experiments show that
our AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.

Boosting Lightweight Single Image Super-resolution via Joint-distillation

Xiaotong Luo
Qiuyuan Liang
Ding Liu
Yanyun Qu

The rise of deep learning has facilitated the development of single image super-resolution
(SISR). However, the increasingly burdensome model complexity and memory occupation severely
hinder its practical deployment on resource-limited devices. In this paper, we propose
a novel joint-distillation (JDSR) framework to boost the representation of various
off-the-shelf lightweight SR models. The framework includes two stages: the superior
LR generation and the joint-distillation learning. The superior LR is obtained from
the HR image itself. With fewer than 300K parameters, the peer network using superior
LR as input can achieve SR performance comparable to large models, e.g., RCAN with
15M parameters, so using superior LR as the peer network's input saves training
expense. The joint-distillation learning consists of internal self-distillation and
external mutual learning. The internal self-distillation aims to achieve model self-boosting
by transferring the knowledge from the deeper SR output to the shallower one. Specifically,
each intermediate SR output is supervised by the HR image and the soft label from
subsequent deeper outputs. To shrink the capacity gap between shallow and deep layers,
a soft label generator is designed in a progressive backward fusion way with meta-learning
for adaptive weight fine-tuning. The external mutual learning focuses on obtaining
interaction information from a peer network in the process. Moreover, a curriculum
learning strategy and a performance gap threshold are introduced for balancing the
convergence rate of the original SR model and its peer network. Comprehensive experiments
on benchmark datasets demonstrate that our proposal improves the performance of recent
lightweight SR models by a large margin, with the same model architecture and inference
expense.

Discriminator-free Generative Adversarial Attack

Shaohao Lu
Yuqiao Xian
Ke Yan
Yi Hu
Xing Sun
Xiaowei Guo
Feiyue Huang
Wei-Shi Zheng

Deep Neural Networks (DNNs) are vulnerable to adversarial examples (Figure 1): adding
inconspicuous perturbations to images can make DNN-based systems collapse. Most
existing adversarial attacks are gradient-based and suffer from high latency and a
heavy load on GPU memory. Generative adversarial attacks can overcome this limitation,
and some related works propose approaches based on GANs. However, because training
a GAN is difficult to converge, the resulting adversarial examples have either poor
attack ability or poor visual quality. In this work, we find that the discriminator
may not be necessary for generative adversarial attacks, and propose the Symmetric
Saliency-based Auto-Encoder (SSAE) to generate the perturbations. SSAE is composed of
a saliency map module and an angle-norm feature disentanglement module. The advantage
of our method is that it does not depend on a discriminator and uses the generative
saliency map to pay more attention to label-relevant regions. Extensive experiments
across various tasks, datasets, and models demonstrate that the adversarial examples
generated by SSAE not only make widely-used models collapse, but also achieve good
visual quality. The code is available at: https://github.com/BravoLu/SSAE.

Former-DFER: Dynamic Facial Expression Recognition Transformer

Zengqun Zhao
Qingshan Liu

This paper proposes a dynamic facial expression recognition transformer (Former-DFER)
for the in-the-wild scenario. Specifically, the proposed Former-DFER mainly consists
of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former).
The CS-Former consists of five convolution blocks and N spatial encoders, and is
designed to guide the network to learn occlusion- and pose-robust facial features from
the spatial perspective. The T-Former consists of M temporal encoders,
and is designed to allow the network to learn contextual facial features from the
temporal perspective. The heatmaps of the learned facial features demonstrate that
the proposed Former-DFER is capable of handling issues such as occlusion, non-frontal
pose, and head motion. The visualization of the feature distribution shows that
the proposed method can learn more discriminative facial features. Moreover, our Former-DFER
also achieves state-of-the-art results on the DFEW and AFEW benchmarks.

Discovering Density-Preserving Latent Space Walks in GANs for Semantic Image Transformations

Guanyue Li
Yi Liu
Xiwen Wei
Yang Zhang
Si Wu
Yong Xu
Hau-San Wong

Generative adversarial network (GAN)-based models possess superior capability of high-fidelity
image synthesis. There is a wide range of semantically meaningful directions in the
latent representation space of well-trained GANs, and the corresponding latent space
walks are meaningful for semantic controllability in the synthesized images. To explore
the underlying organization of a latent space, we propose an unsupervised Density-Preserving
Latent Semantics Exploration model (DP-LaSE). The important latent directions are
determined by maximizing the variations in intermediate features, while the correlation
between the directions is minimized. Considering that latent codes are sampled from
a prior distribution, we adopt a density-preserving regularization approach to ensure
latent space walks are maintained in iso-density regions, since moving to a higher/lower
density region tends to cause unexpected transformations. To further refine semantics-specific
transformations, we perform subspace learning over intermediate feature channels,
such that the transformations are limited to the most relevant subspaces. Extensive
experiments on a variety of benchmark datasets demonstrate that DP-LaSE is able to
discover interpretable latent space walks, and specific properties of synthesized
images can thus be precisely controlled.
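
To make the idea of an iso-density walk concrete, here is a minimal sketch under a stated assumption: the prior is a standard Gaussian, whose density depends only on the norm of the latent code, so a step that preserves the norm stays on the same density level. This is an illustration of the constraint only, not the regularizer used in DP-LaSE; the function name `iso_density_step` is hypothetical.

```python
import math


def iso_density_step(z, direction, step):
    """Move z along `direction`, then rescale back to the original norm.

    Assumption: z is drawn from a standard Gaussian prior, so codes with
    equal norm have equal prior density.
    """
    norm_z = math.sqrt(sum(v * v for v in z))
    if norm_z == 0.0:
        raise ValueError("z must be non-zero")
    moved = [v + step * d for v, d in zip(z, direction)]
    norm_moved = math.sqrt(sum(v * v for v in moved))
    if norm_moved == 0.0:
        raise ValueError("step landed on the origin; choose a different step")
    scale = norm_z / norm_moved
    return [v * scale for v in moved]
```

A plain additive step `z + step * direction` would generally change the norm and therefore drift toward a higher or lower density region, which is the behaviour the abstract describes as causing unexpected transformations.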

MGH: Metadata Guided Hypergraph Modeling for Unsupervised Person Re-identification

Yiming Wu
Xintian Wu
Xi Li
Jian Tian

As a challenging task, unsupervised person ReID aims to match query images with images
of the same identity without requiring any labeled information. In general, most existing
approaches focus on the visual cues only, leaving potentially valuable auxiliary metadata
information (e.g., spatio-temporal context) unexplored. In the real world, such metadata
is normally available alongside captured images, and thus plays an important role
in separating several hard ReID matches. With this motivation in mind, we propose
MGH, a novel unsupervised person ReID approach that uses meta information to construct
a hypergraph for feature learning and label refinement. In principle, the hypergraph
is composed of camera-topology-aware hyperedges, which can model the heterogeneous
data correlations across cameras. Taking advantage of label propagation on the hypergraph,
the proposed approach is able to effectively refine the ReID results, such as correcting
the wrong labels or smoothing the noisy labels. Given the refined results, we further
present a memory-based listwise loss to directly optimize the average precision in
an approximate manner. Extensive experiments on three benchmarks demonstrate the effectiveness
of the proposed approach against the state-of-the-art.

Recovering the Unbiased Scene Graphs from the Biased Ones

Meng-Jiun Chiou
Henghui Ding
Hanshu Yan
Changhu Wang
Roger Zimmermann
Jiashi Feng

Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical
representations describing visual relationships among salient objects. Recently, more
effort has been devoted to the long tail problem in SGG; however, the imbalance in
the fraction of missing labels across classes, or reporting bias, which exacerbates
the long tail, is rarely considered and cannot be solved by existing debiasing
methods. In this paper we show that, due to the missing labels, SGG can be viewed
as a "Learning from Positive and Unlabeled data" (PU learning) problem, where the
reporting bias can be removed by recovering the unbiased probabilities from the biased
ones by utilizing label frequencies, i.e., the per-class fraction of labeled, positive
examples in all the positive examples. To obtain accurate label frequency estimates,
we propose Dynamic Label Frequency Estimation (DLFE) to take advantage of training-time
data augmentation and average over multiple training iterations to introduce more
valid examples. Extensive experiments show that DLFE is more effective in estimating
label frequencies than a naive variant of the traditional estimate, and DLFE significantly
alleviates the long tail and achieves state-of-the-art debiasing performance on the
VG dataset. We also show qualitatively that SGG models with DLFE produce prominently
more balanced and unbiased scene graphs. The source code is publicly available.
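
The recovery step described above can be sketched as follows. The sketch assumes the standard PU-learning identity in which the unbiased probability of a class equals its biased (labeled) probability divided by that class's label frequency; the abstract does not spell out the exact formula, and the function name `recover_unbiased` and the clipping to 1 are illustrative choices.

```python
def recover_unbiased(biased_probs, label_freqs):
    """Divide each class's biased probability by its label frequency.

    biased_probs: per-class probabilities from a model trained on the
        observed (incomplete) labels.
    label_freqs: per-class fraction of positive examples that are labeled,
        each in (0, 1].
    Results are clipped to 1.0 so they remain valid probabilities.
    """
    if len(biased_probs) != len(label_freqs):
        raise ValueError("inputs must have one entry per class")
    unbiased = []
    for prob, freq in zip(biased_probs, label_freqs):
        if not 0.0 < freq <= 1.0:
            raise ValueError("label frequencies must lie in (0, 1]")
        unbiased.append(min(1.0, prob / freq))
    return unbiased
```

Classes whose positives are rarely annotated have a small label frequency, so their scores are scaled up the most, which is how this kind of correction counteracts reporting bias.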

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Fa-Ting Hong
Jia-Chang Feng
Dan Xu
Ying Shan
Wei-Shi Zheng

Weakly supervised temporal action localization (WS-TAL) is a challenging task that
aims to localize action instances in the given video with video-level categorical
supervision. Previous works use the appearance and motion features extracted from
a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion.
In this work, we argue that features extracted from pre-trained extractors, e.g.,
I3D, are trained for trimmed video action classification rather than specifically for
the WS-TAL task, leading to inevitable redundancy and sub-optimization. Therefore,
feature re-calibration is needed to reduce the task-irrelevant information redundancy.
Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem.
In CO2-Net, we introduce two identical cross-modal consensus modules
(CCM), each of which uses a cross-modal attention mechanism to filter out the task-irrelevant
information redundancy using the global information from the main modality and the
cross-modal local information from the auxiliary modality. Moreover, we further explore
inter-modality consistency, where we treat the attention weights derived from each
CCM as the pseudo targets of the attention weights derived from another CCM to maintain
the consistency between the predictions derived from two CCMs, forming a mutual learning
manner. Finally, we conduct extensive experiments on two commonly used temporal action
localization datasets, THUMOS14 and ActivityNet1.2, on which our method
achieves state-of-the-art results. The experimental results show that our proposed
cross-modal consensus module can produce more representative features for temporal
action localization.

Searching a Hierarchically Aggregated Fusion Architecture for Fast Multi-Modality
Image Fusion

Risheng Liu
Zhu Liu
Jinyuan Liu
Xin Fan

Multi-modality image fusion refers to generating a complementary image that integrates
typical characteristics from source images. In recent years, we have witnessed the
remarkable progress of deep learning models for multi-modality fusion. Existing CNN-based
approaches strain every nerve to design various architectures for realizing these
tasks in an end-to-end manner. However, these handcrafted designs are unable to cope
with highly demanding fusion tasks, resulting in blurred targets and lost textural
details. To alleviate these issues, in this paper, we propose a novel approach, aiming
at searching effective architectures according to various modality principles and
fusion mechanisms.

Specifically, we construct a hierarchically aggregated fusion architecture to extract
and refine fused features from feature-level and object-level fusion perspectives,
which is responsible for obtaining complementary target/detail representations. Then
by investigating diverse effective practices, we compose a more flexible fusion-specific
search space. Motivated by the collaborative principle, we employ a new search strategy
with different principled losses and hardware constraints for sufficient discovery
of components. As a result, we can obtain a task-specific architecture with fast inference
time. Extensive quantitative and qualitative results demonstrate the superiority and
versatility of our method against state-of-the-art methods.

SuperFront: From Low-resolution to High-resolution Frontal Face Synthesis

Yu Yin
Joseph P. Robinson
Songyao Jiang
Yue Bai
Can Qin
Yun Fu

Even the most impressive achievement in frontal face synthesis is challenged by large
poses and low-quality data given one single side-view face. We propose a synthesizer
called SuperFront GAN (SF-GAN) that accepts one or more low-resolution (LR) faces with
various poses as input and outputs a high-resolution (HR) frontal face while
preserving identity information. SF-GAN includes intra-class and inter-class
constraints, which allow it to learn an identity-preserving representation from multiple
LR faces in an improved, comprehensive manner. We adopt an orthogonal loss as the
intra-class constraint that diversifies the learned feature-space per subject. Hence,
each sample is made to complement the others as much as possible. Additionally, a triplet
loss is used as the inter-class constraint: it improves the discriminative power of
the new representation, which, hence, maintains the identity information. Furthermore,
we integrate a super-resolution (SR) side-view module as part of the SF-GAN to help
preserve the finer details of HR side-views. This helps the model reconstruct the
high-frequency parts of the face (i.e., the periocular, nose, and mouth regions).
Quantitative and qualitative results demonstrate the superiority of SF-GAN. SF-GAN
holds promise as a pre-processing step to normalize and align faces before passing
them to a CV system for processing.
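
The inter-class constraint mentioned above is a triplet loss. A minimal sketch of the usual margin-based form is given here, assuming Euclidean distance between embeddings and a margin of 0.2; the abstract does not state the distance, the margin, or how triplets are mined in SF-GAN, so those are placeholders.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss.

    Zero when the anchor is closer to the positive (same identity) than
    to the negative (different identity) by at least `margin`.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

Minimizing this quantity pulls embeddings of the same subject together and pushes different subjects apart, which is the sense in which the constraint improves discriminative power.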

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

Chen Jiang
Kaiming Huang
Sifeng He
Xudong Yang
Wei Zhang
Xiaobo Zhang
Yuan Cheng
Lei Yang
Qing Wang
Furong Xu
Tan Pan
Wei Chu

With the explosive growth of web videos in recent years, large-scale Content-Based
Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation,
and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end time
of similar segments in finer granularity, which is beneficial for user browsing efficiency
and infringement detection especially in long video scenarios. The challenge of S-CBVR
task is how to achieve high temporal alignment accuracy with efficient computation
and low storage consumption. In this paper, we propose a Segment Similarity and Alignment
Network (SSAN) to address this challenge; it is the first to be trained end-to-end for
S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) An efficient
Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features,
(2) A robust Similarity Pattern Detection (SPD) module for temporal alignment. In
comparison with uniform frame extraction, SKE not only saves feature storage and search
time, but also achieves comparable accuracy with limited extra computation time.
In terms of temporal alignment, SPD localizes similar segments with higher accuracy
and efficiency than existing deep learning methods. Furthermore, we jointly train
SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key
modules SKE and SPD can also be effectively inserted into other video retrieval pipelines
and gain considerable performance improvements. Experimental results on public datasets
show that SSAN can obtain higher alignment accuracy while saving storage and online
query computational cost compared to existing methods.

Cut-Thumbnail: A Novel Data Augmentation for Convolutional Neural Network

Tianshu Xie
Xuan Cheng
Xiaomin Wang
Minghui Liu
Jiali Deng
Tao Zhou
Ming Liu

In this paper, we propose a novel data augmentation strategy named Cut-Thumbnail,
which aims to improve the shape bias of the network. We reduce an image to a certain
size and replace a random region of the original image with the reduced image. The
generated image not only retains most of the original image information but also has
global information in the reduced image. We call the reduced image a thumbnail. Furthermore,
we find that the idea of the thumbnail integrates well with Mixed Sample Data
Augmentation, so we put one image's thumbnail on another image while the ground truth
labels are also mixed, achieving strong results on various computer vision tasks.
Extensive experiments show that Cut-Thumbnail works better than state-of-the-art augmentation
strategies across classification, fine-grained image classification, and object detection.
On ImageNet classification, the ResNet-50 architecture with our method achieves 79.21%
accuracy, an improvement of more than 2.8% over the baseline.
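
The basic augmentation can be sketched in a few lines. This is an illustrative version only: it represents an image as a nested list of pixel values, uses nearest-neighbour downscaling, and picks the paste location uniformly at random, none of which is specified in the abstract. The function name `cut_thumbnail` and its parameters are hypothetical.

```python
import random


def cut_thumbnail(image, thumb_h, thumb_w, rng=None):
    """Paste a downscaled copy of `image` onto a random region of itself.

    image: list of rows, each a list of pixel values.
    Returns the augmented image and the (top, left) paste position.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if not (0 < thumb_h <= height and 0 < thumb_w <= width):
        raise ValueError("thumbnail must fit inside the image")

    # Nearest-neighbour downscale (an assumption; any resize would do).
    thumb = [
        [image[(i * height) // thumb_h][(j * width) // thumb_w]
         for j in range(thumb_w)]
        for i in range(thumb_h)
    ]

    # Choose where the thumbnail lands; randint bounds are inclusive.
    top = rng.randint(0, height - thumb_h)
    left = rng.randint(0, width - thumb_w)

    # Copy the rows so the input image is left untouched.
    out = [row[:] for row in image]
    for i in range(thumb_h):
        out[top + i][left:left + thumb_w] = thumb[i]
    return out, (top, left)
```

For the mixed-sample variant described in the abstract, the thumbnail would be computed from a second image and the two labels mixed; the mixing rule is not given in the abstract, so it is left out of this sketch.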

Diffusing the Liveness Cues for Face Anti-spoofing

Sheng Li
Xun Zhu
Guorui Feng
Xinpeng Zhang
Zhenxing Qian

Face anti-spoofing is an important step for secure face recognition. One of the main
challenges is how to learn and build a general classifier that is able to resist various
presentation attacks. Recently, patch-based face anti-spoofing schemes have been shown
to improve the robustness of the classifier. These schemes extract subtle
liveness cues from small local patches independently, and so do not fully exploit the
correlations among the patches. In this paper, we propose a Patch-based Compact Graph
Network (PCGN) to diffuse the subtle liveness cues from all the patches. Firstly,
the image is encoded into a compact graph by connecting each node with its backward
neighbors. We then propose an asymmetrical updating strategy to update the compact
graph. Such a strategy aggregates the node based on whether it is a sender or receiver,
which leads to better message-passing. The updated graph is eventually decoded for
making the final decision. We conduct the experiments on four public databases with
four intra-database protocols and eight cross-database protocols, the results of which
demonstrate the effectiveness of our PCGN for face anti-spoofing.

Co-Transport for Class-Incremental Learning

Da-Wei Zhou
Han-Jia Ye
De-Chuan Zhan

Traditional learning systems are trained in a closed world for a fixed number of classes,
and need datasets collected in advance. However, new classes often emerge in real-world
applications and should be learned incrementally. For example, in electronic commerce,
new types of products appear daily, and in a social media community, new topics emerge
frequently. Under such circumstances, incremental models should learn several new
classes at a time without forgetting. We find a strong correlation between old and
new classes in incremental learning, which can be applied to relate and facilitate
different learning stages mutually. As a result, we propose CO-transport for class
Incremental Learning (COIL), which learns to relate across incremental tasks with
the class-wise semantic relationship. In detail, co-transport has two aspects: prospective
transport tries to augment the old classifier with optimal transported knowledge as
fast model adaptation. Retrospective transport aims to transport new class classifiers
backward as old ones to overcome forgetting. With these transports, COIL efficiently
adapts to new tasks, and stably resists forgetting. Experiments on benchmark and real-world
multimedia datasets validate the effectiveness of our proposed method.

Skeleton-Contrastive 3D Action Representation Learning

Fida Mohammad Thoker
Hazel Doughty
Cees G. M. Snoek

This paper strives for self-supervised learning of a feature space suitable for skeleton-based
action recognition. Our proposal is built upon learning invariances to input skeleton
representations and various skeleton augmentations via a noise contrastive estimation.
In particular, we propose inter-skeleton contrastive learning, which learns from multiple
different input skeleton representations in a cross-contrastive manner. In addition,
we contribute several skeleton-specific spatial and temporal augmentations which further
encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning
similarities between different skeleton representations as well as augmented views
of the same sequence, the network is encouraged to learn higher-level semantics of
the skeleton data than when only using the augmented views. Our approach achieves
state-of-the-art performance for self-supervised learning from skeleton data on the
challenging PKU and NTU datasets with multiple downstream tasks, including action
recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.

Fast-forwarding, Rewinding, and Path Exploration in Interactive Branched Video Streaming

Albin Vogel
Erik Kronberg
Niklas Carlsson

With interactive branched video, the storyline is typically determined by branch choices
made by the user during playback. Despite putting users in control of their viewing
experiences, prior work has not considered how to best help users that may want to
quickly navigate, explore, or skip parts of the branched video. Such functionalities
are important for both impatient users and those rewatching the video. To address
this void, we present the design, implementation and evaluation of interface solutions
that help users effectively navigate the video, and identify and explore previously
unviewed storylines. Our solutions work with large, general video structures and allow
users to effectively forward/rewind the branched structures. Our user study demonstrates
the added value of our novel designs, presents promising tradeoffs, provides insights
into the pros/cons of different design alternatives, and highlights the features that
best address specific tasks and design aspects.

Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Yunzhong Hou
Liang Zheng

Multiview detection incorporates multiple camera views to deal with occlusions, and
its central problem is multiview aggregation. Given feature map projections from multiple
views onto a common ground plane, the state-of-the-art method addresses this problem
via convolution, which applies the same calculation regardless of object locations.
However, such translation-invariant behaviors might not be the best choice, as object
features undergo various projection distortions according to their positions and cameras.
In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly
introduced shadow transformer to aggregate multiview information. Unlike convolutions,
shadow transformer attends differently at different positions and cameras to deal
with various shadow-like distortions. We propose an effective training scheme that
includes a new view-coherent data augmentation method, which applies random augmentations
while maintaining multiview consistency. On two multiview detection benchmarks, we
report new state-of-the-art accuracy with the proposed system. Code is available at
https://github.com/hou-yz/MVDeTr.

Domain Generalization via Feature Variation Decorrelation

Chang Liu
Lichen Wang
Kai Li
Yun Fu

Domain generalization aims to learn a model that generalizes to unseen target domains
from multiple source domains. Various approaches have been proposed to address this
problem by adversarial learning, meta-learning, and data augmentation. However, those
methods have no guarantee for target domain generalization. Motivated by the observation
that the class-irrelevant information of a sample, in the form of semantic variation,
would lead to negative transfer, we propose to linearly disentangle the variation
out of the sample in feature space and impose a novel class decorrelation regularization
on the feature variation. By doing so, the model would focus on the high-level categorical
concept for model prediction while ignoring the misleading clue from other variations
(including domain changes). As a result, we achieve state-of-the-art performance
on all of the widely used domain generalization benchmarks, namely PACS, VLCS, Office-Home,
and Digits-DG, by large margins. Further analysis reveals our method could learn
a better domain-invariant representation, and decorrelated feature variation could
successfully capture semantic meaning.

Occlusion-aware Bi-directional Guided Network for Light Field Salient Object Detection

Dong Jing
Shuo Zhang
Runmin Cong
Youfang Lin

Existing light field based works utilize either views or focal stacks for saliency
detection. However, since depth information exists implicitly in adjacent views or
different focal slices, it is difficult to exploit scene depth information from both.
By comparison, Epipolar Plane Images (EPIs) provide explicit accurate scene depth
and occlusion information by projected pixel lines. Due to the fact that the depth
of an object is often continuous, the distribution of occlusion edges concentrates
more on object boundaries compared with traditional color edges, which is more beneficial
for improving accuracy and completeness of saliency detection. In this paper, we propose
a learning-based network to exploit occlusion features from EPIs and integrate high-level
features from the central view for accurate salient object detection. Specifically,
a novel Occlusion Extraction Module is proposed to extract occlusion boundary features
from horizontal and vertical EPIs. In order to naturally combine occlusion features
in EPIs and high-level features in central view, we design a concise Bi-directional
Guiding Flow based on cascaded decoders. The flow leverages generated salient edge
predictions and salient object predictions to refine features in mutual encoding processes.
Experimental results demonstrate that our approach achieves state-of-the-art performance
in both segmentation accuracy and edge clarity.

One-Stage Visual Grounding via Semantic-Aware Feature Filter

Jiabo Ye
Xin Lin
Liang He
Dingbang Li
Qin Chen

Visual grounding has attracted much attention with the popularity of vision language.
Existing one-stage methods are far ahead of two-stage methods in speed. However, these
methods fuse the textual feature and visual feature map by simple concatenation, which
ignores the textual semantics and limits these models' ability in cross-modal understanding.
To overcome this weakness, we propose a semantic-aware framework that utilizes both
queries' structured knowledge and context-sensitive representations to filter the
visual feature maps to localize the referents more accurately. Our framework contains
an entity filter, an attribute filter, and a location filter. These three filters
filter the input visual feature map step by step according to each query's aspects
respectively. A grounding module further regresses the bounding boxes to localize
the referential object. Experiments on various commonly used datasets show that our
framework achieves a real-time inference speed and outperforms all state-of-the-art
methods.

Few-Shot Multi-Agent Perception

Chenyou Fan
Junjie Hu
Jianwei Huang

We study few-shot learning (FSL) under multi-agent scenarios, in which participating
agents only have local scarce labeled data and need to collaborate to predict query
data labels. Though each of the agents, such as drones and robots, has minimal communication
and computation capability, we aim at designing coordination schemes such that they
can collectively perceive the environment accurately and efficiently. We propose a
novel metric-based multi-agent FSL framework which has three main components: an efficient
communication mechanism that propagates compact and fine-grained query feature maps
from query agents to support agents; an asymmetric attention mechanism that computes
region-level attention weights between query and support feature maps; and a metric-learning
module which calculates the image-level relevance between query and support data quickly
and accurately. Through analysis and extensive numerical studies, we demonstrate that
our approach can save communication and computation costs and significantly improve
performance in both visual and acoustic perception tasks such as face identification,
semantic segmentation, and sound genre recognition.
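
As a rough illustration of the metric-based idea only (not the authors' framework), the sketch below scores a query feature against each class's support features with cosine similarity and picks the best-scoring label. The function names and the label-to-features dictionary layout are hypothetical, and the communication and attention components described above are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_label(query_feat, support_feats):
    """Pick the label whose support features are, on average, most similar.

    support_feats maps label -> list of feature vectors (hypothetical layout).
    """
    scores = {
        label: sum(cosine(query_feat, f) for f in feats) / len(feats)
        for label, feats in support_feats.items()
    }
    return max(scores, key=scores.get), scores

support = {"cat": [[1.0, 0.0], [0.9, 0.1]], "dog": [[0.0, 1.0]]}
label, scores = predict_label([1.0, 0.05], support)
```

In a multi-agent setting, each support agent could compute such scores locally and return only the scores, which is one way to keep communication small.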

SI3DP: Source Identification Challenges and Benchmark for Consumer-Level 3D Printer
Forensics

Bo Seok Shim
Yoo Seung Shin
Seong Wook Park
Jong-Uk Hou

This paper lays the foundation for a new 3D content market by establishing a content
security framework using databases and benchmarks for in-depth research on source
identification of 3D printed objects. The proposed benchmark, SI3DP dataset, offers
a more generalized multimedia forensic technique. Assuming that the source of a 3D
printed object can be identified from various invisible traces left by the printing
process, we obtain close-up images and full object images of 252 printed objects from
18 different printing setups. We then propose a benchmark with five challenging tasks
such as device-level identification and scan-and-reprint detection using the provided
dataset. Our baseline shows that the printer type and its attributes can be identified
based on the microscopic difference of surface texture. Contrary to the conventional
belief that only microscopic views such as close-up images are useful for identifying
the printer model, we also achieve a certain level of performance even from a relatively
macroscopic point of view. We then propose a multitask-multimodal architecture for
the device-level identification task to exploit rich knowledge from different image modalities
and tasks. The SI3DP dataset can promote future in-depth research studies related to
digital forensics and intellectual property protection.

Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

Wen Wang
Yang Cao
Jing Zhang
Fengxiang He
Zheng-Jun Zha
Yonggang Wen
Dacheng Tao

Detection transformers have recently shown promising object detection results and
attracted increasing attention. However, how to develop effective domain adaptation
techniques to improve their cross-domain performance remains unexplored and unclear.
In this paper, we delve into this topic and empirically find that direct feature distribution
alignment on the CNN backbone only brings limited improvements, as it does not guarantee
domain-invariant sequence features in the transformer for prediction. To address this
issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially
designed for the adaptation of detection transformers. Technically, SFA consists of
a domain query-based feature alignment (DQFA) module and a token-wise feature alignment
(TDA) module. In DQFA, a novel domain query is used to aggregate and align global
context from the token sequence of both domains. DQFA reduces the domain discrepancy
in global feature representations and object relations when deploying in the transformer
encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence
from both domains, which reduces the domain gaps in local and instance-level feature
representations in the transformer encoder and decoder, respectively. Besides, a novel
bipartite matching consistency loss is proposed to enhance the feature discriminability
for robust object detection. Experiments on three challenging benchmarks show that
SFA outperforms state-of-the-art domain adaptive object detection methods. Code has
been made available at: https://github.com/encounter1997/SFA.

Towards Realistic Visual Dubbing with Heterogeneous Sources

Tianyi Xie
Liucheng Liao
Cheng Bi
Benlai Tang
Xiang Yin
Jianfei Yang
Mingjie Wang
Jiali Yao
Yang Zhang
Zejun Ma

The task of few-shot visual dubbing focuses on synchronizing the lip movements with
arbitrary speech input for any talking head video. Despite moderate improvements,
current approaches commonly require high-quality homologous video and audio data
sources and therefore fail to leverage heterogeneous data sufficiently.
In practice, it may be intractable to collect perfect homologous data in some
cases, for example, with audio-corrupted or blurry videos. To explore this kind
of data and support high-fidelity few-shot visual dubbing, in this paper, we
propose a simple yet efficient two-stage framework with greater flexibility in mining
heterogeneous data. Specifically, our two-stage paradigm employs facial landmarks
as intermediate prior of latent representations and disentangles the lip movements
prediction from the core task of realistic talking head generation. By this means,
our method makes it possible to independently utilize the training corpus for two-stage
sub-networks using more available heterogeneous data easily acquired. Besides, thanks
to the disentanglement, our framework allows a further fine-tuning for a given talking
head, thereby leading to better speaker-identity preserving in the final synthesized
results. Moreover, the proposed method can also transfer appearance features from
others to the target speaker. Extensive experimental results demonstrate the superiority
of our proposed method in generating highly realistic videos synchronized with the
speech over the state-of-the-art.

Deep Self-Supervised t-SNE for Multi-modal Subspace Clustering

Qianqian Wang
Wei Xia
Zhiqiang Tao
Quanxue Gao
Xiaochun Cao

Existing multi-modal subspace clustering methods, aiming to exploit the correlation
information between different modalities, have achieved promising preliminary results.
However, these methods might be incapable of handling real problems with complex heterogeneous
structures between different modalities, since the large heterogeneous structure makes
it difficult to directly learn a discriminative shared self-representation for multi-modal
clustering. To tackle this problem, in this paper, we propose a deep Self-supervised
t-SNE method (StSNE) for multi-modal subspace clustering, which learns soft label
features by multi-modal encoders and utilizes the common label feature to supervise
soft label feature of each modal by adversarial training and reconstruction networks.
Specifically, the proposed StSNE consists of four components: 1) multi-modal convolutional
encoders; 2) a self-supervised t-SNE module; 3) a self-expressive layer; 4) multi-modal
convolutional decoders. Multi-modal data are fed to encoders to obtain soft label
features, for which the self-supervised t-SNE module is added to make full use of
the label information among different modalities. Simultaneously, the latent representations
given by encoders are constrained by a self-expressive layer to capture the hierarchical
information of each modal, followed by decoders reconstructing the encoded features
to preserve the structure of the original data. Experimental results on several public
datasets demonstrate the superior clustering performance of the proposed method over
state-of-the-art methods.

Multimodal Video Summarization via Time-Aware Transformers

Xindi Shang
Zehuan Yuan
Anran Wang
Changhu Wang

With the growing number of videos on video sharing platforms, how to facilitate the
searching and browsing of user-generated videos has attracted intense attention
from the multimedia community. To help people efficiently search and browse relevant videos,
summaries of videos become important. The prior works in multimodal video summarization
mainly explore visual and ASR tokens as two separate sources and struggle to fuse
the multimodal information for generating the summaries. However, the time information
inside videos is commonly ignored. In this paper, we find that it is important to
leverage the timestamps to accurately incorporate multimodal signals for the task.
We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive
attention mechanism. The attention mechanism can attend to the inputs differently based
on time difference to explore the time information inherent in videos more thoroughly.
As such, TAMT can fuse the different modalities better for summarizing the videos.
Experiments show that our proposed approach is effective and achieves state-of-the-art
performance on both YouCookII and open-domain How2 datasets.
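
To make the idea of attending "differently based on time difference" concrete, here is a minimal sketch that assumes a simple additive penalty proportional to the absolute timestamp gap. The paper's short-term order-sensitive mechanism is likely different; the function name and the `decay` parameter are hypothetical.

```python
import math

def time_biased_attention(scores, query_time, key_times, decay=0.1):
    """Softmax over content scores minus a penalty that grows with time distance."""
    biased = [s - decay * abs(query_time - t) for s, t in zip(scores, key_times)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Three tokens with equal content scores; only their timestamps differ.
weights = time_biased_attention([1.0, 1.0, 1.0], 5.0, [4.0, 5.0, 20.0])
```

With equal content scores, the token closest in time to the query receives the largest weight and the most distant one the smallest.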

State-aware Video Procedural Captioning

Taichi Nishimura
Atsushi Hashimoto
Yoshitaka Ushiku
Hirotaka Kameko
Shinsuke Mori

Video procedural captioning (VPC), which generates procedural text from instructional
videos, is an essential task for scene understanding and real-world applications.
The main challenge of VPC is to describe how to manipulate materials accurately. This
paper focuses on this challenge by designing a new VPC task, generating a procedural
text from the clip sequence of an instructional video and material list. In this task,
the state of materials is sequentially changed by manipulations, yielding their state-aware
visual representations (e.g., eggs are transformed into cracked, stirred, then fried
forms). The essential difficulty is to convert such visual representations into textual
representations; that is, a model should track the material states after manipulations
to better associate the cross-modal relations. To achieve this, we propose a novel
VPC method, which modifies an existing textual simulator for tracking material states
as a visual simulator and incorporates it into a video captioning model. Our experimental
results show the effectiveness of the proposed method, which outperforms state-of-the-art
video captioning models. We further analyze the learned embedding of materials to
demonstrate that the simulators capture their state transition. The code and dataset
are available from https://github.com/misogil0116/svpc

AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

Woosung Choi
Minseok Kim
Marco A. Martínez Ramírez
Jaehwa Chung
Soonyoung Jung

This paper proposes a neural network that performs audio transformations to user-specified
sources (e.g., vocals) of a given audio track according to a given description while
preserving other sources not mentioned in the description. Audio Manipulation on a
Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample
or frequency bin) is 'transparent'; it usually carries information from multiple sources,
in contrast to a pixel in an image. To address this challenging problem, we propose
AMSS-Net, which extracts latent sources and selectively manipulates them while preserving
irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks,
and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective
metrics and empirical verification.

Fully Functional Image Manipulation Using Scene Graphs in A Bounding-Box Free Way

Sitong Su
Lianli Gao
Junchen Zhu
Jie Shao
Jingkuan Song

Recently, performing semantic editing of an image by modifying a scene graph has been
proposed to support high-level image manipulation, and plays an important role for
image generation. However, existing methods are all based on bounding boxes, and they
suffer from the bounding box constraint. First, a bounding box often involves other
instances (e.g., objects or environments) which do not need to be modified, but existing
methods manipulate all the contents included in the bounding box. Secondly, prior
methods fail to support adding instances when the bounding box of the target instance
cannot be provided. To address the two issues above, we propose a novel bounding box
free approach, which consists of two parts: a Local Bounding Box Free (Local-BBox-Free)
Mask Generation and a Global Bounding Box Free (Global-BBox-Free) Instance Generation.
The first part relieves the model of reliance on bounding boxes by generating the
mask of the target instance to be manipulated without using the target instance bounding
box. This enables our method to be the first to support fully functional image manipulation
using scene graphs, including adding, removing, replacing and repositioning instances.
The second part is designed to synthesize the target instance directly from the generated
mask and then paste it back to the inpainted original image using the generated mask,
which preserves the unchanged part to the largest extent and precisely controls the
target instance generation. Extensive experiments on Visual Genome and COCO-Stuff
demonstrate that our model significantly surpasses the state-of-the-art both quantitatively
and qualitatively.
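
The "paste it back ... using the generated mask" step can be pictured as ordinary per-pixel mask compositing. The sketch below is a generic illustration under that assumption, using nested lists as single-channel images; the function name is hypothetical and nothing here reproduces the authors' generation networks.

```python
def paste_with_mask(background, instance, mask):
    """Blend per pixel: mask=1 takes the instance, mask=0 keeps the background."""
    return [
        [m * i + (1 - m) * b for b, i, m in zip(b_row, i_row, m_row)]
        for b_row, i_row, m_row in zip(background, instance, mask)
    ]

inpainted = [[10, 10], [10, 10]]   # stand-in for the inpainted original image
generated = [[99, 99], [99, 99]]   # stand-in for the synthesized instance
mask = [[0, 1], [0, 0]]            # stand-in for the generated instance mask
result = paste_with_mask(inpainted, generated, mask)
```

Only pixels inside the mask change, which is why a tight mask preserves more of the original image than a bounding box would.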

Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Xi Zhang
Feifei Zhang
Changsheng Xu

Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs
to provide not only a correct answer, but also a rationale to justify the answer.
It is a challenging task due to the requirements of diverse visual content understanding,
abstract language comprehending, and complicated inter-modality relationship reasoning.
To solve the above challenges, previous methods either resort to holistic attention mechanisms
or explore transformer-based models with pre-training, which, however, cannot perform
comprehensive understanding and usually suffer from a heavy computing burden. In this
paper, we propose a novel multi-level counterfactual contrastive learning network
for VCR by jointly modeling the hierarchical visual contents and the inter-modality
relationships between the visual and linguistic domains. The proposed method enjoys
several merits. First, with sufficient instance-level, image-level, and semantic-level
contrastive learning, our model can extract discriminative features and perform comprehensive
understanding for the image and linguistic expressions. Second, taking advantage of
counterfactual thinking, we can generate informative factual and counterfactual samples
for contrastive learning, resulting in stronger perception ability of our model. Third,
an auxiliary contrast module is incorporated into our method to directly optimize
the answer prediction in VCR, which further facilitates the representation learning.
Extensive experiments on the VCR dataset demonstrate that our approach performs favorably
against the state-of-the-art methods.

Data-Free Ensemble Knowledge Distillation for Privacy-conscious Multimedia Model Compression

Zhiwei Hao
Yong Luo
Han Hu
Jianping An
Yonggang Wen

Recent advances in deep learning bring impressive performance for multimedia applications.
Hence, compressing and deploying these applications on resource-limited edge devices
via model compression becomes attractive. Knowledge distillation (KD) is one of the
most popular model compression techniques. However, most well-behaved KD approaches
require the original dataset, which is usually unavailable due to privacy issues,
while existing data-free KD methods perform much worse than data-required counterparts.
In this paper, we analyze previous data-free KD methods from the data perspective
and point out that using a single pre-trained model limits the performance of these
approaches. We then propose a Data-Free Ensemble knowledge Distillation (DFED) framework,
which contains a student network, a generator network, and multiple pre-trained teacher
networks. During training, the student mimics behaviors of the ensemble of teachers
using samples synthesized by a generator, which aims to enlarge the prediction discrepancy
between the student and teachers. A moment matching loss term assists the generator
training by minimizing the distance between activations of synthesized samples and
real samples. We evaluate DFED on three popular image classification datasets. Results
demonstrate that our method achieves significant performance improvements compared
with previous works. We also design an ablation study to verify the effectiveness
of each component of the proposed framework.
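
For intuition about the student side of such training, the following sketch computes a distillation loss against an ensemble of teachers. It assumes the ensemble target is the plain average of the teachers' softmax outputs and that the discrepancy is a KL divergence; both are illustrative assumptions, and the generator and moment matching terms are not modeled. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_kd_loss(student_logits, teacher_logits_list):
    """KL(ensemble || student), where the ensemble is the mean teacher distribution."""
    teacher_probs = [softmax(t) for t in teacher_logits_list]
    n = len(teacher_probs)
    ensemble = [sum(p[k] for p in teacher_probs) / n
                for k in range(len(student_logits))]
    student = softmax(student_logits)
    return sum(e * math.log(e / s) for e, s in zip(ensemble, student) if e > 0)

teachers = [[2.0, 0.0], [1.5, 0.5]]
loss = ensemble_kd_loss([0.0, 2.0], teachers)
```

The student would minimize this quantity, while a generator trained adversarially would look for synthetic inputs that make it large.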

SM-SGE: A Self-Supervised Multi-Scale Skeleton Graph Encoding Framework for Person
Re-Identification

Haocong Rao
Xiping Hu
Jun Cheng
Bin Hu

Person re-identification via 3D skeletons is an emerging topic with great potential
in security-critical applications. Existing methods typically learn body and motion
features from the body-joint trajectory, whereas they lack a systematic way to model
body structure and underlying relations of body components beyond the scale of body
joints. In this paper, we for the first time propose a Self-supervised Multi-scale
Skeleton Graph Encoding (SM-SGE) framework that comprehensively models human body,
component relations, and skeleton dynamics from unlabeled skeleton graphs of various
scales to learn an effective skeleton representation for person Re-ID. Specifically,
we first devise multi-scale skeleton graphs with coarse-to-fine human body partitions,
which enables us to model body structure and skeleton dynamics at multiple levels.
Second, to mine inherent correlations between body components in skeletal motion,
we propose a multi-scale graph relation network to learn structural relations between
adjacent body-component nodes and collaborative relations among nodes of different
scales, so as to capture more discriminative skeleton graph features. Last, we propose
a novel multi-scale skeleton reconstruction mechanism to enable our framework to encode
skeleton dynamics and high-level semantics from unlabeled skeleton graphs, which encourages
learning a discriminative skeleton representation for person Re-ID. Extensive experiments
show that SM-SGE outperforms most state-of-the-art skeleton-based methods. We further
demonstrate its effectiveness on 3D skeleton data estimated from large-scale RGB videos.
Our codes are open at https://github.com/Kali-Hac/SM-SGE.

Video Transformer for Deepfake Detection with Incremental Learning

Sohail Ahmed Khan
Hang Dai

Face forgery by deepfake is widely spread over the internet and this raises severe
societal concerns. In this paper, we propose a novel video transformer with incremental
learning for detecting deepfake videos. To better align the input face images, we
use a 3D face reconstruction method to generate UV texture from a single input face
image. The aligned face image can also provide pose, eye blink, and mouth movement
information that cannot be perceived in the UV texture image, so we use both face
images and their UV texture maps to extract the image features. We present an incremental
learning strategy to fine-tune the proposed model on a smaller amount of data and
achieve better deepfake detection performance. The comprehensive experiments on various
public deepfake datasets demonstrate that the proposed video transformer model with
incremental learning achieves state-of-the-art performance in the deepfake video detection
task with enhanced feature learning from the sequenced data.

Chinese Character Inpainting with Contextual Semantic Constraints

Jiahao Wang
Gang Pan
Di Sun
Jiawan Zhang

Chinese character inpainting is a challenging task where large missing regions have
to be filled with content that is both visually and semantically realistic. Existing methods
generally produce pseudo or ambiguous characters due to lack of semantic information.
Given the key observation that Chinese characters contain visual glyph representations
and intrinsic contextual semantics, we tackle the challenge of similar Chinese characters
by modeling the underlying regularities among glyph and semantic information. We propose
a semantics enhanced generative framework for Chinese character inpainting, where
a global semantic supervising module (GSSM) is introduced to constrain contextual
semantics. In particular, sentence embedding is used to guide the encoding of continuous
contextual characters. The method can not only generate realistic Chinese characters,
but also explicitly utilize context as reference during network training to eliminate
ambiguity. The proposed method is evaluated on both handwritten and printed Chinese
characters with various masks. The experiments show that the method successfully predicts
missing character information without any mask input, and achieves significant sentence-level
results benefiting from global semantic supervising in a wide variety of scenes.

Curriculum-Based Meta-learning

Ji Zhang
Jingkuan Song
Yazhou Yao
Lianli Gao

Meta-learning offers an effective solution to learn new concepts with scarce supervision
through an episodic training scheme: a series of target-like tasks sampled from base
classes are sequentially fed into a meta-learner to extract common knowledge across
tasks, which can facilitate the quick acquisition of task-specific knowledge of the
target task with few samples. Despite its noticeable improvements, the episodic training
strategy samples tasks randomly and uniformly, without considering their hardness
and quality, which may not progressively improve the meta-learner's generalization
ability. In this paper, we present a Curriculum-Based Meta-learning (CubMeta) method
to train the meta-learner using tasks from easy to hard. Specifically, CubMeta
proceeds progressively: in each step, we design a module named BrotherNet
to establish harder tasks and an effective learning scheme for obtaining an ensemble
of stronger meta-learners. In this way, the meta-learner's generalization ability
can be progressively improved, and better performance can be obtained even with fewer
training tasks. We evaluate our method for few-shot classification on two benchmarks
- mini-ImageNet and tiered-ImageNet, where it achieves consistent performance improvements
on various meta-learning paradigms.

Ego-Deliver: A Large-Scale Dataset For Egocentric Video Analysis

Haonan Qiu
Pan He
Shuchun Liu
Weiyuan Shao
Feiyun Zhang
Jiajun Wang
Liang He
Feng Wang

The egocentric video provides a unique view of event participants to show their attention,
vision, and interaction with objects. In this paper, we introduce Ego-Deliver, a new
large-scale egocentric video benchmark recorded by takeaway riders about their daily
work. To the best of our knowledge, Ego-Deliver presents the first attempt in understanding
activities from the takeaway delivery process while being one of the largest egocentric
video action datasets to date. Our dataset provides a total of 5,360 videos with more
than 139,000 multi-track annotations and 45 different attributes, which we believe
is pivotal to future research in this area. We introduce the FS-Net architecture,
a new anchor-free action detection approach handling extreme variations of action
durations. We partition videos into fragments and build dynamic graphs over fragments,
where multi-fragment context information is aggregated to boost fragment classification.
A splicing and scoring module is applied to obtain final action proposals. Our experimental
evaluation confirms that the proposed framework outperforms existing approaches on
the proposed Ego-Deliver benchmark and is competitive on other popular benchmarks.
In our current version, Ego-Deliver is used to make a comprehensive comparison between
algorithms for activity detection. We also show its application to action recognition
with promising results. The dataset, toolkits and baseline results will be made available
at: https://egodeliver.github.io/EgoDeliver_Dataset/

Adversarial Pixel Masking: A Defense against Physical Attacks for Pre-trained Object Detectors

Ping-Han Chiang
Chi-Shen Chan
Shan-Hung Wu

Object detection based on pre-trained deep neural networks (DNNs) has achieved impressive
performance and enabled many applications. However, DNN-based object detectors are
shown to be vulnerable to physical adversarial attacks. Although recent efforts
have been made to defend against these attacks, they either use strong assumptions
or become less effective with pre-trained object detectors. In this paper, we propose
adversarial pixel masking (APM), a defense against physical attacks, which is designed
specifically for pre-trained object detectors. APM does not require any assumptions
beyond the "patch-like" nature of a physical attack and can work with different pre-trained
object detectors of different architectures and weights, making it a practical solution
in many applications. We conduct extensive experiments, and the empirical results
show that APM can significantly improve model robustness without significantly degrading
clean performance.

Knowledge-Supervised Learning: Knowledge Consensus Constraints for Person Re-Identification

Li Wang
Baoyu Fan
Zhenhua Guo
Yaqian Zhao
Runze Zhang
Rengang Li
Weifeng Gong
Endong Wang

The consensus of multiple views on the same data will provide extra regularization,
thereby improving accuracy. Based on this idea, we propose a novel Knowledge-Supervised
Learning (KSL) method for person re-identification (Re-ID), which can improve
performance without introducing extra inference cost. First, we introduce an isomorphic
auxiliary training strategy to construct basic multiple views by simultaneously training
multiple classifier heads of the same network on the same training data. The consensus
constraints aim to maximize the agreement among multiple views. To introduce this
regularization constraint, we are inspired by knowledge distillation, in which paired branches can be
trained collaboratively through mutual imitation learning. Three novel constraint
losses are proposed to distill the knowledge that needs to be transferred across different
branches: similarity of predicted classification probability for cosine space constraints,
distance of embedding features for Euclidean space constraints, and hard sample mutual
mining for hard sample space constraints. From different perspectives, these losses
complement each other. Experiments on four mainstream Re-ID datasets show that a standard
model trained from scratch with the KSL method outperforms its ImageNet pre-trained counterpart
by a clear margin. With the KSL method, a lightweight model without ImageNet pre-training
outperforms most large models. We expect that these discoveries can shift some attention
from the current de facto paradigm of "pre-training and fine-tuning" in the Re-ID task
to knowledge discovery during model training.

View-normalized Skeleton Generation for Action Recognition

Qingzhe Pan
Zhifu Zhao
Xuemei Xie
Jianan Li
Yuhan Cao
Guangming Shi

Skeleton-based action recognition has attracted great interest due to the low cost of
skeleton data acquisition and high robustness to external conditions. A challenging
problem in skeleton-based action recognition is the large intra-class gap caused by
the various viewpoints of skeleton data, which makes action modeling difficult for
the network. To alleviate this problem, a feasible solution is to utilize label supervised
methods to learn a view-normalization model. However, since the skeleton data in real
scenes is acquired from diverse viewpoints, it is difficult to obtain the corresponding
view-normalized skeleton as a label. Therefore, how to learn a view-normalization model
without supervised labels is the key to solving the view-variance problem. To this
end, we propose a view normalization-based action recognition framework, which is
composed of a view-normalization generative adversarial network (VN-GAN) and a classification
network. For VN-GAN, the model is designed to learn the mapping from the diverse-view
distribution to the normalized-view distribution. In detail, it is implemented by graph
convolution, where the generator predicts the transformation angles for view normalization
and the discriminator distinguishes the real input samples from the generated ones. For the classification
network, view-normalized data is processed to predict the action class. Without the
interference of view variances, the classification network can extract more discriminative
action features. Furthermore, by combining the joint and bone modalities, the proposed
method reaches state-of-the-art performance on the NTU RGB+D and NTU-120 RGB+D datasets.
In particular, on NTU-120 RGB+D, the accuracy is improved by 3.2% and 2.3% under the cross-subject
and cross-set criteria, respectively.
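
Once transformation angles are available, applying them to a skeleton is a standard rotation of the joint coordinates. The sketch below shows only that step, for a single angle about the vertical axis; the angle is passed in by hand as a stand-in for a generator's prediction, and the function name and axis convention are assumptions.

```python
import math

def rotate_about_y(joints, angle_rad):
    """Rotate each (x, y, z) joint about the vertical y-axis by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]

# Two joints of a toy skeleton, turned by a quarter rotation.
skeleton = [(1.0, 2.0, 0.0), (0.0, 1.0, 1.0)]
normalized = rotate_about_y(skeleton, math.pi / 2)
```

Because a rotation preserves distances between joints, bone lengths are unchanged and only the viewpoint differs.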

Learning Hierarchical Embedding for Video Instance Segmentation

Zheyun Qin
Xiankai Lu
Xiushan Nie
Xiantong Zhen
Yilong Yin

In this paper, we address video instance segmentation using a new generative model
that learns effective representations of the target and background appearance. We
propose to exploit hierarchical structural embedding over spatio-temporal space, which
is compact, powerful, and flexible in contrast to current tracking-by-detection methods.
Specifically, our model segments and tracks instances across space and time in a single
forward pass, which is formulated as hierarchical embedding learning. The model is
trained to locate the pixels belonging to specific instances over a video clip. We
firstly take advantage of a novel mixing function to better fuse spatio-temporal embeddings.
Moreover, we introduce normalizing flows to further improve the robustness of the
learned appearance embedding, which theoretically extends conventional generative
flows to a factorized conditional scheme. Comprehensive experiments on the video instance
segmentation benchmark, i.e., YouTube-VIS, demonstrate the effectiveness of the proposed
approach. Furthermore, we evaluate our method on an unsupervised video object segmentation
dataset to demonstrate its generalizability.

Text as Neural Operator: Image Manipulation by Text Instruction

Tianhao Zhang
Hung-Yu Tseng
Lu Jiang
Weilong Yang
Honglak Lee
Irfan Essa

In recent years, text-guided image manipulation has gained increasing attention in
the multimedia and computer vision community. The input to conditional image generation
has evolved from image-only to multimodality. In this paper, we study a setting that
allows users to edit an image with multiple objects using complex text instructions
to add, remove, or change the objects. The inputs of the task are multimodal including
(1) a reference image and (2) an instruction in natural language that describes desired
modifications to the image. We propose a GAN-based method to tackle this problem.
The key idea is to treat text as neural operators to locally modify the image feature.
We show that the proposed model performs favorably against recent strong baselines
on three public datasets. Specifically, it generates images of greater fidelity and
semantic relevance, and when used as an image query, leads to better retrieval performance.

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Wenhao Wu
Yuxiang Zhao
Yanwu Xu
Xiao Tan
Dongliang He
Zhikang Zou
Jin Ye
Yingying Li
Mingde Yao
Zichao Dong
Yifeng Shi

Long-range and short-range temporal modeling are two complementary and crucial aspects
of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal
modeling and then average multiple snippet-level predictions to yield the final video-level
prediction. Thus, their video-level prediction does not consider spatio-temporal features
of how the video evolves along the temporal dimension. In this paper, we introduce a novel
Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. To
be more specific, we attempt to generate a dynamic kernel for a convolutional operation
to aggregate long-range temporal information among adjacent snippets adaptively. The
DSA module is an efficient plug-and-play module and can be combined with off-the-shelf
clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal
overhead. We coin the final video architecture DSANet. We conduct extensive experiments
on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something
V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit
various video recognition models significantly. For example, equipped with DSA modules,
the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400.
Codes are available at https://github.com/whwu95/DSANet.
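
The contrast with plain averaging of snippet predictions can be sketched as a temporal convolution over snippet features whose kernel is supplied at run time. In the toy code below the kernel logits are passed in by hand, standing in for a learned, input-dependent generator; the softmax normalization, edge clamping, and function name are assumptions for illustration, not details taken from the released code.

```python
import math

def aggregate_snippets(snippets, kernel_logits):
    """Temporal convolution over snippet features with a softmax-normalized kernel.

    kernel_logits would come from a learned, input-dependent generator
    (hypothetical here); snippets is a list of equal-length feature vectors.
    """
    peak = max(kernel_logits)
    exps = [math.exp(k - peak) for k in kernel_logits]
    total = sum(exps)
    kernel = [e / total for e in exps]
    half = len(kernel) // 2
    last = len(snippets) - 1
    out = []
    for t in range(len(snippets)):
        mixed = [0.0] * len(snippets[0])
        for j, w in enumerate(kernel):
            src = min(max(t + j - half, 0), last)  # clamp at sequence edges
            for c, value in enumerate(snippets[src]):
                mixed[c] += w * value
        out.append(mixed)
    return out

features = [[0.0], [3.0], [6.0]]            # one channel, three snippets
smoothed = aggregate_snippets(features, [0.0, 0.0, 0.0])
```

Equal logits reduce this to a moving average over neighboring snippets; an input-dependent generator would instead shift weight toward the neighbors that matter for a given video.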

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Yulin Li
Yuxi Qian
Yuechen Yu
Xiameng Qin
Chengquan Zhang
Yan Liu
Kun Yao
Junyu Han
Jingtuo Liu
Errui Ding

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part
of Document Intelligence. Due to the complexity of content and layout in VRDs, structured
text understanding has been a challenging task. Most existing studies decouple this
problem into two sub-tasks, entity labeling and entity linking, which require a thorough
understanding of the context of documents at both token and segment levels. However,
little work has addressed solutions that efficiently extract the structured
data from different levels. This paper proposes a unified framework named StrucTexT,
which is flexible and effective for handling both sub-tasks. Specifically, based on
the transformer, we introduce a segment-token aligned encoder to deal with the entity
labeling and entity linking tasks at different levels of granularity. Moreover, we
design a novel pre-training strategy with three self-supervised tasks to learn a richer
representation. StrucTexT uses the existing Masked Visual Language Modeling task and
the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate
the multi-modal information across text, image, and layout. We evaluate our method
for structured text understanding at the segment level and token level and show that it
significantly outperforms the state-of-the-art counterparts on the FUNSD,
SROIE, and EPHOIE datasets.

Local Graph Convolutional Networks for Cross-Modal Hashing

Yudong Chen
Sen Wang
Jianglin Lu
Zhi Chen
Zheng Zhang
Zi Huang

Cross-modal hashing aims to map the data of different modalities into a common binary
space to accelerate the retrieval speed. Recently, deep cross-modal hashing methods
have shown promising performance by applying deep neural networks to facilitate feature
learning. However, the known supervised deep methods mainly rely on the labeled information
of datasets, which is insufficient to characterize the latent structures that exist
among different modalities. To mitigate this problem, in this paper, we propose to
use Graph Convolutional Networks (GCNs) to exploit the local structure information
of datasets for cross-modal hash learning. Specifically, a local graph is constructed
according to the neighborhood relationships between samples in deep feature spaces
and fed into GCNs to generate graph embeddings. Then, a within-modality loss is designed
to measure the inner products between deep features and graph embeddings so that hashing
networks and GCNs can be jointly optimized. By taking advantage of GCNs to assist
the model's training, the performance of hashing networks can be improved. Extensive experiments
on benchmarks verify the effectiveness of the proposed method.

Metric Learning for Anti-Compression Facial Forgery Detection

Shenhao Cao
Qin Zou
Xiuqing Mao
Dengpan Ye
Zhongyuan Wang

Detecting facial forgery images and videos is an increasingly important topic in multimedia
forensics. As forgery images and videos are usually compressed into different formats
such as JPEG and H.264 when circulating on the Internet, existing forgery-detection
methods trained on uncompressed data often suffer from significant performance degradation
in identifying them. To solve this problem, we propose a novel anti-compression facial
forgery detection framework, which learns a compression-insensitive embedding feature
space utilizing both original and compressed forgeries. Specifically, our approach
consists of three ideas: (i) extracting compression-insensitive features from both
uncompressed and compressed forgeries using an adversarial learning strategy; (ii)
learning a robust partition by constructing a metric loss that can reduce the distance
of the paired original and compressed images in the embedding space; (iii) improving
the accuracy of tampered-region localization with an attention-transfer module. Experimental
results demonstrate that the proposed method is highly effective in handling both
compressed and uncompressed facial forgery images.
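
Idea (ii), a metric loss that pulls paired original and compressed images together in the embedding space, can be sketched as follows. This is a generic contrastive-style formulation, not the loss from the paper: the function names, the `margin` parameter, and the push term against an embedding of a different class are assumptions added for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paired_metric_loss(orig_emb, comp_emb, other_emb, margin=1.0):
    """Contrastive-style loss for one (original, compressed) pair.

    orig_emb:  embedding of an uncompressed image.
    comp_emb:  embedding of the compressed version of the same image.
    other_emb: embedding of an image from the other class (assumption:
               used here only to keep the two classes separated).
    """
    # Pull term: the pair should map to nearby points.
    pull = euclidean(orig_emb, comp_emb) ** 2
    # Push term (assumed): other-class samples should be at least
    # `margin` away from the original embedding.
    push = max(0.0, margin - euclidean(orig_emb, other_emb)) ** 2
    return pull + push
```

The loss is zero when the compressed image lands on its original and the other-class sample lies beyond the margin, which is the compression-insensitive behavior the abstract describes.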

ASFM-Net: Asymmetrical Siamese Feature Matching Network for Point Completion

Yaqi Xia
Yan Xia
Wei Li
Rui Song
Kailang Cao
Uwe Stilla

We tackle the problem of object completion from point clouds and propose a novel point
cloud completion network employing an Asymmetrical Siamese Feature Matching strategy,
termed ASFM-Net. Specifically, a Siamese auto-encoder neural network is adopted
to map the partial and complete input point clouds into a shared latent space, which
can capture a detailed shape prior. Then we design an iterative refinement unit to generate
complete shapes with fine-grained details by integrating prior information. Experiments
are conducted on the PCN dataset and the Completion3D benchmark, demonstrating the
state-of-the-art performance of the proposed ASFM-Net. Our method achieves first
place on the Completion3D leaderboard and outperforms existing methods by a large
margin, about 12%. The code and trained models are released publicly at https://github.com/Yan-Xia/ASFM-Net.

Capsule-based Object Tracking with Natural Language Specification

Ding Ma
Xiangqian Wu

Tracking with Natural-Language Specification (TNL) is a joint topic of vision and
natural language understanding with a wide range of applications. In previous works,
the communication between the two heterogeneous features of vision and language is mainly
through a simple dynamic convolution. However, the performance of prior works is capped
by the difficulty that the linguistic variation of natural language poses in modeling the
dynamically changing target and its surroundings. Meanwhile, natural language and vision
are first fused and then utilized for tracking, which makes it hard to model the query-focused
context. A query-focused approach should pay more attention to context modeling to promote the
correlation between these two features. To address these issues, we propose a capsule-based
network, referred to as CapsuleTNL, which performs regression tracking with a natural
language query. In the beginning, the visual and textual inputs are encoded with capsules,
which can establish not only the relationship between entities but also the relationship
between the parts of the entity itself. Then, we devise two interaction routing modules,
which consist of a visual-textual routing module to reduce the linguistic variation
of the input query and a textual-visual routing module to precisely incorporate query-based
visual cues simultaneously. To validate the potential of the proposed network for
visual object tracking, we evaluate our method on two large tracking benchmarks. The
experimental evaluation demonstrates the effectiveness of our capsule-based network.

Faster-PPN: Towards Real-Time Semantic Segmentation with Dual Mutual Learning for
Ultra-High Resolution Images

Bicheng Dai
Kaisheng Wu
Tong Wu
Kai Li
Yanyun Qu
Yuan Xie
Yun Fu

Despite recent progress on semantic segmentation, there still exist huge challenges
in semantic segmentation of high or ultra-high resolution images. Although the latest
collaborative global-local semantic segmentation methods such as GLNet [4] and PPN
[18] have achieved impressive results, they are inefficient and not fit for practical
applications. Thus, in this paper, we propose a novel and efficient collaborative
global-local framework on the basis of PPN, named Faster-PPN, for semantic segmentation
of high or ultra-high resolution images, which makes a better trade-off between
efficiency and effectiveness towards real-time speed. Specifically, we propose Dual
Mutual Learning to improve the feature representation of global and local branches,
which conducts knowledge distillation mutually between the global and local branches.
Furthermore, we design the Pixel Proposal Fusion Module to conduct a fine-grained
selection mechanism which further reduces the redundant pixels for fusion, resulting
in an improvement in inference speed. The experimental results on three challenging
high or ultra-high resolution datasets, DeepGlobe, ISIC and BACH, demonstrate that Faster-PPN
achieves the best performance on accuracy, inference speed and memory usage compared
with state-of-the-art approaches. In particular, our method achieves real-time and near
real-time speed with 36 FPS and 17.7 FPS on ISIC and DeepGlobe, respectively.

Distributed Attention for Grounded Image Captioning

Nenglun Chen
Xingjia Pan
Runnan Chen
Lei Yang
Zhiwen Lin
Yuqiang Ren
Haolei Yuan
Xiaowei Guo
Feiyue Huang
Wenping Wang

We study the problem of weakly supervised grounded image captioning. That is, given
an image, the goal is to automatically generate a sentence describing the context
of the image with each noun word grounded to the corresponding region in the image.
This task is challenging due to the lack of explicit fine-grained region word alignments
as supervision. Previous weakly supervised methods mainly explore various kinds of
regularization schemes to improve attention accuracy. However, their performance
is still far from that of the fully supervised ones. One main issue that has been ignored
is that the attention for generating visually groundable words may only focus on the
most discriminative parts and cannot cover the whole object. To this end, we propose
a simple yet effective method to alleviate this issue, termed the partial grounding
problem in our paper. Specifically, we design a distributed attention mechanism to
enforce the network to aggregate information from multiple spatially different regions
with consistent semantics while generating the words. Therefore, the union of the
focused region proposals should form a visual region that encloses the object of interest
completely. Extensive experiments have demonstrated the superiority of our proposed
method compared with state-of-the-art methods.
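
The claim that the union of the focused region proposals should enclose the object can be made concrete with a small geometric sketch. The box format `(x1, y1, x2, y2)` and both helper names are assumptions for illustration; the abstract does not specify how the union is computed.

```python
def union_box(boxes):
    """Smallest axis-aligned box enclosing all boxes given as (x1, y1, x2, y2)."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def encloses(outer, inner):
    """True if `outer` fully contains `inner` (both as (x1, y1, x2, y2))."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Two attended part-level proposals versus a hypothetical object box:
# neither part covers the object alone, but their union does.
parts = [(0, 0, 2, 2), (1, 1, 4, 3)]
target = (0, 0, 4, 3)
grounded = encloses(union_box(parts), target)
```

Attending to a single discriminative part would fail the `encloses` check here, which is the partial grounding problem the abstract names.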

Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation

Zhiwei Liu
Xiangyu Zhu
Lu Yang
Xiang Yan
Ming Tang
Zhen Lei
Guibo Zhu
Xuetao Feng
Yan Wang
Jinqiao Wang

3D human pose and shape recovery from a monocular RGB image is a challenging task.
Existing learning-based methods highly depend on weak supervision signals, e.g., 2D
and 3D joint locations, due to the lack of in-the-wild paired 3D supervision. However,
considering the 2D-to-3D ambiguities that exist in these weak supervision labels, the
network easily gets stuck in local optima when trained with such labels. In this
paper, we reduce the ambiguity by optimizing multiple initializations. Specifically,
we propose a three-stage framework named Multi-Initialization Optimization Network
(MION). In the first stage, we strategically select different coarse 3D reconstruction
candidates which are compatible with the 2D keypoints of the input sample. Each coarse
reconstruction can be regarded as an initialization that leads to one optimization branch.
In the second stage, we design a mesh refinement transformer (MRT) to refine
each coarse reconstruction result separately via a self-attention mechanism. Finally,
a Consistency Estimation Network (CEN) is proposed to find the best result from multiple
candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
Experiments demonstrate that our Multi-Initialization Optimization Network outperforms
existing 3D mesh based methods on multiple public benchmarks.

Feedback Network for Mutually Boosted Stereo Image Super-Resolution and Disparity
Estimation

Qinyan Dai
Juncheng Li
Qiaosi Yi
Faming Fang
Guixu Zhang

Under stereo settings, the problems of image super-resolution (SR) and disparity estimation
are interrelated in that the result of each problem could help to solve the other. The
effective exploitation of correspondence between different views facilitates the SR
performance, while the high-resolution (HR) features with richer details benefit the
correspondence estimation. Following this motivation, we propose a Stereo Super-Resolution
and Disparity Estimation Feedback Network (SSRDE-FNet), which simultaneously handles
stereo image super-resolution and disparity estimation in a unified framework
and lets them interact with each other to further improve their performance. Specifically,
the SSRDE-FNet is composed of two dual recursive sub-networks for left and right views.
Besides the cross-view information exploitation in the low-resolution (LR) space,
HR representations produced by the SR process are utilized to perform HR disparity
estimation with higher accuracy, through which the HR features can be aggregated to
generate a finer SR result. Afterward, the proposed HR Disparity Information Feedback
(HRDIF) mechanism delivers information carried by HR disparity back to previous layers
to further refine the SR image reconstruction. Extensive experiments demonstrate the
effectiveness and advancement of SSRDE-FNet.

Merging Multiple Template Matching Predictions in Intra Coding with Attentive Convolutional
Neural Network

Qijun Wang
Guodong Zheng

In intra coding, template matching prediction is an effective method to reduce the
non-local redundancy inside image content. However, the prediction indicated by the
best template matching is not always the actual best prediction. To solve this problem,
we propose a method which merges multiple template matching predictions through a
convolutional neural network with an attention module. The convolutional neural network
aims at exploring different combinations of the candidate template matching predictions,
and the attention module focuses on determining the most significant prediction candidate.
Besides, the spatial module in the attention mechanism can be utilized to model the relationship
between the original pixels in current block and the reconstructed pixels in adjacent
regions (template). Compared to the directional intra prediction and traditional template
matching prediction, our method can provide a unified framework to generate prediction
with high accuracy. The experimental results show that, compared with the averaging strategy,
the BD-rate reductions can reach up to 4.7%, 5.5% and 18.3% on the classic standard
sequences (classB-classF), SIQAD dataset (screen content), and Urban100 dataset (natural
scenes) respectively, while the average bit-rate savings are 0.5%, 2.7% and 1.8%, respectively.
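
The difference between the averaging baseline and an attention-weighted merge of candidate predictions can be sketched as below. This replaces the paper's convolutional network with one scalar score per candidate, so it only illustrates the weighting idea; the function name and the `scores` input are assumptions.

```python
import math


def merge_predictions(candidates, scores=None):
    """Merge N candidate prediction blocks into one block.

    candidates: list of N blocks, each a flat list of pixel values.
    scores: optional list of N attention logits (hypothetical; in the
            paper such weights would come from a learned network).
            If None, fall back to the plain averaging strategy.
    """
    n = len(candidates)
    if scores is None:
        weights = [1.0 / n] * n
    else:
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
    size = len(candidates[0])
    # Weighted sum of the candidates, pixel by pixel.
    return [sum(w * c[i] for w, c in zip(weights, candidates))
            for i in range(size)]
```

Averaging treats every template match as equally reliable, whereas a strongly peaked score vector lets the merge follow the most significant candidate.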

Camera-Agnostic Person Re-Identification via Adversarial Disentangling Learning

Hao Ni
Jingkuan Song
Xiaosu Zhu
Feng Zheng
Lianli Gao

Despite the success of single-domain person re-identification (ReID), current supervised
models degrade dramatically when deployed to unseen domains, mainly due to the discrepancy
across cameras. To tackle this issue, we propose an Adversarial Disentangling Learning
(ADL) framework to decouple camera-related and ID-related features, which can be readily
used for camera-agnostic person ReID. ADL adopts a discriminative approach instead of the
mainstream generative styles in disentangling methods, e.g., GAN- or VAE-based, because
for the person ReID task only the information to discriminate IDs is needed, and the additional
information needed to generate images is redundant and may be noisy. Specifically, our model
involves a feature separation module that encodes images into two separate feature
spaces and a disentangled feature learning module that performs adversarial training
to minimize mutual information. We design an effective solution to approximate and
minimize mutual information by transforming it into a discrimination problem. The
two modules are co-designed to obtain strong generalization ability using only
the source dataset. Extensive experiments on three public benchmarks show that our method
outperforms the state-of-the-art generalizable person ReID model by a large margin.
Our code is publicly available at https://github.com/luckyaci/ADL_ReID.

SESSION: Session 15: Best Paper Session

Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial
Affective Expression Learning

Uttaran Bhattacharya
Elizabeth Childs
Nicholas Rewkowski
Dinesh Manocha

We present a generative adversarial network to synthesize 3D pose sequences of co-speech
upper-body gestures with appropriate affective expressions. Our network consists of
two components: a generator to synthesize gestures from a joint embedding space of
features encoded from the input speech and the seed poses, and a discriminator to
distinguish between the synthesized pose sequences and real 3D pose sequences. We
leverage the Mel-frequency cepstral coefficients and the text transcript computed
from the input speech in separate encoders in our generator to learn the desired sentiments
and the associated affective cues. We design an affective encoder using multi-scale
spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based
affective features. We use our affective encoder in both our generator, where it learns
affective features from the seed poses to guide the gesture synthesis, and our discriminator,
where it enforces the synthesized gestures to contain the appropriate affective expressions.
We perform extensive evaluations on two benchmark datasets for gesture synthesis from
speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared
to the best baselines, we improve the mean absolute joint error by 10-33%, the mean
acceleration difference by 8-58%, and the Fréchet Gesture Distance by 21-34%. We also
conduct a user study and observe that compared to the best current baselines, around
15.28% of participants indicated our synthesized gestures appear more plausible, and
around 16.32% of participants felt the gestures had more appropriate affective expressions
aligned with the speech.

Video Background Music Generation with Controllable Music Transformer

Shangzhe Di
Zeren Jiang
Si Liu
Zhaokai Wang
Leyan Zhu
Zexin He
Hongming Liu
Shuicheng Yan

In this work, we address the task of video background music generation. Some previous
works achieve effective music generation but are unable to generate melodious music
specifically for a given video, and none of them considers the video-music rhythmic
consistency. To generate the background music that matches the given video, we first
establish the rhythmic relationships between video and background music. In particular,
we connect timing, motion speed, and motion saliency from video with beat, simu-note
density, and simu-note strength from music, respectively. We then propose CMT, a Controllable
Music Transformer that enables the local control of the aforementioned rhythmic features,
as well as the global control of the music genre and the used instrument specified
by users. Objective and subjective evaluations show that the generated background
music has achieved satisfactory compatibility with the input videos, and at the same
time, impressive music quality.

PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition

Zhi Qiao
Yu Zhou
Jin Wei
Wei Wang
Yuan Zhang
Ning Jiang
Hongbin Wang
Weiping Wang

Nowadays, scene text recognition has attracted more and more attention due to its
various applications. Most state-of-the-art methods adopt an encoder-decoder framework
with an attention mechanism, which generates text autoregressively from left to right.
Despite the convincing performance, the speed is limited because of the one-by-one
decoding strategy. As opposed to autoregressive models, non-autoregressive models
predict the results in parallel with a much shorter inference time, but the accuracy
falls behind the autoregressive counterpart considerably. In this paper, we propose
a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency.
Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster
and an iterative generation mechanism to make the predictions more accurate. In each
iteration, the context information is fully explored. To improve learning of the hidden
layer, we exploit mimicking learning in the training phase, where an additional
autoregressive decoder is adopted and the parallel decoder mimics the autoregressive
decoder by fitting the outputs of the hidden layer. With the shared backbone between
the two decoders, the proposed PIMNet can be trained end-to-end without pre-training.
During inference, the branch of the autoregressive decoder is removed for a faster
speed. Extensive experiments on public benchmarks demonstrate the effectiveness and
efficiency of PIMNet. Our code is available in the supplementary material.

Theophany: Multimodal Speech Augmentation in Instantaneous Privacy Channels

Abhishek Kumar
Tristan Braud
Lik Hang Lee
Pan Hui

Many factors affect speech intelligibility in face-to-face conversations. These factors
lead conversation participants to speak louder and more distinctively, exposing the
content to potential eavesdroppers. To address these issues, we introduce Theophany,
a privacy-preserving framework for augmenting speech. Theophany establishes ad-hoc
social networks between conversation participants to exchange contextual information,
improving speech intelligibility in real-time. At the core of Theophany, we develop
the first privacy perception model that assesses the privacy risk of a face-to-face
conversation based on its topic, location, and participants. This framework supports
the development of privacy-preserving applications for face-to-face conversation. We implement
the framework within a prototype system that augments the speaker's speech with real-life
subtitles to overcome the loss of contextual cues brought by mask-wearing and social
distancing during the COVID-19 pandemic. We evaluate Theophany through a user survey
and a user study on 53 and 17 participants, respectively. Theophany's privacy predictions
match the participants' privacy preferences with an accuracy of 71.26%. Users considered
Theophany to be useful to protect their privacy (3.88/5), easy to use (4.71/5), and
enjoyable to use (4.24/5). We also raise the question of demographic and individual
differences in the design of privacy-preserving solutions.

aBio: Active Bi-Olfactory Display Using Subwoofers for Virtual Reality

You-Yang Hu
Yao-Fu Jan
Kuan-Wei Tseng
You-Shin Tsai
Hung-Ming Sung
Jin-Yao Lin
Yi-Ping Hung

Including olfactory cues in virtual reality (VR) would enhance user immersion in the
virtual environment, and precise control of smell would facilitate a more realistic
experience for users. In this paper, we present aBio, an active bi-olfactory display
system that delivers scents precisely to specific locations rather than diffusing
scented air into the atmosphere. aBio provides users with a natural olfactory experience
in free air by colliding two vortex rings launched from dual speaker-based vortex
generators, which also has the effect of cushioning the force of air impact. According
to the various requests of different applications, the collision point of the vortex
rings can be positioned anywhere in front of the user's nose. To verify the effectiveness
of our device and understand user sensations when using different parameters in our
system, we conduct a series of experiments and user studies. The results show that
the proposed system is effective in the sense that users perceive smell without sensible
haptic disturbance while the system consumes only a very small amount of fragrant
essential oil. We believe that aBio has great potential for increasing the level of
presence in VR by delivering smells with high efficiency.

SESSION: Poster Session 3

Learning to Understand Traffic Signs

Yunfei Guo
Wei Feng
Fei Yin
Tao Xue
Shuqi Mei
Cheng-Lin Liu

One of the intelligent transportation system's critical tasks is to understand traffic
signs and convey traffic information to humans. However, most related works are focused
on the detection and recognition of traffic sign texts or symbols, which is not sufficient
for understanding. Besides, there has been no public dataset for traffic sign understanding
research. Our work takes the first step towards addressing this problem. First, we
propose a "CASIA-Tencent Chinese Traffic Sign Understanding Dataset" (CTSU Dataset),
which contains 5000 images of traffic signs with rich semantic descriptions. Second,
we introduce a novel multi-task learning architecture that extracts text and symbol
information from traffic signs, reasons about the relationship between texts and symbols,
classifies signs into different categories, and finally, composes the descriptions
of the signs. Experiments show that the task of traffic sign understanding is achievable,
and our architecture demonstrates state-of-the-art and superior performance. The CTSU
Dataset is available at http://www.nlpr.ia.ac.cn/databases/CASIA-Tencent%20CTSU/index.html.

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative
Adversarial Networks

Yanyuan Qiao
Qi Chen
Chaorui Deng
Ning Ding
Yuankai Qi
Mingkui Tan
Xincheng Ren
Qi Wu

Despite recent significant progress on generative models, context-rich text-to-image
synthesis depicting multiple complex objects is still non-trivial. The main challenges
lie in the ambiguous semantics of a complex description and the intricate scene of
an image with various objects, different positional relationships and diverse appearances.
To address these challenges, we propose R-GAN, which can generate reasonable images
according to the given text in a human-like way. Specifically, just as humans
first find and settle the essential elements to create a simple sketch, we first capture
a monolithic-structural text representation by building a scene graph to find the
essential semantic elements. Then, based on this representation, we design a bounding
box generator to estimate the layout with position and size of target objects, and
a following shape generator, which draws a fine-detailed shape for each object. Unlike
previous work, which only generates coarse shapes blindly, we introduce a coarse-to-fine
shape generator based on a shape knowledge base. Finally, to complete the image
synthesis, we propose a multi-modal geometry-aware spatially-adaptive generator conditioned
on the monolithic-structural text representation and the geometry-aware map of the
shapes. Extensive experiments on the real-world dataset MSCOCO show the superiority
of our method in terms of both quantitative and qualitative metrics.

Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection

Chen Zhang
Runmin Cong
Qinwei Lin
Lin Ma
Feng Li
Yao Zhao
Sam Kwong

The popularity and promotion of depth maps have brought new vigor and vitality into
salient object detection (SOD), and a mass of RGB-D SOD algorithms have been proposed,
mainly concentrating on how to better integrate cross-modality features from RGB image
and depth map. For the cross-modality interaction in feature encoder, existing methods
either indiscriminately treat RGB and depth modalities, or only habitually utilize
depth cues as auxiliary information of the RGB branch. Different from them, we reconsider
the status of two modalities and propose a novel Cross-modality Discrepant Interaction
Network (CDINet) for RGB-D SOD, which differentially models the dependence of two
modalities according to the feature representations of different layers. To this end,
two components are designed to implement the effective cross-modality interaction:
1) the RGB-induced Detail Enhancement (RDE) module leverages the RGB modality to enhance
the details of the depth features in the low-level encoder stage; 2) the Depth-induced
Semantic Enhancement (DSE) module transfers the object positioning and internal consistency
of depth features to the RGB branch in the high-level encoder stage. Furthermore, we also
design a Dense Decoding Reconstruction (DDR) structure, which constructs a semantic
block by combining multi-level encoder features to upgrade the skip connection in
the feature decoding. Extensive experiments on five benchmark datasets demonstrate
that our network outperforms 15 state-of-the-art methods both quantitatively and
qualitatively. Our code is publicly available at: https://rmcong.github.io/proj_CDINet.html.

Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes

Junda Wu
Tong Yu
Shuai Li

In vision-language retrieval systems, users provide natural language feedback to find
target images. Vision-language explanations in the systems can better guide users
to provide feedback and thus improve the retrieval. However, developing explainable
vision-language retrieval systems can be challenging, due to limited labeled multimodal
data. In the retrieval of complex scenes, the issue of limited labeled data can be
more severe. With multiple objects in the complex scenes, each user query may not
exhaustively describe all objects in the desired image and thus more labeled queries
are needed. The issue of limited labeled data can cause data selection biases, and
result in spurious correlations learned by the models. When learning spurious correlations,
existing explainable models may not be able to accurately extract regions from images
and keywords from user queries.

In this paper, we discover that deconfounded learning is an important step to provide
better vision-language explanations. Thus we propose a deconfounded explainable vision-language
retrieval system. By introducing deconfounded learning to pretrain our vision-language
model, the spurious correlations in the model can be reduced through interventions
by potential confounders. This helps to train more accurate representations and further
enable better explainability. Based on explainable retrieval results, we propose novel
interactive mechanisms. In such interactions, users can better understand why the
system returns particular results and give feedback effectively improving the results.
This additional feedback is sample efficient and thus alleviates the data limitation
problem. In extensive experiments, our system achieves about a 60% improvement
over the state of the art.

Long Short-term Convolutional Transformer for No-Reference Video Quality Assessment

Junyong You

No-reference video quality assessment (NR-VQA) has not widely benefited from deep learning,
mainly due to the complexity, diversity and particularity of modelling spatial and
temporal characteristics in quality assessment scenarios. Image quality assessment
(IQA) performed on video frames plays a key role in NR-VQA. A perceptual hierarchical
network (PHIQNet) with an integrated attention module is first proposed that can appropriately
simulate the visual mechanisms of contrast sensitivity and selective attention in
IQA. Subsequently, perceptual quality features of video frames derived from PHIQNet
are fed into a long short-term convolutional Transformer (LSCT) architecture to predict
the perceived video quality. LSCT consists of a CNN formulating quality features in
video frames within short-term units that are then fed into a Transformer to capture
the long-range dependence and attention allocation over temporal units. Such an architecture
is in line with the intrinsic properties of VQA. Experimental results on publicly
available video quality databases have demonstrated that the LSCT architecture based
on PHIQNet significantly outperforms state-of-the-art video quality models.

Automatic Channel Pruning with Hyper-parameter Search and Dynamic Masking

Baopu Li
Yanwen Fan
Zhihong Pan
Yuchen Bian
Gang Zhang

Modern deep neural network models tend to be large and computationally intensive.
One typical solution to this issue is model pruning. However, most current model pruning
algorithms depend on hand-crafted rules or need the pruning ratio as input beforehand.
To overcome this problem, we propose a learning-based automatic channel pruning algorithm
for deep neural networks, which is inspired by recent automatic machine learning (AutoML).
A two-objective pruning problem that aims for the weights and the remaining
channels for each layer is first formulated. An alternative optimization approach
is then proposed to derive the channel numbers and weights simultaneously. In the
process of pruning, we utilize a searchable hyper-parameter, remaining ratio, to denote
the number of channels in each convolution layer, and then a dynamic masking process
is proposed to describe the corresponding channel evolution. To adjust the trade-off
between accuracy of a model and the pruning ratio of floating point operations, a
new loss function is further introduced. Extensive experimental results on benchmark
datasets demonstrate that our scheme achieves competitive results for neural network
pruning.
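
As a rough illustration of the remaining-ratio idea described above, the following Python sketch builds a binary channel mask for one convolution layer from a given remaining ratio. The function name `channel_mask`, the per-channel importance scores, and the rounding rule are assumptions made for illustration; the abstract does not state how channels are ranked or how the mask evolves during training.

```python
def channel_mask(importance, remaining_ratio):
    """Return a 0/1 mask that keeps the top-scoring fraction of channels.

    importance: per-channel scores (hypothetical, e.g. filter norms).
    remaining_ratio: fraction of channels to keep, in (0, 1].
    """
    if not 0.0 < remaining_ratio <= 1.0:
        raise ValueError("remaining_ratio must be in (0, 1]")
    n = len(importance)
    # Keep at least one channel so the layer never collapses entirely.
    keep = max(1, round(n * remaining_ratio))
    # Indices of the `keep` highest-scoring channels.
    top = sorted(range(n), key=lambda i: importance[i], reverse=True)[:keep]
    kept = set(top)
    return [1 if i in kept else 0 for i in range(n)]


# Example: keep half of four channels.
mask = channel_mask([0.9, 0.1, 0.5, 0.7], remaining_ratio=0.5)
```

In a search setting, `remaining_ratio` would be the quantity being tuned per layer, and the mask would be recomputed as the scores change.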

SVHAN: Sequential View Based Hierarchical Attention Network for 3D Shape Recognition

Yue Zhao
Weizhi Nie
An-An Liu
Zan Gao
Yuting Su

As an important field of multimedia, 3D shape recognition has attracted much research
attention in recent years. A lot of deep learning models have been proposed for effective
3D shape representation. View-based methods show superiority due to the comprehensive
exploration of the visual characteristics with the help of established 2D CNN architectures.
Generally, current approaches have the following disadvantages. First, the
vast majority of methods do not consider the sequential information among the
multiple views, which can provide descriptive characteristics for shape representation.
Second, the incomplete exploration of multi-view correlations directly affects
the discrimination of shape descriptors. Finally, roughly aggregating multi-view features
leads to the loss of descriptive information, which limits the shape representation
effectiveness. To handle these issues, we propose a novel sequential view based hierarchical
attention network (SVHAN) for 3D shape recognition. Specifically, we first divide
the view sequence into several view blocks. Then, we introduce a novel hierarchical
feature aggregation module (HFAM), which hierarchically exploits the view-level, block-level,
and shape-level features; the intra- and inter-view-block correlations are also captured
to improve the discrimination of learned features. Subsequently, a novel selective
fusion module (SFM) is designed for feature aggregation, considering the correlations
between different levels and preserving effective information. Finally, discriminative
and informative shape descriptors are generated for the recognition task. We validate
the effectiveness of our proposed method on two public databases. The experimental
results show the superiority of SVHAN against the current state-of-the-art approaches.

ASFD: Automatic and Scalable Face Detector

Jian Li
Bin Zhang
Yabiao Wang
Ying Tai
Zhenyu Zhang
Chengjie Wang
Jilin Li
Xiaoming Huang
Yili Xia

Along with current multi-scale based detectors, Feature Aggregation and Enhancement
(FAE) modules have shown superior performance gains for cutting-edge object detection.
However, these hand-crafted FAE modules show inconsistent improvements on face detection,
which is mainly due to the significant distribution difference between its training
and application corpora, i.e. COCO vs. WIDER Face. To tackle this problem, we
analyse the effect of data distribution and consequently propose to search for an effective
FAE architecture, termed AutoFAE, by differentiable architecture search, which outperforms
all existing FAE modules in face detection by a considerable margin. Based on the found
AutoFAE and existing backbones, a supernet is further built and trained, which automatically
obtains a family of detectors under the different complexity constraints. Extensive
experiments conducted on popular benchmarks, i.e. WIDER Face and FDDB, demonstrate
the state-of-the-art performance-efficiency trade-off for the proposed automatic and
scalable face detector (ASFD) family. In particular, our strong ASFD-D6 outperforms
the best competitor with AP 96.7/96.2/92.1 on WIDER Face test, and the lightweight
ASFD-D0 costs about 3.1 ms, i.e. more than 320 FPS, on the V100 GPU with VGA-resolution
images.

BridgeNet: A Joint Learning Network of Depth Map Super-Resolution and Monocular Depth
Estimation

Qi Tang
Runmin Cong
Ronghui Sheng
Lingzhi He
Dan Zhang
Yao Zhao
Sam Kwong

Depth map super-resolution is a task with high practical application requirements
in the industry. Existing color-guided depth map super-resolution methods usually
necessitate an extra branch to extract high-frequency detail information from the RGB
image to guide the low-resolution depth map reconstruction. However, because there
are still some differences between the two modalities, direct information transmission
in the feature dimension or edge map dimension cannot achieve satisfactory results,
and may even trigger texture copying in areas where the structures of the RGB-D pair
are inconsistent. Inspired by multi-task learning, we propose a joint learning
network of depth map super-resolution (DSR) and monocular depth estimation (MDE) without
introducing additional supervision labels. For the interaction of two subnetworks,
we adopt a differentiated guidance strategy and design two bridges correspondingly.
One is the high-frequency attention bridge (HABdg) designed for the feature encoding
process, which learns the high-frequency information of the MDE task to guide the
DSR task. The other is the content guidance bridge (CGBdg) designed for the depth
map reconstruction process, which provides the content guidance learned from DSR task
for MDE task. The entire network architecture is highly portable and can provide a
paradigm for associating the DSR and MDE tasks. Extensive experiments on benchmark
datasets demonstrate that our method achieves competitive performance. Our code and
models are available at https://rmcong.github.io/proj_BridgeNet.html.

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Yuxi Li
Boshen Zhang
Jian Li
Yabiao Wang
Weiyao Lin
Chengjie Wang
Jilin Li
Feiyue Huang

In this paper, we place the atomic action detection problem into a Long-Short Term
Context (LSTC) to analyze how the temporal reliance among video signals affects the
action detection results. To do this, we decompose the action recognition pipeline
into short-term and long-term reliance, under the hypothesis that the two kinds
of context are conditionally independent given the objective action instance. Within
our design, a local aggregation branch is utilized to gather dense and informative
short-term cues, while a high-order long-term inference branch is designed to reason about
the objective action class from high-order interactions between the actor and other persons
or person pairs. Both branches independently predict the context-specific actions and
the results are merged in the end. We demonstrate that both temporal grains are beneficial
to atomic action recognition. On the mainstream benchmarks of atomic action detection,
our design can bring a significant performance gain over the existing state-of-the-art
pipeline.

UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation

Taehun Kim
Hyemin Lee
Daijin Kim

We propose Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation
which considers an uncertain area of the saliency map. We construct a modified version
of a U-Net-shaped network with an additional encoder and decoder, compute a saliency
map in each bottom-up stream prediction module, and propagate it to the next prediction
module. In each prediction module, the previously predicted saliency map is utilized to
compute foreground, background and uncertain area maps, and we aggregate the feature
map with the three area maps for each representation. Then we compute the relation between
each representation and each pixel in the feature map. We conduct experiments on five
popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and
CVC-300, and our method achieves state-of-the-art performance. In particular, we achieve
76.6% mean Dice on the ETIS dataset, which is a 13.8% improvement over the previous
state-of-the-art method. Source code is publicly available at https://github.com/plemeri/UACANet
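
To make the three area maps mentioned above more concrete, here is a minimal Python sketch of one plausible way to split a saliency map into foreground, background and uncertain maps. The specific formulas (distances from the 0.5 decision boundary) and the name `area_maps` are assumptions for illustration; the abstract does not give the exact definitions, and the linked repository should be treated as the authoritative source.

```python
def area_maps(saliency):
    """Split saliency probabilities into three per-pixel area maps.

    saliency: list of probabilities in [0, 1] (a flattened saliency map).
    Returns (foreground, background, uncertain) lists of the same length.
    """
    # Confidence that a pixel is foreground: how far p lies above 0.5.
    foreground = [max(p - 0.5, 0.0) for p in saliency]
    # Confidence that a pixel is background: how far p lies below 0.5.
    background = [max(0.5 - p, 0.0) for p in saliency]
    # Uncertainty peaks at p = 0.5 and vanishes at p = 0 or p = 1.
    uncertain = [0.5 - abs(p - 0.5) for p in saliency]
    return foreground, background, uncertain


fg, bg, un = area_maps([0.0, 0.5, 1.0])
```

Under this assumed definition, a pixel predicted at exactly 0.5 contributes only to the uncertain map, which is the region the method is described as paying extra attention to.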

Weight Evolution: Improving Deep Neural Networks Training through Evolving Inferior
Weight Values

Zhenquan Lin
Kailing Guo
Xiaofen Xing
Xiangmin Xu

To obtain good performance, convolutional neural networks are usually over-parameterized.
This phenomenon has stimulated two interesting topics: pruning the unimportant weights
for compression and reactivating the unimportant weights to make full use of network
capability. However, current weight reactivation methods usually reactivate the entire
filters, which may not be precise enough. Looking back in history, the prosperity
of filter pruning is mainly due to its friendliness to hardware implementation, but
pruning at a finer structure level, i.e., weight elements, usually leads to better
network performance. We study the problem of weight element reactivation in this paper.
Motivated by evolution, we select the unimportant filters and update their unimportant
elements by combining them with the important elements of important filters, just
like gene crossover to produce better offspring, and the proposed method is called
weight evolution (WE). WE is mainly composed of four strategies. We propose a global
selection strategy and a local selection strategy and combine them to locate the unimportant
filters. A forward matching strategy is proposed to find the matched important filters
and a crossover strategy is proposed to utilize the important elements of the important
filters for updating unimportant filters. WE can be plugged into existing network architectures.
Comprehensive experiments show that WE outperforms the other reactivation methods
and plug-in training methods with typical convolutional neural networks, especially
lightweight networks. Our code is available at https://github.com/BZQLin/Weight-evolution.

Coarse to Fine: Domain Adaptive Crowd Counting via Adversarial Scoring Network

Zhikang Zou
Xiaoye Qu
Pan Zhou
Shuangjie Xu
Xiaoqing Ye
Wenhao Wu
Jin Ye

Recent deep networks have convincingly demonstrated high capability in crowd counting,
which is a critical task attracting widespread attention due to its various industrial
applications. Despite such progress, trained data-dependent models usually cannot
generalize well to unseen scenarios because of the inherent domain shift. To alleviate
this issue, this paper proposes a novel adversarial scoring network (ASNet) to gradually
bridge the gap across domains from coarse to fine granularity. Specifically, at the
coarse-grained stage, we design a dual-discriminator strategy to adapt the source domain
to be close to the target from the perspectives of both global and local feature
space via adversarial learning. The distributions between two domains can thus be
aligned roughly. At the fine-grained stage, we explore the transferability of source
characteristics by scoring how similar the source samples are to target ones from
multiple levels based on generative probability derived from coarse stage. Guided
by these hierarchical scores, the transferable source features are properly selected
to enhance the knowledge transfer during the adaptation process. With the coarse-to-fine
design, the generalization bottleneck induced from the domain discrepancy can be effectively
alleviated. Three sets of migration experiments show that the proposed methods achieve
state-of-the-art counting performance compared with major unsupervised methods.

Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting

Qiming Wu
Zhikang Zou
Pan Zhou
Xiaoqing Ye
Binghui Wang
Ang Li

Crowd counting has drawn much attention due to its importance in safety-critical surveillance
systems. In particular, deep neural network (DNN) methods have significantly reduced
estimation errors for crowd counting tasks. However, recent studies have demonstrated that
DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible
perturbations could mislead DNNs to make false predictions. In this work, we propose
a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically
evaluate the robustness of crowd counting models, where the attacker's goal is to
create an adversarial perturbation that severely degrades their performances, thus
leading to public safety accidents (e.g., stampede accidents). Specifically, the proposed
attack leverages the extreme-density background information of input images to generate
robust adversarial patches via a series of transformations (e.g., interpolation, rotation,
etc.). We observe that by perturbing less than 6% of image pixels, our attacks severely
degrade the performance of crowd counting systems, both digitally and physically.
To better enhance the adversarial robustness of crowd counting models, we propose
the first regression model-based Randomized Ablation (RA), which is more effective
than Adversarial Training (ADT) (Mean Absolute Error of RA is 5 lower than ADT on
clean samples and 30 lower than ADT on adversarial examples). Extensive experiments
on five crowd counting models demonstrate the effectiveness and generality of the
proposed method.

Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for
Image-Text Matching

Pengpeng Zeng
Lianli Gao
Xinyu Lyu
Shuaiqi Jing
Jingkuan Song

Image-Text Matching (ITM) is a fundamental and emerging task, which plays a key role
in cross-modal understanding. It remains a challenge because prior works mainly focus
on learning fine-grained (i.e. coarse and/or phrase) correspondence, without considering
the syntactical correspondence. In theory, a sentence is not only a set of words or
phrases but also a syntactic structure, consisting of a set of basic syntactic tuples
(i.e. (attribute) object - predicate - (attribute) subject). Inspired by this, we propose
a Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency (CSCC)
for Image-Text Matching by simultaneously exploring the multiple-level cross-modal
alignments across the conceptual and syntactic levels with a consistency constraint. Specifically,
a conceptual-level cross-modal alignment is introduced for exploring the fine-grained
correspondence, while a syntactical-level cross-modal alignment is proposed to explicitly
learn a high-level syntactic similarity function. Moreover, an empirical cross-level
consistent attention loss is introduced to maintain the consistency between cross-modal
attentions obtained from the above two cross-modal alignments. To justify our method,
comprehensive experiments are conducted on two public benchmark datasets, i.e. MS-COCO
(1K and 5K) and Flickr30K, which show that our CSCC outperforms state-of-the-art methods
with fairly competitive improvements.

SSPU-Net: Self-Supervised Point Cloud Upsampling via Differentiable Rendering

Yifan Zhao
Le Hui
Jin Xie

Point clouds obtained from 3D sensors are usually sparse. Existing methods mainly
focus on upsampling sparse point clouds in a supervised manner by using dense ground
truth point clouds. In this paper, we propose a self-supervised point cloud upsampling
network (SSPU-Net) to generate dense point clouds without using ground truth. To achieve
this, we exploit the consistency between the input sparse point cloud and generated
dense point cloud for the shapes and rendered images. Specifically, we first propose
a neighbor expansion unit (NEU) to upsample the sparse point clouds, where the local
geometric structures of the sparse point clouds are exploited to learn weights for
point interpolation. Then, we develop a differentiable point cloud rendering unit
(DRU) as an end-to-end module in our network to render the point cloud into multi-view
images. Finally, we formulate a shape-consistent loss and an image-consistent loss
to train the network so that the shapes of the sparse and dense point clouds are as
consistent as possible. Extensive results on the CAD and scanned datasets demonstrate
that our method can achieve impressive results in a self-supervised manner.

VmAP: A Fair Metric for Video Object Detection

Anupam Sobti
Vaibhav Mavi
M Balakrishnan
Chetan Arora

Video object detection is the task of detecting objects in a sequence of frames, typically,
with a significant overlap in content among consecutive frames. Mean Average Precision
(mAP) was originally proposed for evaluating object detection techniques in independent
frames, but has been used for evaluating video based object detectors as well. This
is undesirable since the average precision over all frames masks the biases that a
certain object detector might have against certain types of objects depending on the
number of frames for which the object is present in a video sequence. In this paper
we show several disadvantages of mAP as a metric for evaluating video based object
detection. Specifically, we show that: (a) some object detectors could be severely
biased against some specific kind of objects, such as small, blurred, or low contrast
objects, and such differences may not reflect in mAP based evaluation, (b) operating
a video based object detector at the best frame based precision/recall value (high
F1 score) may lead to many false positives without a significant increase in the number
of objects detected, and (c) mAP does not take into account that tracking can potentially
be used to recover missed detections in the temporal neighborhood, although this can be
accounted for while evaluating detectors. As an alternative, we suggest a novel evaluation metric
(VmAP) which takes the focus away from evaluating detections on every frame. Unlike
mAP, VmAP rewards a high recall of different object views throughout the video. We
form sets of bounding boxes having similar views of an object in a temporal neighborhood
and use a set-level recall for evaluation. We show that VmAP is able to address all
the challenges with mAP listed above. Our experiments demonstrate hidden biases
in object detectors, show up to a 99% reduction in false positives while maintaining
similar object recall, and show a 9% improvement in correlation with post-tracking
performance.
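
The set-level recall described above can be sketched in a few lines of Python. This is a simplified illustration, not the VmAP definition: it assumes the sets of similar-view boxes have already been formed, and it assumes that a set counts as recalled when at least one of its boxes is matched by a detection. The name `set_level_recall` and the box-id representation are hypothetical.

```python
def set_level_recall(view_sets, detected_ids):
    """Fraction of view sets with at least one detected box.

    view_sets: list of sets of ground-truth box ids, one set per group of
        similar views of an object in a temporal neighborhood.
    detected_ids: set of ground-truth box ids matched by some detection.
    """
    if not view_sets:
        return 0.0
    # A set is "hit" if it shares at least one box id with the detections.
    hit = sum(1 for s in view_sets if s & detected_ids)
    return hit / len(view_sets)


# Three view sets covering six boxes; only boxes 2 and 4 are detected.
view_sets = [{1, 2, 3}, {4, 5}, {6}]
recall = set_level_recall(view_sets, detected_ids={2, 4})
```

In this toy case the per-box recall would be 2 of 6, while the set-level recall is 2 of 3, which shows how the metric stops penalizing missed frames that show a view already detected nearby.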

Source Data-free Unsupervised Domain Adaptation for Semantic Segmentation

Mucong Ye
Jing Zhang
Jinpeng Ouyang
Ding Yuan

Deep learning-based semantic segmentation methods require a huge amount of
training images with pixel-level annotations. Unsupervised domain adaptation (UDA)
for semantic segmentation enables transferring knowledge learned from the synthetic
data (source domain) with low-cost annotations to the real images (target domain).
However, current UDA methods mostly require full access to the source domain data
for feasible adaptation, which limits their applications in real-world scenarios with
privacy, storage, or transmission issues. To this end, this paper identifies and addresses
a more practical but challenging problem of UDA for semantic segmentation, where access
to the original source domain data is forbidden. In other words, only the pre-trained
source model and unlabelled target domain data are available for adaptation. To tackle
the problem, we propose to construct a set of source domain virtual data to mimic
the source domain distribution by identifying the target domain high-confidence samples
predicted by the pre-trained source model. Then by analyzing the data properties in
the cross-domain semantic segmentation tasks, we propose an uncertainty and prior
distribution-aware domain adaptation method to align the virtual source domain and
the target domain with both adversarial learning and self-training strategies. Extensive
experiments on three cross-domain semantic segmentation datasets with in-depth analyses
verify the effectiveness of the proposed method.
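
The idea of building virtual source data from high-confidence target samples can be illustrated with a small Python sketch. It is a deliberate simplification: it works on per-sample class probabilities rather than per-pixel segmentation maps, and the fixed threshold of 0.9 and the name `select_confident` are assumptions, since the abstract does not say how confidence is measured or thresholded.

```python
def select_confident(samples, threshold=0.9):
    """Return indices of samples whose top class probability meets the threshold.

    samples: list of per-sample class-probability lists, as predicted by
        the pre-trained source model on unlabelled target data.
    threshold: hypothetical confidence cut-off.
    """
    return [i for i, probs in enumerate(samples) if max(probs) >= threshold]


# Predictions for three target samples over two classes.
probs = [[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]]
virtual_source = select_confident(probs)
```

The selected indices would then stand in for the inaccessible source data during alignment, while the remaining samples are treated as the target domain.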

Yes, "Attention Is All You Need", for Exemplar based Colorization

Wang Yin
Peng Lu
Zhaoran Zhao
Xujun Peng

Conventional exemplar based image colorization tends to transfer colors from the reference
image only to the grayscale image, based on the semantic correspondence between them. However,
its practical capabilities are limited when semantic correspondence can hardly be
found. To overcome this issue, additional information, such as colors from a database,
is normally introduced. However, it is a great challenge to consider color information
from the reference image and the database simultaneously, because there is no unified framework
to model different color information and the multi-modal ambiguity in the database cannot
be removed easily. Also, it is difficult to fuse different color information effectively.
Thus, a general attention based colorization framework is proposed in this work, where
the color histogram of the reference image is adopted as a prior to eliminate the ambiguity
in the database. Moreover, a sparse loss is designed to guarantee the success of information
fusion. Both qualitative and quantitative experimental results show that the proposed
approach achieves better colorization performance compared with the state-of-the-art
methods on public databases with different quality metrics.

Heuristic Depth Estimation with Progressive Depth Reconstruction and Confidence-Aware
Loss

Jiehua Zhang
Liang Li
Chenggang Yan
Yaoqi Sun
Tao Shen
Jiyong Zhang
Zhan Wang

Recently, deep learning-based depth estimation has shown promising results, especially
with the help of sparse depth reference samples. Existing works focus on directly
inferring the depth information from sparse samples with high confidence. In this
paper, we propose a Heuristic Depth Estimation Network (HDEN) with progressive depth
reconstruction and confidence-aware loss. The HDEN leverages the reference samples
with low confidence to distill the spatial geometric and local semantic information
for dense depth prediction. Specifically, we first train a U-Net network to generate
a coarse-level dense reference map. Second, the progressive depth reconstruction module
successively reconstructs the fine-level dense depth map from different scales, where
a multi-level upsampling block is designed to recover the local structure of objects.
Finally, the confidence-aware loss is proposed to trigger the reference samples with
low confidence, which forces the model to focus on estimating the depth of tiny
structures. Extensive experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show
the effectiveness of our method. Visualization results demonstrate that the dense
depth maps generated by HDEN have better consistency at the entity edges with the RGB image.

Unsupervised Cross-Modal Distillation for Thermal Infrared Tracking

Jingxian Sun
Lichao Zhang
Yufei Zha
Abel Gonzalez-Garcia
Peng Zhang
Wei Huang
Yanning Zhang

The target representation learned by convolutional neural networks plays an important
role in Thermal Infrared (TIR) tracking. Currently, most of the top-performing TIR
trackers are still employing representations learned by the model trained on the RGB
data. However, this representation does not take into account the information in the
TIR modality itself, limiting the performance of TIR tracking.

To solve this problem, we propose to distill representations of the TIR modality from
the RGB modality with Cross-Modal Distillation (CMD) on a large amount of unlabeled
paired RGB-TIR data. We take advantage of the two-branch architecture of the baseline
tracker, i.e. DiMP, for cross-modal distillation working on two components of the
tracker. Specifically, we use one branch as a teacher module to distill the representation
learned by the model into the other branch. Benefiting from the powerful model in
the RGB modality, the cross-modal distillation can learn the TIR-specific representation
for promoting TIR tracking. The proposed approach can be incorporated into different
baseline trackers conveniently as a generic and independent component. Furthermore,
the semantic coherence of paired RGB and TIR images is utilized as a supervised signal
in the distillation loss for cross-modal knowledge transfer. In practice, three different
approaches are explored to generate paired RGB-TIR patches with the same semantics
for training in an unsupervised way. It is easy to extend to an even larger scale
of unlabeled training data. Extensive experiments on the LSOTB-TIR dataset and PTB-TIR
dataset demonstrate that our proposed cross-modal distillation method effectively
learns TIR-specific target representations transferred from the RGB modality. Our
tracker outperforms the baseline tracker by achieving absolute gains of 2.3% Success,
2.7% Precision, and 2.5% Normalized Precision respectively. Code and models are available
at https://github.com/zhanglichao/cmdTIRtracking.

ABPNet: Adaptive Background Modeling for Generalized Few Shot Segmentation

Kaiqi Dong
Wei Yang
Zhenbo Xu
Liusheng Huang
Zhidong Yu

Existing Few Shot Segmentation (FS-Seg) methods mostly study a restricted setting
where only foreground and background are required to be discriminated and fall short
at discriminating multiple classes. In this paper, we focus on a challenging but more
practical variant: Generalized Few Shot Segmentation (GFS-Seg), where all SEEN and
UNSEEN classes are segmented simultaneously. Previous methods treat the background
as a regular class, leading to difficulty in differentiating UNSEEN classes from it
at the test stage. To address this issue, we propose Adaptive Background Modeling
and Prototype Query Network (ABPNet), in which the background is formulated as the
complement of the set of classes of interest. With the help of an attention mechanism
and a novel meta-training strategy, it learns an effective set difference function
that predicts task-specific background adaptively. Furthermore, we design a Prototype
Querying (PQ) module that effectively transfers the learned knowledge to UNSEEN classes
with a neural dictionary. Experimental results demonstrate that ABPNet significantly
outperforms the state-of-the-art method CAPL on PASCAL-5i and COCO-20i, especially
on UNSEEN classes. Also, without retraining, ABPNet can generalize well to FS-Seg.

Towards Reasoning Ability in Scene Text Visual Question Answering

Qingqing Wang
Liqiang Xiao
Yue Lu
Yaohui Jin
Hao He

Works on scene text visual question answering (TextVQA) always emphasize the importance
of reasoning questions and image contents. However, we find current TextVQA models
lack reasoning ability and tend to answer questions by exploiting dataset bias and
language priors. Moreover, our observations indicate that recent accuracy improvement
in TextVQA is mainly contributed by stronger OCR engines, better pre-training strategies
and more Transformer layers, instead of newly proposed networks. In this work, towards
the reasoning ability, we 1) conduct module-wise contribution analysis to quantitatively
investigate how existing works improve accuracies in TextVQA; 2) design a gradient-based
explainability method to explore why TextVQA models answer what they answer and find
evidence for their predictions; 3) perform qualitative experiments to visually analyze
models' reasoning ability and explore potential reasons behind such a poor ability.

Multi-caption Text-to-Face Synthesis: Dataset and Algorithm

Jianxin Sun
Qi Li
Weining Wang
Jian Zhao
Zhenan Sun

Text-to-Face synthesis with multiple captions is still an important yet less addressed
problem because of the lack of effective algorithms and large-scale datasets. We accordingly
propose a Semantic Embedding and Attention (SEA-T2F) network that allows multiple
captions as input to generate highly semantically related face images. With a novel
Sentence Features Injection Module, SEA-T2F can integrate any number of captions into
the network. In addition, an attention mechanism named Attention for Multiple Captions
is proposed to fuse multiple word features and synthesize fine-grained details. Considering
text-to-face generation is an ill-posed problem, we also introduce an attribute loss
to guide the network to generate sentence-related attributes. Existing datasets for
text-to-face are either too small or roughly generated according to attribute labels,
which is not enough to train deep learning based methods to synthesize natural face
images. Therefore, we build a large-scale dataset named CelebAText-HQ, in which each
image is manually annotated with 10 captions. Extensive experiments demonstrate the
effectiveness of our algorithm.

Multimodal Compatibility Modeling via Exploring the Consistent and Complementary Correlations

Weili Guan
Haokun Wen
Xuemeng Song
Chung-Hsing Yeh
Xiaojun Chang
Liqiang Nie

Existing methods towards outfit compatibility modeling seldom explicitly consider
multimodal correlations. In this work, we explore the consistent and complementary
correlations for better compatibility modeling. This is, however, non-trivial due
to the following challenges: 1) how to separate and model these two kinds of correlations;
2) how to leverage the derived complementary cues to strengthen the text and vision-oriented
representations of the given item; and 3) how to reinforce the compatibility modeling
with text and vision-oriented representations. To address these challenges, we present
a comprehensive multimodal outfit compatibility modeling scheme. It first nonlinearly
projects each modality into separable consistent and complementary spaces via multi-layer
perceptron, and then models the consistent and complementary correlations between
two modalities by parallel and orthogonal regularization. Thereafter, we strengthen
the visual and textual representation of items with complementary information, and
further induct both the text-oriented and vision-oriented outfit compatibility modeling.
We ultimately employ the mutual learning strategy to reinforce the final performance
of compatibility modeling. Extensive experiments demonstrate the superiority of our
scheme.

CDD: Multi-view Subspace Clustering via Cross-view Diversity Detection

Shudong Huang
Ivor W. Tsang
Zenglin Xu
Jiancheng Lv
Quanhui Liu

The goal of multi-view subspace clustering is to explore a common latent space in which
the multi-view data points lie. Myriads of subspace learning algorithms have
been investigated to boost the performance of multi-view clustering, but they seldom exploit
both the multi-view consistency and multi-view diversity, let alone take them into
consideration simultaneously. To do so, we propose a novel multi-view subspace clustering
method via cross-view diversity detection (CDD). CDD is able to integrate these two complementary
criteria seamlessly into a holistic design of clustering algorithms. With the consistent
part and diverse part being detected, a pure graph can be derived for each view. The
consistent pure parts of different views are further fused to a consensus structured
graph with exactly k connected components where k is the number of clusters. Thus
we can directly obtain the final clustering result without any postprocessing as each
connected component precisely corresponds to an individual cluster. We model the above
concerns into a unified optimization framework. Our empirical studies validate that
the proposed model outperforms several other state-of-the-art methods.
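
The claim that no postprocessing is needed once the consensus graph has exactly k connected components can be illustrated as follows. The Python sketch below only shows the final read-out step, assigning one cluster label per connected component with a union-find structure; it does not implement the CDD optimization itself, and the function name and edge-list input are assumptions for illustration.

```python
def components_to_labels(n, edges):
    """Label each of n nodes by the connected component it belongs to.

    n: number of data points (graph nodes).
    edges: list of (a, b) pairs with nonzero weight in the consensus graph.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Map each root to a consecutive cluster index in order of appearance.
    roots = {}
    labels = []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels


# Five points whose graph has two connected components, so k = 2.
labels = components_to_labels(5, [(0, 1), (1, 2), (3, 4)])
```

Because every component maps directly to one cluster, no separate step such as k-means on an embedding is required after the graph is learned.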

Learning Spatio-temporal Representation by Channel Aliasing Video Perception

Yiqi Lin
Jinpeng Wang
Manlin Zhang
Andy J. Ma

In this paper, we propose a novel pretext task, namely Channel Aliasing Video Perception
(CAVP), for self-supervised video representation learning. The main idea of our approach
is to generate channel aliasing videos, which carry different motion cues simultaneously
by assembling distinct channels from different videos. With the generated channel
aliasing videos, we propose to recognize the number of different motion flows within
a channel aliasing video for perception of discriminative motion cues. As a plug-and-play
method, the proposed pretext task can be integrated into a co-training framework with
other self-supervised learning methods to further improve the performance. Experimental
results on publicly available action recognition benchmarks verify the effectiveness
of our method for spatio-temporal representation learning.
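
One possible reading of the channel aliasing step is sketched below, assuming videos are nested lists indexed as video[t][y][x][c] and share the same shape. The function name `channel_alias` and the choice of using the number of distinct source videos as the pretext label are assumptions made for illustration, not details confirmed by the abstract.

```python
def channel_alias(videos, channel_sources):
    """Assemble a channel aliasing video from several source videos.

    videos: list of videos, each indexed as video[t][y][x][c];
        all videos are assumed to have the same shape.
    channel_sources: for each output channel c, the index of the
        video whose channel c is copied into the output.
    Returns (aliased_video, num_motion_flows).
    """
    ref = videos[channel_sources[0]]
    aliased = [
        [
            [
                [videos[src][t][y][x][c]
                 for c, src in enumerate(channel_sources)]
                for x in range(len(ref[0][0]))
            ]
            for y in range(len(ref[0]))
        ]
        for t in range(len(ref))
    ]
    # Assumed pretext label: how many distinct videos contributed channels.
    num_flows = len(set(channel_sources))
    return aliased, num_flows


# Two hypothetical one-frame, one-pixel RGB videos.
video_a = [[[[1, 2, 3]]]]
video_b = [[[[10, 20, 30]]]]
mixed, label = channel_alias([video_a, video_b], [0, 1, 0])
```

Here the red and blue channels come from `video_a` and the green channel from `video_b`, so the aliased clip carries two motion sources.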

Efficient Sparse Attacks on Videos using Reinforcement Learning

Huanqian Yan
Xingxing Wei

More and more deep neural network models have been deployed in real-time video systems.
However, deep models have been shown to be susceptible to crafted adversarial
examples. These examples are imperceptible yet make normal deep
models misclassify them. Although a few works address adversarial
examples for video recognition in the black-box attack mode, most of them need large
perturbations or hundreds of thousands of queries. There is still a lack of effective
adversarial methods that produce adversarial videos with small perturbations and limited
query numbers at the same time.

In this paper, an efficient and powerful method is proposed for adversarial video
attacks in the black-box attack mode. The proposed method is based on Reinforcement
Learning (RL) like the previous work, i.e. using the agent in RL to adaptively find
the sparse key frames to add perturbations. The key difference is that we design the
new reward functions based on the loss reduction and the perturbation increment, and
thus propose an efficient update mechanism to guide the agent to finish the attacks
with smaller perturbations and fewer query numbers. The proposed algorithm has a new
working mechanism. It is simple, efficient, and effective. Extensive experiments show
our method has a good trade-off between the perturbation amplitude and the query numbers.
Compared with the state-of-the-art algorithms, it reduces query numbers by 65.75%
without image quality loss in un-targeted attacks, and simultaneously reduces perturbations
by 22.47% and query numbers by 54.77% in targeted attacks.

AdvHash: Set-to-set Targeted Attack on Deep Hashing with One Single Adversarial Patch

Shengshan Hu
Yechao Zhang
Xiaogeng Liu
Leo Yu Zhang
Minghui Li
Hai Jin

In this paper, we propose AdvHash, the first targeted mismatch attack on deep hashing
through an adversarial patch. After being superimposed with the same adversarial patch, any
query image with a chosen label will retrieve a set of irrelevant images with the
target label. Concretely, we first formulate a set-to-set problem, where a set of
samples are pushed into a predefined clustered area in the Hamming space. Then we
obtain a target anchor hash code and transform the attack to a set-to-point optimization.
In order to generate an image-agnostic stable adversarial patch for a chosen label
more efficiently, we propose a product-based weighted gradient aggregation strategy
to dynamically adjust the gradient directions of the patch, by exploiting the Hamming
distances between training samples and the target anchor hash code and assigning different
weights to discriminatively aggregate gradients. Extensive experiments on benchmark
datasets verify that AdvHash is highly effective at attacking two state-of-the-art
deep hashing schemes. Our codes are available at: https://github.com/CGCL-codes/AdvHash.

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Dailan He
Yusheng Zhao
Junyu Luo
Tianrui Hui
Shaofei Huang
Aixi Zhang
Si Liu

Recently proposed fine-grained 3D visual grounding is an essential and challenging
task, whose goal is to identify the 3D object referred to by a natural language sentence
from other distractive objects of the same category. Existing works usually adopt
dynamic graph networks to indirectly model the intra/inter-modal interactions, making
the model difficult to distinguish the referred object from distractors due to the
monolithic representations of visual and linguistic contents. In this work, we exploit
Transformer for its natural suitability for permutation-invariant 3D point cloud data
and propose a TransRefer3D network to extract entity-and-relation aware multimodal
context among objects for more discriminative feature learning. Concretely, we devise
an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to
conduct fine-grained cross-modal feature matching. Facilitated by co-attention operation,
our EA module matches visual entity features with linguistic entity features while
RA module matches pair-wise visual relation features with linguistic relation features,
respectively. We further integrate EA and RA modules into an Entity-and-Relation aware
Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical
multimodal context modeling. Extensive experiments on both Nr3D and Sr3D datasets
demonstrate that our proposed model significantly outperforms existing approaches
by up to 10.6% and claims the new state-of-the-art performance. To the best of our
knowledge, this is the first work investigating Transformer architecture for fine-grained
3D visual grounding task.

Single Image 3D Object Estimation with Primitive Graph Networks

Qian He
Desen Zhou
Bo Wan
Xuming He

Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem
in visual scene understanding and yet remains challenging due to its ill-posed nature
and complexity in real-world scenes. To address those challenges, we adopt a primitive-based
representation for 3D objects, and propose a two-stage graph network for primitive-based
3D object estimation, which consists of a sequential proposal module and a graph reasoning
module. Given a 2D image, our proposal module first generates a sequence of 3D primitives
from input image with local feature attention. Then the graph reasoning module performs
joint reasoning on a primitive graph to capture the global shape context for each
primitive. Such a framework is capable of taking into account rich geometry and semantic
constraints during 3D structure recovery, producing 3D objects with more coherent
structure even under challenging viewing conditions. We train the entire graph neural
network in a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet
and NYU Depth V2. Extensive experiments show that our approach outperforms the previous
state of the art by a considerable margin.

Boosting Mobile CNN Inference through Semantic Memory

Yun Li
Chen Zhang
Shihao Han
Li Lyna Zhang
Baoqun Yin
Yunxin Liu
Mengwei Xu

Human brains are known to be capable of speeding up visual recognition of repeatedly
presented objects through faster memory encoding and accessing procedures on activated
neurons. For the first time, we borrow and distill such a capability into a semantic
memory design, namely SMTM, to improve on-device CNN inference. SMTM employs a hierarchical
memory architecture to leverage the long-tail distribution of objects of interest,
and further incorporates several novel techniques to put it into effect: (1) it encodes
high-dimensional feature maps into low-dimensional, semantic vectors for low-cost
yet accurate cache and lookup; (2) it uses a novel metric in determining the exit
timing considering different layers' inherent characteristics; (3) it adaptively adjusts
the cache size and semantic vectors to fit the scene dynamics. SMTM is prototyped
on a commodity CNN engine and runs on both mobile CPU and GPU. Extensive experiments
on large-scale datasets and models show that SMTM can significantly speed up model
inference over the standard approach (up to 2×) and prior cache designs (up to 1.5×),
with acceptable accuracy loss.
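
The cache-and-lookup idea in point (1) can be illustrated with a minimal sketch. A plain cosine-similarity threshold stands in here for the exit metric, which the abstract does not specify; the names `cosine`, `cache_lookup`, the toy cache, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cache_lookup(vector, cache, threshold=0.9):
    """Return the cached label whose semantic vector best matches
    `vector`, or None when no entry is similar enough, in which case
    inference would continue through the remaining layers."""
    best_label, best_sim = None, -1.0
    for label, cached_vector in cache.items():
        sim = cosine(vector, cached_vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None


# Hypothetical low-dimensional semantic vectors for two cached classes.
cache = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
hit = cache_lookup([0.99, 0.05], cache)   # close to "cat": early exit
miss = cache_lookup([1.0, 1.0], cache)    # ambiguous: keep computing
```

A hit lets the model skip its remaining layers, which is where the speedup would come from; a miss falls back to full inference.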

Knowing When to Quit: Selective Cascaded Regression with Patch Attention for Real-Time
Face Alignment

Gil Shapira
Noga Levy
Ishay Goldin
Roy J. Jevnisek

Facial landmark (FLM) estimation is a critical component in many face-related applications.
In this work, we aim to optimize for both accuracy and speed and explore the trade-off
between them. Our key observation is that not all faces are created equal. Frontal
faces with neutral expressions converge faster than faces with extreme poses or expressions.
To differentiate among samples, we train our model to predict the regression error
after each iteration. If the current iteration is accurate enough, we stop iterating,
saving redundant iterations while keeping the accuracy in check. We also observe that
as neighboring patches overlap, we can infer all facial landmarks (FLMs) with only
a small number of patches without a major accuracy sacrifice. Architecturally, we
offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained
local patch attention module, which computes a patch weighting according to the information
in the patch itself and enhances the expressive power of the patch features. We analyze
the patch attention data to infer where the model is attending when regressing facial
landmarks and compare it to face attention in humans. Our model runs in real-time
on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming
all state-of-the-art methods under 1000 MMA, with a normalized mean error of 8.16
on the 300W challenging dataset. The code is available at https://github.com/ligaripash/MuSiCa
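
The early-stopping rule described above (predict the regression error after each iteration and stop once it is small enough) can be sketched as a short loop. The function `selective_cascade`, the stage interface, and the toy stage below are assumptions for illustration and are not taken from the released code.

```python
def selective_cascade(initial, stages, error_threshold):
    """Run cascaded regression stages with early exit.

    stages: list of callables; each maps the current estimate to a
        pair (refined_estimate, predicted_error).
    Returns (final_estimate, number_of_stages_used).
    """
    estimate = initial
    used = 0
    for stage in stages:
        estimate, predicted_error = stage(estimate)
        used += 1
        if predicted_error <= error_threshold:
            break  # Accurate enough: skip the remaining stages.
    return estimate, used


def toy_stage(estimate, target=8.0):
    """Hypothetical stage: halves the distance to a fixed target and
    reports the remaining distance as its predicted error."""
    refined = estimate + (target - estimate) / 2
    return refined, abs(target - refined)


result = selective_cascade(0.0, [toy_stage] * 5, error_threshold=1.0)
```

In this toy run the loop stops after three of the five available stages, which mirrors how easy samples would consume fewer iterations than hard ones.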

End-to-end Boundary Exploration for Weakly-supervised Semantic Segmentation

Jianjun Chen
Shancheng Fang
Hongtao Xie
Zheng-Jun Zha
Yue Hu
Jianlong Tan

Weakly supervised semantic segmentation (WSSS) is challenging because pixel-level
object locations must be acquired with only image-level annotations. In particular,
single-stage methods learn image- and pixel-level labels simultaneously to avoid complicated
multi-stage computations and sophisticated training procedures. In this paper, we
argue that using a single model to accomplish both image- and pixel-level classification
forces a compromise between the two targets and consequently weakens the recognition
capability, because the image-level task tends to learn position-independent features
whereas the pixel-level task tends to be position-sensitive. Hence, we propose an effective
encoder-decoder framework to explore object boundaries and solve the above dilemma.
The encoder and decoder learn position-independent and position-sensitive features
independently during the end-to-end training. In addition, a global soft pooling is
suggested to suppress background pixels' activation for the encoder training and further
improve the class activation map (CAM) performance. The edge annotations for the decoder
training are synthesized from the high-confidence CAMs, which does not require extra
supervision. Extensive experiments on the Pascal VOC12 dataset demonstrate that
our method achieves state-of-the-art performance among end-to-end approaches. It obtains
63.6% and 65.7% mIoU on the val and test sets, respectively.

SFE-Net: EEG-based Emotion Recognition with Symmetrical Spatial Feature Extraction

Xiangwen Deng
Junlin Zhu
Shangming Yang

Emotion recognition based on EEG (electroencephalography) has been widely used in
human-computer interaction, distance education and health care. However, the conventional
methods ignore the adjacent and symmetrical characteristics of EEG signals, which
also contain salient information related to emotion. In this paper, a spatial folding
ensemble network (SFE-Net) is presented for EEG feature extraction and emotion recognition.
Firstly, for the undetected area between EEG electrodes, an improved Bicubic-EEG interpolation
algorithm is developed for EEG channel information completion, which allows us to
extract a wider range of adjacent spatial features. Then, motivated by the spatial symmetry
mechanism of the human brain, we fold the input EEG channel data with five different
symmetrical strategies, which enable the proposed network to extract the
spatial features of EEG signals more effectively. Finally, 3D-CNN-based spatial and
temporal feature extraction and a multi-voting ensemble learning strategy are integrated
into a new neural network. With this network, the spatial features of different
symmetric folding signals can be extracted simultaneously, which greatly improves
the robustness and accuracy of emotion recognition. The experimental results on DEAP
and SEED datasets show that the proposed algorithm has comparable performance in terms
of recognition accuracy.

Bridging the Gap between Low-Light Scenes: Bilevel Learning for Fast Adaptation

Dian Jin
Long Ma
Risheng Liu
Xin Fan

Brightening low-light images of diverse scenes is a challenging but widely studied
task in the multimedia community. Approaches based on Convolutional Neural Networks (CNNs)
mostly acquire the enhancement model by learning the data distribution from specific
scenes. However, these works adapt poorly (or even fail) when meeting real-world
scenarios that were never encountered before. To address this, we develop a novel bilevel
learning scheme for fast adaptation to bridge the gap between low-light scenes. Concretely,
we construct a Retinex-induced encoder-decoder with an adaptive denoising mechanism,
aiming at covering more practical cases. Different from existing works that directly
learn model parameters by using the massive data, we provide a new hyperparameter
optimization perspective to formulate a bilevel learning scheme towards general low-light
scenarios. This scheme depicts the latent correspondence (i.e., scene-irrelevant encoder)
and the respective characteristic (i.e., scene-specific decoder) among different data
distributions. Because estimating the hyper-parameter gradient exactly can be prohibitive
due to the expensive inner optimization, we develop an approximate hyper-parameter gradient
method by introducing a one-step forward approximation and a finite difference approximation
to ensure highly efficient inference. Extensive experiments demonstrate the superiority of
our method over other state-of-the-art methods. A series of analytical experiments
are also conducted to verify its effectiveness.

Handling Difficult Labels for Multi-label Image Classification via Uncertainty Distillation

Liangchen Song
Jialian Wu
Ming Yang
Qian Zhang
Yuan Li
Junsong Yuan

Multi-label image classification aims to predict multiple labels for a single image.
However, the difficulties of predicting different labels may vary dramatically due
to semantic variations of the label as well as the image context. Direct learning
of multi-label classification models risks being biased toward and overfitting
those difficult labels; e.g., deep-network-based classifiers over-trained on the
difficult labels produce false-positive errors on those labels
during testing. To handle difficult labels of multi-label image classification, we
propose to calibrate the model, which not only predicts the labels but also estimates
the uncertainty of the prediction. With the new calibration branch of the network,
the classification model is trained with the pick-all-labels normalized loss and optimized
pertaining to the number of positive labels. Moreover, to improve performance on difficult
labels, instead of annotating them, we leverage the calibrated model as the teacher
network and teach the student network about handling difficult labels via uncertainty
distillation. Our proposed uncertainty distillation teaches the student network which
labels are highly uncertain through prediction distribution distillation, and locates
the image regions that cause such uncertain predictions through uncertainty attention
distillation. Through extensive evaluations on benchmark datasets, we demonstrate
that our proposed uncertainty distillation is valuable for handling difficult labels
in multi-label image classification.
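
For readers unfamiliar with the pick-all-labels normalized loss mentioned above, one common formulation averages the softmax cross-entropy over the positive labels of a sample. The sketch below assumes that formulation and at least one positive label per sample; the function name `pal_n_loss` is hypothetical, and the exact variant used in the paper is not stated in the abstract.

```python
import math


def pal_n_loss(logits, positive_labels):
    """Pick-all-labels normalized loss for one sample (assumed form):
    softmax cross-entropy averaged over the positive labels."""
    # Numerically stable log-softmax over all labels.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    # Normalizing by the number of positives keeps samples with many
    # labels from dominating the loss.
    return -sum(log_probs[i] for i in positive_labels) / len(positive_labels)


loss_single = pal_n_loss([0.0, 0.0], [0])
loss_pair = pal_n_loss([2.0, 2.0, 0.0], [0, 1])
```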

Perception-Oriented Stereo Image Super-Resolution

Chenxi Ma
Bo Yan
Weimin Tan
Xuhao Jiang

Recent deep learning based studies have advanced stereo image super-resolution
(StereoSR). However, existing StereoSR models mainly concentrate
on improving quantitative evaluation metrics and neglect the visual quality of super-resolved
stereo images. To improve the perceptual performance, this paper proposes the first
perception-oriented stereo image super-resolution approach by exploiting the feedback,
provided by the evaluation on the perceptual quality of StereoSR results. To provide
accurate guidance for the StereoSR model, we develop the first special stereo image
super-resolution quality assessment (StereoSRQA) model, and further construct a StereoSRQA
database. Extensive experiments demonstrate that our StereoSR approach significantly
improves the perceptual quality and enhances the reliability of stereo images for
disparity estimation.

ReLLIE: Deep Reinforcement Learning for Customized Low-Light Image Enhancement

Rongkai Zhang
Lanqing Guo
Siyu Huang
Bihan Wen

Low-light image enhancement (LLIE) is a pervasive yet challenging problem, since:
1) low-light measurements may vary due to different imaging conditions in practice;
2) the desired enhancement is subjective and depends on the preference of each
individual. To tackle these two challenges, this paper presents a novel deep reinforcement
learning based method, dubbed ReLLIE, for customized low-light enhancement. ReLLIE
models LLIE as a Markov decision process, i.e., estimating the pixel-wise image-specific
curves sequentially and recurrently. Given the reward computed from a set of carefully
crafted non-reference loss functions, a lightweight network is proposed to estimate
the curves for enlightening a low-light input image. As ReLLIE learns a policy
instead of a one-to-one image translation, it can handle various low-light measurements
and provide customized enhanced outputs by flexibly applying the policy a different
number of times. Furthermore, ReLLIE can enhance real-world images with hybrid corruptions,
e.g., noise, by easily using a plug-and-play denoiser. Extensive experiments on various
benchmarks demonstrate the advantages of ReLLIE compared to the state-of-the-art
methods. (Code is available: https://github.com/GuoLanqing/ReLLIE.)
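
The idea of customizing the output by applying a curve-estimating policy a different number of times can be sketched as follows. The quadratic curve x + a·x·(1 − x) is an assumption borrowed from curve-based enhancement methods (the abstract does not state the curve form), and `constant_policy` is a hypothetical stand-in for the learned network.

```python
def apply_curve(pixels, alphas):
    """Apply an assumed quadratic brightening curve per pixel.

    pixels: intensities in [0, 1]; alphas: per-pixel curve parameters
    in [0, 1]. The output stays in [0, 1] for such inputs.
    """
    return [x + a * x * (1.0 - x) for x, a in zip(pixels, alphas)]


def enhance(pixels, policy, steps):
    """Apply the policy sequentially `steps` times; more steps give a
    brighter result, which is one way to customize the output."""
    for _ in range(steps):
        alphas = policy(pixels)
        pixels = apply_curve(pixels, alphas)
    return pixels


def constant_policy(pixels):
    """Hypothetical stand-in for the learned policy network."""
    return [0.5] * len(pixels)


dark = [0.2, 0.5]
mild = enhance(dark, constant_policy, steps=1)
strong = enhance(dark, constant_policy, steps=3)
```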

Intrinsic Temporal Regularization for High-resolution Human Video Synthesis

Lingbo Yang
Zhanning Gao
Siwei Ma
Wen Gao

Fashion video synthesis has attracted increasing attention due to its huge potential
in immersive media, virtual reality and online retail applications, yet traditional
3D graphic pipelines often require extensive manual labor on data capture and model
rigging. In this paper, we investigate an image-based approach to this problem that
generates a fashion video clip from a still source image of the desired outfit, which
is then rigged in a framewise fashion under the guidance of a driving video. A key
challenge for this task lies in the modeling of feature transformation across source
and driving frames, where fine-grained transform helps promote visual details at garment
regions, but often at the expense of intensified temporal flickering. To resolve this
dilemma, we propose a novel framework with 1) a multi-scale transform estimation and
feature fusion module to preserve fine-grained garment details, and 2) an intrinsic
regularization loss to enforce temporal consistency of learned transform between adjacent
frames. Our solution is capable of generating 512×512 fashion videos with rich
garment details and smooth fabric movements beyond existing results. Extensive experiments
over the FashionVideo benchmark dataset have demonstrated the superiority of the proposed
framework over several competitive baselines.

A2W: Context-Aware Recommendation System for Mobile Augmented Reality Web Browser

Kit Yung Lam
Lik Hang Lee
Pan Hui

Augmented Reality (AR) offers new capabilities for blurring the boundaries between
physical reality and digital media. However, the capabilities of integrating web contents
and AR remain underexplored. This paper presents an AR web browser with an integrated
context-aware AR-to-Web content recommendation service, named the A2W browser, to provide
continuous user-centric web browsing experiences driven by AR headsets. We implement
the A2W browser on an AR headset as our demonstration application, demonstrating the
features and performance of the A2W framework. The A2W browser visualizes the AR-driven
web contents to the user, which are suggested by the content-based filtering model
in our recommendation system. In our experiments, 20 participants using the adaptive
UIs and recommendation system in the A2W browser achieve up to 30.69% time saving compared
to smartphone conditions. Accordingly, with the recommended information, A2W-supported
web browsing on workstations reaches the target information 41.67% faster
than typical web browsing.

Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets
Adversarial Training

Changchong Sheng
Matti Pietikäinen
Qi Tian
Li Liu

The goal of this work is to learn discriminative visual representations for lip reading
without access to manual text annotation. Recent advances in cross-modal self-supervised
learning have shown that the corresponding audio can serve as a supervisory signal
to learn effective visual representations for lip reading. However, existing methods
only exploit the natural synchronization of the video and the corresponding audio.
We find that both video and audio are actually composed of speech-related information,
identity-related information, and modal information. To make the visual representations
(i) more discriminative for lip reading and (ii) indiscriminate with respect to the
identities and modals, we propose a novel self-supervised learning framework called
Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), to go beyond previous
methods by explicitly forcing the visual representations to be disentangled from speech-unrelated
information. Experimental results clearly show that the proposed method outperforms
state-of-the-art cross-modal self-supervised baselines by a large margin. Besides,
ADC-SSL can outperform its supervised counterpart without any fine-tuning.

OsGG-Net: One-step Graph Generation Network for Unbiased Head Pose Estimation

Shentong Mo
Xin Miao

Head pose estimation is a crucial problem that involves the prediction of the Euler
angles of a human head in an image. Previous approaches predict head poses through
landmark detection, which can be applied to multiple downstream tasks. However, previous
landmark-based methods cannot achieve performance comparable to the current landmark-free
methods because they do not model the complex nonlinear relationships between the geometric
distribution of landmarks and head poses. Another reason for the performance bottleneck
is that the underlying distribution of the 3D pose angles in the current
head pose benchmarks is biased. In this work, we propose OsGG-Net, a One-step Graph Generation
Network for estimating head poses from a single image by generating a landmark-connection
graph to model the 3D angle associated with the landmark distribution robustly. To
further ease the angle-biased issues caused by the biased data distribution in learning
the graph structure, we propose the UnBiased Head Pose Dataset, called UBHPD, and
a new unbiased metric, namely UBMAE, for unbiased head pose estimation. We conduct
extensive experiments on various benchmarks and UBHPD where our method achieves the
state-of-the-art results in terms of the commonly-used MAE metric and our proposed
UBMAE. Comprehensive ablation studies also demonstrate the effectiveness of each part
in our approach.

Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Xirong Li
Yang Zhou
Jie Wang
Hailan Lin
Jianchun Zhao
Dayong Ding
Weihong Yu
Youxin Chen

This paper attacks an emerging challenge of multi-modal retinal disease recognition.
Given a multi-modal case consisting of a color fundus photo (CFP) and an array of
OCT B-scan images acquired during an eye examination, we aim to build a deep neural
network that recognizes multiple vision-threatening diseases for the given case. As
the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability
to be both selective and interpretable is important. Moreover, as both data acquisition
and manual labeling are extremely expensive in the medical domain, the network has
to be relatively lightweight for learning from a limited set of labeled multi-modal
samples. Prior art on retinal disease recognition focuses either on a single disease
or on a single modality, leaving multi-modal fusion largely underexplored. We propose
in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing
CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head
attention modules) makes it suited for learning from relatively small-sized datasets.
For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by
oversampling a given CFP. The benefits of this tactic include better balancing of instances
across modalities, increasing the resolution of the CFP input, and finding the regions
of the CFP most relevant to the final diagnosis. Extensive experiments
on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836
subjects demonstrate the viability of the proposed model.

Locally Adaptive Structure and Texture Similarity for Image Quality Assessment

Keyan Ding
Yi Liu
Xueyi Zou
Shiqi Wang
Kede Ma

The latest advances in full-reference image quality assessment (IQA) involve unifying
structure and texture similarity based on deep representations. The resulting Deep
Image Structure and Texture Similarity (DISTS) metric, however, makes rather global
quality measurements, ignoring the fact that natural photographic images are locally
structured and textured across space and scale. In this paper, we describe a locally
adaptive structure and texture similarity index for full-reference IQA, which we term
A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion
index, to localize texture regions at different scales. The estimated probability
(of one patch being texture) is in turn used to adaptively pool local structure and
texture measurements. The resulting A-DISTS is adapted to local image content, and
is free of expensive human perceptual scores for supervised training. We demonstrate
the advantages of A-DISTS in terms of correlation with human data on ten IQA databases
and optimization of single image super-resolution methods.
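
The dispersion index named above is the variance-to-mean ratio. The sketch below computes it for a small patch of non-negative responses and shows one assumed reading of the adaptive pooling step, a probability-weighted mix of structure and texture similarities. The function names, the toy numbers, and the pooling form are hypothetical, and the mapping from dispersion index to texture probability is not given in the abstract.

```python
def dispersion_index(values):
    """Variance-to-mean ratio of a list of non-negative responses."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0  # Convention chosen here for an all-zero patch.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / mean


def adaptive_pool(texture_prob, structure_sim, texture_sim):
    """Assumed pooling rule: weight the texture measurement by the
    probability that the patch is texture, and the structure
    measurement by the complementary probability."""
    return texture_prob * texture_sim + (1.0 - texture_prob) * structure_sim


flat = dispersion_index([2.0, 2.0, 2.0, 2.0])
varied = dispersion_index([1.0, 3.0, 1.0, 3.0])
score = adaptive_pool(0.25, structure_sim=0.8, texture_sim=0.4)
```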

CALLip: Lipreading using Contrastive and Attribute Learning

Yiyang Huang
Xuefeng Liang
Chaowei Fang

Lipreading, aiming at interpreting speech by watching the lip movements of the speaker,
has great significance in human communication and speech understanding. Despite having
reached a feasible performance, lipreading still faces two crucial challenges: 1)
the considerable lip movement variations across different speakers when they utter the
same words; 2) the similar lip movements of speakers when they utter easily confused phonemes.
To tackle these two problems, we propose a novel lipreading framework, CALLip, which
employs attribute learning and contrastive learning. The attribute learning extracts
the speaker identity-aware features through a speaker recognition branch, which are
able to normalize the lip shapes to eliminate cross-speaker variations. Considering
that audio signals are intrinsically more distinguishable than visual signals, the
contrastive learning is devised between visual and audio signals to enhance the discrimination
of visual features and alleviate the viseme confusion problem. Experimental results
show that CALLip does learn better features of lip movements. The comparisons on both
English and Chinese benchmark datasets, GRID and CMLR, demonstrate that CALLip outperforms
six state-of-the-art lipreading methods without using any additional data.

Cross-Modal Recipe Embeddings by Disentangling Recipe Contents and Dish Styles

Yu Sugiyama
Keiji Yanai

Nowadays, cooking recipe sharing sites on the Web are widely used, and play a major
role in everyday home cooking. Since cooking recipes consist of dish photos and recipe
texts, cross-modal recipe search is being actively explored. To enable cross-modal
search, both food image features and cooking text recipe features are embedded into
the same shared space in general. However, in most of the existing studies, a one-to-one
correspondence between a recipe text and a dish image in the embedding space is assumed,
although an unlimited number of photos with different serving styles and different
plates can be associated with the same recipe. In this paper, we propose an RDE-GAN
(Recipe Disentangled Embedding GAN) which separates food image information into a
recipe image feature and a non-recipe shape feature. In addition, we generate a food
image by integrating both the recipe embedding and a shape feature. Since the proposed
embedding is free from serving and plate styles which are unrelated to cooking recipes,
the experimental results showed that it outperformed the existing methods on cross-modal
recipe search. We also confirmed that the shape elements and the recipe elements can each
be changed independently at the time of food image generation.

TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting

Yu Zhou
Hongtao Xie
Shancheng Fang
Jing Wang
Zhengjun Zha
Yongdong Zhang

Recent scene text spotters that integrate a text detection module and a recognition module
have made significant progress. However, existing methods encounter two problems:
(1) the data imbalance between the text detection module and the text recognition module
limits the performance of text spotters; (2) the default left-to-right reading direction
leads to errors in unconventional text spotting. In this paper, we propose a novel
scene text spotter TDI to solve these problems. Firstly, in order to solve the data
imbalance problem, a sample generation algorithm is proposed to generate plenty of
samples online for training the text recognition module by using character features
and character labels. Secondly, a weakly supervised character generation algorithm
is designed to generate character-level labels from word-level labels for the sample
generation algorithm and the training of the text detection module. Finally, in order
to spot arbitrarily arranged text correctly, a direction perception module is proposed
to perceive the reading direction of each text instance. Experiments on several benchmarks
show that these designs can significantly improve the performance of text spotters.
Specifically, our method outperforms state-of-the-art methods on three public datasets
in both text detection and end-to-end text recognition, which demonstrates the effectiveness
and robustness of our method.

Position-Augmented Transformers with Entity-Aligned Mesh for TextVQA

Xuanyu Zhang
Qing Yang

In addition to visual components, many images contain valuable text information,
which is essential for understanding the scene. Thus, we study the TextVQA task that
requires reading texts in images to answer corresponding questions. However, most
previous works utilize sophisticated graph structures and manually crafted features
to model the position relationship between visual entities and texts in images. Moreover,
traditional multimodal transformers cannot effectively capture relative position information
and original image features. To address these issues in an intuitive but effective
way, we propose a novel model, position-augmented transformers with entity-aligned
mesh, for the TextVQA task. Different from traditional attention mechanism in transformers,
we explicitly introduce continuous relative position information of objects and OCR
tokens without complex rules. Furthermore, we replace the complicated graph structure
with intuitive entity-aligned mesh according to perspective mapping. In this mesh,
the information of discrete entities and image patches at different positions can
interact with each other. Extensive experiments on two benchmark datasets (TextVQA
and ST-VQA) show that our proposed model is superior to several state-of-the-art methods.

Learning Contextual Transformer Network for Image Inpainting

Ye Deng
Siqi Hui
Sanping Zhou
Deyu Meng
Jinjun Wang

Fully Convolutional Networks with attention modules have been proven effective for
learning-based image inpainting. While many existing approaches could produce visually
reasonable results, the generated images often show blurry textures or distorted structures
around corrupted areas. The main reason is that convolutional neural
networks have limited capacity for modeling contextual information with long-range
dependencies. Although the attention mechanism can alleviate this problem to some
extent, existing attention modules tend to emphasize similarities between the corrupted
and the uncorrupted regions while ignoring the dependencies from within each of them.
Hence, this paper proposes the Contextual Transformer Network (CTN) which not only
learns relationships between the corrupted and the uncorrupted regions but also exploits
their respective internal closeness. Besides, instead of a fully convolutional network,
in our CTN, we stack several transformer blocks to replace convolution layers to better
model the long range dependencies. Finally, by dividing the image into patches of
different sizes, we propose a multi-scale multi-head attention module to better model
the affinity among various image regions. Experiments on several benchmark datasets
demonstrate superior performance by our proposed approach.

Milliseconds Color Stippling

Lei Ma
Jian Shi
Yanyun Chen

Stippling is a popular and fascinating sketching art in stylized illustrations. Various
digital stippling techniques have been proposed to reduce tedious manual work. In
this paper, we present a novel method to create high-quality color stippling from
an input image in milliseconds. The key idea is to obtain stipples from predetermined
incremental 2D sample sequences, which are generated by algorithms that offer sequential
incrementality and distributional uniformity. Two typical sequences are employed in our
work: one is constructed from incremental Voronoi sets, and the other is from Poisson
disk distributions. A threshold-based algorithm is then applied to determine stipple
appearance and guarantee result quality. We extend color stippling with multitone
level and radius adjustment to achieve improved visual quality. Detailed comparisons
of the two sequences are conducted to explore further the strengths and weaknesses
of the proposed method. For more information, please visit https://gitlab.com/maleiwhat/milliseconds-color-stippling.
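
The thresholding idea can be illustrated with a minimal sketch. It assumes (this is not taken from the paper) that the i-th point of a precomputed incremental sequence of length n is kept when its rank i/n falls below the local darkness of the image at that point. The function name `stipple`, the `toy_darkness` callback, and the random stand-in sequence are all hypothetical.

```python
import random

def stipple(points, darkness):
    """Keep point i of an incremental sequence when rank i/n < darkness(x, y).

    points   -- list of (x, y) in [0, 1)^2, ordered so every prefix is well spread
    darkness -- callable returning a tone in [0, 1] (1 = black) at (x, y)
    """
    n = len(points)
    return [(x, y) for i, (x, y) in enumerate(points) if i / n < darkness(x, y)]

# Stand-in sequence: uniform random points. A real system would use a
# precomputed incremental Voronoi or Poisson disk sequence instead (assumption).
rng = random.Random(0)
sequence = [(rng.random(), rng.random()) for _ in range(1000)]

# Toy image: left half dark (0.8), right half light (0.2).
def toy_darkness(x, y):
    return 0.8 if x < 0.5 else 0.2

stipples = stipple(sequence, toy_darkness)
left = sum(1 for x, _ in stipples if x < 0.5)
right = len(stipples) - left
```

Because the sequence is fixed in advance, the per-image work in this sketch is a single pass of comparisons, which is one plausible reason such a scheme can run in milliseconds.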

AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection

Longyao Liu
Bo Ma
Yulin Zhang
Xin Yi
Haozhi Li

Few-shot object detection (FSOD) aims at learning a detector that can quickly adapt to
previously unseen objects with scarce annotated examples. Existing methods solve this
problem by performing subtasks of classification and localization utilizing a shared
component in the detector, yet few of them take into account the distinct feature-embedding
preferences of the two subtasks. In this paper, we carefully analyze
the characteristics of FSOD, and argue that a few-shot detector should explicitly
decompose the two subtasks, as well as leverage information from
both of them to enhance feature representations. To this end, we propose a simple yet
effective Adaptive Fully-Dual Network (AFD-Net). Specifically, we extend Faster R-CNN
by introducing Dual Query Encoder and Dual Attention Generator for separate feature
extraction, and Dual Aggregator for separate model reweighting. In this way, separate
state estimation is achieved by the R-CNN detector. Furthermore, we introduce Adaptive
Fusion Mechanism to guide the design of encoders for efficient feature fusion in the
specific subtask. Extensive experiments on PASCAL VOC and MS COCO show that our approach
achieves state-of-the-art performance by a large margin, demonstrating its effectiveness
and generalization ability.

Missing Data Imputation for Solar Yield Prediction using Temporal Multi-Modal Variational
Auto-Encoder

Meng Shen
Huaizheng Zhang
Yixin Cao
Fan Yang
Yonggang Wen

The accurate and robust prediction of short-term solar power generation is significant
for the management of modern smart grids, where solar power has become a major energy
source due to its green and economical nature. However, solar yield prediction
can be difficult in the real world, where hardware and network issues can
make sensors unreachable. This missing-data problem is so prevalent that it degrades
the performance of deployed prediction models and can even cause model execution to fail.
In this paper, we propose a novel temporal multi-modal variational auto-encoder (TMMVAE)
model, to enhance the robustness of short-term solar power yield prediction with missing
data. It can impute the missing values in time-series sensor data, and reconstruct
them by consolidating multi-modality data, which then facilitates more accurate solar
power yield prediction. TMMVAE can be deployed efficiently with an end-to-end framework.
The framework is verified at our real-world testbed on campus. The results of extensive
experiments show that our proposed framework can significantly improve the imputation
accuracy when the inference data is severely corrupted, and can hence dramatically
improve the robustness of short-term solar energy yield forecasting.

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li

Pre-trained neural models have recently achieved impressive performance in understanding
multimodal content. However, it is still very challenging to pre-train neural models
for video and language understanding, especially for Chinese video-language data,
due to the following reasons. Firstly, existing video-language pre-training algorithms
mainly focus on the co-occurrence of words and video frames, but ignore other valuable
semantic and structural information of video-language content, e.g., sequential order
and spatiotemporal relationships. Secondly, there exist conflicts between video-sentence
alignment and other proxy tasks. Thirdly, there is a lack of large-scale and high-quality
Chinese video-language datasets (e.g., including 10 million unique videos), which are
fundamental to the success of pre-training techniques. In this work, we propose
a novel video-language understanding framework named Victor, which stands for VIdeo-language
understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks
such as masked language modeling, Victor constructs several novel proxy tasks under
the contrastive learning paradigm, making the model more robust and able to capture
more complex multimodal semantic and structural relationships from different perspectives.
Victor is trained on a large-scale Chinese video-language dataset, including over
10 million complete videos with corresponding high-quality textual descriptions. We
apply the pre-trained Victor model to a series of downstream applications and demonstrate
its superior performance, compared with state-of-the-art pre-training methods
such as VideoBERT and UniVL.

DehazeFlow: Multi-scale Conditional Flow Network for Single Image Dehazing

Hongyu Li
Jia Li
Dong Zhao
Long Xu

Single image dehazing is a crucial preliminary task for many computer vision applications
and has made progress with deep learning. The dehazing task is an ill-posed problem since
the haze in the image leads to the loss of information. Thus, there are multiple feasible
solutions for image restoration of a hazy image. Most existing methods learn a deterministic
one-to-one mapping between a hazy image and its ground-truth, which ignores the ill-posedness
of the dehazing task. To solve this problem, we propose DehazeFlow, a novel single
image dehazing framework based on conditional normalizing flow. Our method learns
the conditional distribution of haze-free images given a hazy image, enabling the
model to sample multiple dehazed results. Furthermore, we propose an attention-based
coupling layer to enhance the expression ability of a single flow step, which converts
natural images into latent space and fuses features of paired data. These designs
enable our model to achieve state-of-the-art performance while considering the ill-posedness
of the task. We carry out extensive experiments on both synthetic datasets and real-world
hazy images to illustrate the effectiveness of our method. The experiments
indicate that DehazeFlow surpasses the state-of-the-art methods in terms of PSNR,
SSIM, LPIPS, and subjective visual effects.
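
To make the notion of a conditional flow step concrete, here is a minimal sketch of a plain affine coupling layer conditioned on side information. It is not the attention-based coupling layer proposed in the paper; the function names, the `toy_scale_shift` network stand-in, and the toy vectors are assumptions chosen for illustration. The point it shows is that the step is exactly invertible and has a cheap log-determinant, which is what lets a flow map images to a latent space and back.

```python
import math

def coupling_forward(x, cond, scale_shift):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = scale_shift(x1, cond)  # parameters depend only on x1 and the condition
    y2 = [v * math.exp(s) + b for v, s, b in zip(x2, log_s, t)]
    return x1 + y2, sum(log_s)  # output and log|det Jacobian|

def coupling_inverse(y, cond, scale_shift):
    """Exact inverse: recompute the same parameters from the untouched half."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = scale_shift(y1, cond)
    x2 = [(v - b) * math.exp(-s) for v, s, b in zip(y2, log_s, t)]
    return y1 + x2

def toy_scale_shift(x1, cond):
    # Hypothetical stand-in for a learned network.
    log_s = [0.1 * (a + c) for a, c in zip(x1, cond)]
    t = [a - c for a, c in zip(x1, cond)]
    return log_s, t

clean = [0.2, 0.5, 0.7, 0.1]   # toy "haze-free" vector
hazy_cond = [0.6, 0.4]         # toy conditioning features from the hazy input
latent, log_det = coupling_forward(clean, hazy_cond, toy_scale_shift)
recovered = coupling_inverse(latent, hazy_cond, toy_scale_shift)
```

Sampling different latent vectors and running the inverse with the same condition is how a model of this kind can produce multiple candidate outputs for one input.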

GCM-Net: Towards Effective Global Context Modeling for Image Inpainting

Huan Zheng
Zhao Zhang
Yang Wang
Zheng Zhang
Mingliang Xu
Yi Yang
Meng Wang

Deep learning-based inpainting methods have obtained promising performance for image
restoration; however, current methods still tend to produce unreasonable
structures and blurry textures when processing heavily corrupted images.
In this paper, we propose a new image inpainting method termed Global Context Modeling
Network (GCM-Net). By capturing the global contextual information, GCM-Net can potentially
improve the performance of recovering the missing region in the damaged images with
irregular masks. To be specific, we first use four convolution layers to extract the
shadow features. Then, we design a progressive multi-scale fusion block termed PMSFB
to extract and fuse the multi-scale features for obtaining local features. Besides,
a dense context extraction (DCE) module is also designed to aggregate the local features
extracted by PMSFBs. To improve the information flow, a channel attention guided residual
learning module is deployed in both the DCE and PMSFB, which can reweight the learned
residual features and refine the extracted information. To capture more global contextual
information and enhance the representation ability, a coordinate context attention
(CCA) based module is also presented. Finally, the extracted features with rich information
are decoded as the image inpainting result. Extensive results on the Paris Street
View, Places2 and CelebA-HQ datasets demonstrate that our method can better recover
the structures and textures, and deliver significant improvements, compared with some
related inpainting methods.

Embracing the Dark Knowledge: Domain Generalization Using Regularized Knowledge Distillation

Yufei Wang
Haoliang Li
Lap-pui Chau
Alex C. Kot

Though convolutional neural networks are widely used in different tasks, lack of generalization
capability in the absence of sufficient and representative data is one of the challenges
that hinders their practical application. In this paper, we propose a simple, effective,
and plug-and-play training strategy named Knowledge Distillation for Domain Generalization
(KDDG) which is built upon a knowledge distillation framework with the gradient filter
as a novel regularization term. We find that both the "richer dark knowledge" from
the teacher network, as well as the gradient filter we proposed, can reduce the difficulty
of learning the mapping which further improves the generalization ability of the model.
We also conduct extensive experiments to show that our framework can significantly
improve the generalization capability of deep neural networks in different tasks, including
image classification, segmentation, and reinforcement learning, by comparing our method
with existing state-of-the-art domain generalization techniques. Last but not
least, we propose to adopt two metrics to analyze our proposed method in order to
better understand how it benefits the generalization capability of
deep neural networks.

Cluster and Scatter: A Multi-grained Active Semi-supervised Learning Framework for
Scalable Person Re-identification

Bingyu Hu
Zheng-Jun Zha
Jiawei Liu
Xierong Zhu
Hongtao Xie

Active learning has recently attracted increasing attention in the task of person
re-identification, due to its unique scalability that not only maximally reduces the
annotation cost but also retains satisfactory performance. Although some preliminary
active learning methods have been explored for the scalable person re-identification task,
they have the following two problems: 1) the inefficiency in the selection process
of image pairs due to the huge search space, and 2) the ineffectiveness caused by
ignoring the impact of unlabeled data in model training. Considering that, we propose
a Multi-grained Active Semi-Supervised learning framework, named MASS, to address
the scalable person re-identification problem in practical scenarios.
Specifically, we first design a cluster-scatter procedure to alleviate the inefficiency
problem, which consists of two components: a cluster step and a scatter step. The cluster
step shrinks the search space into individual small clusters by a coarse-grained clustering
method, and the subsequent scatter step further mines hard-to-distinguish image
pairs from the unlabeled set to purify the learned clusters by a novel centrality-based
adaptive purification strategy. Afterward, we introduce a customized purification
loss for the purified clustering, which utilizes the complementary information in
both labeled and unlabeled data to optimize the model for solving the ineffectiveness
problem. The cluster-scatter procedure and the model optimization are performed in
an iterative fashion to achieve promising performance while greatly reducing the
annotation cost. Extensive experimental results demonstrate that MASS can
achieve performance competitive with fully supervised methods even with extremely
few annotations.

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image
Captioning

Xinzhi Dong
Chengjiang Long
Wenju Xu
Chunxia Xiao

Existing image captioning methods focus only on understanding the relationship between
objects or instances in a single image, without exploring the contextual correlations
that exist among contextual images. In this paper, we propose Dual Graph Convolutional
Networks (Dual-GCN) with transformer and curriculum learning for image captioning.
In particular, we not only use an object-level GCN to capture the object to object
spatial relation within a single image, but also adopt an image-level GCN to capture
the feature information provided by similar images. With the well-designed Dual-GCN,
we can make the linguistic transformer better understand the relationship between
different objects in a single image and make full use of similar images as auxiliary
information to generate a reasonable caption description for a single image. Meanwhile,
with a cross-review strategy introduced to determine difficulty levels, we adopt curriculum
learning as the training strategy to increase the robustness and generalization of
our proposed model. We conduct extensive experiments on the large-scale MS COCO dataset,
and the experimental results powerfully demonstrate that our proposed method outperforms
recent state-of-the-art approaches. It achieves a BLEU-1 score of 82.2 and a BLEU-2
score of 67.6. Our source code is available at https://github.com/Unbear430/DGCN-for-image-captioning.

Build Your Own Bundle - A Neural Combinatorial Optimization Method

Qilin Deng
Kai Wang
Minghao Zhao
Runze Wu
Yu Ding
Zhene Zou
Yue Shang
Jianrong Tao
Changjie Fan

In the business domain, bundling is one of the most important marketing strategies
for product promotion, and it is commonly used by online e-commerce platforms and offline
retailers. Existing recommender systems mostly focus on recommending individual items
that users may be interested in, such as the considerable research work on collaborative
filtering that directly models the interaction between users and items. In this paper,
we target a practical but less explored recommendation problem named personalized
bundle composition, which aims to offer an optimal bundle (i.e., a combination of
items) to the target user. To tackle this specific recommendation problem, we formalize
it as a combinatorial optimization problem on a set of candidate items and solve it
within a neural combinatorial optimization framework. Extensive experiments on public
datasets are conducted to demonstrate the superiority of the proposed method.

Unsupervised Image Deraining: Optimization Model Driven Deep CNN

Changfeng Yu
Yi Chang
Yi Li
Xile Zhao
Luxin Yan

Deep convolutional neural networks have achieved significant progress in single
image rain streak removal. However, most data-driven learning methods are fully supervised
or semi-supervised, and they suffer a significant performance drop when
dealing with real rain. These data-driven learning methods offer strong representation
yet generalize poorly to real rain. The opposite holds true for the model-driven unsupervised
optimization methods. To overcome these problems, we propose a unified unsupervised
learning framework which inherits the generalization and representation merits for
real rain removal. Specifically, we first identify a simple yet important piece of domain knowledge,
namely that directional rain streaks are anisotropic while the natural clean image is isotropic,
and formulate the structural discrepancy into the energy function of the optimization
model. Consequently, we design an optimization model driven deep CNN in which the
unsupervised loss function of the optimization model is enforced on the proposed network
for better generalization. In addition, the architecture of the network mimics the
main role of the optimization models with better feature representation. On one hand,
we take advantage of the deep network to improve the representation. On the other
hand, we utilize the unsupervised loss of the optimization model for better generalization.
Overall, the unsupervised learning framework achieves good generalization and representation:
unsupervised training (loss) with only a few real rainy images (input) and a physically
meaningful network (architecture). Extensive experiments on synthetic and real-world
rain datasets show the superiority of the proposed method.

SESSION: Keynote Talks III&IV

Do you see what I see?: Large-scale Learning from Multimodal Videos

Cordelia Schmid

In this talk we present recent progress on large-scale learning of multimodal video
representations. We start by presenting VideoBERT, a joint model for video and language,
repurposing the BERT model for multimodal data. This model achieves state-of-the-art
results on zero-shot prediction and video captioning. Next we show how to extend learning
from instruction videos to general movies based on cross-modal supervision. We use
movie screenplays to learn speech-to-action classifiers and use these classifiers
to mine video clips from thousands of hours of movies. We demonstrate performance
comparable to or better than fully supervised approaches for action classification. Next
we present an approach for video question answering which relies on training from
instruction videos and cross-modal supervision with a textual question answer module.
We show state-of-the-art results for video question answering without any supervision
(zero-shot VQA) and demonstrate that our approach obtains competitive results for
pre-training and then fine-tuning on video question answering datasets. We conclude
our talk by presenting a recent video feature which is fully transformer based. Our
Video Vision Transformer (ViViT) is shown to outperform the state-of-the-art on video
classification. Furthermore, it is flexible and allows for performance / accuracy
trade-off based on several different architectures.

Large-scale Multi-Modality Pretrained Models: Applications and Experiences

Jingren Zhou

In this talk, we present our experiences and applications of large-scale multi-modality
pretrained models, developed at Alibaba and Ant Group. We first present a cross-modal
pretraining method called M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer)
[1], for unified pretraining on the data of multiple modalities. We scale the model
size up to 1 trillion parameters [2], and build the largest pretrained model in Chinese.
We apply the model to a series of downstream applications, and demonstrate its outstanding
performance in comparison with strong baselines. Furthermore, we specifically design
a downstream task of text-guided image generation [3], and show that the finetuned
M6 can create high-quality images with high resolution and fidelity.

We also present research and applications of image editing with pretrained Generative
Adversarial Networks (GANs). A general principle between the underlying manifold and
the generator is discovered. Based on our discovery, we propose an algorithm for GANs
with low-rank factorization [4], which can be harnessed for image editing with pretrained
GAN models.

SESSION: Session 17: Multimodal Fusion and Embedding-I

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Xiaoqi Zhao
Youwei Pang
Jiaxing Yang
Lihe Zhang
Huchuan Lu

Location and appearance are the key cues for video object segmentation. Many sources
such as RGB, depth, optical flow and static saliency can provide useful information
about the objects. However, existing approaches utilize only RGB, or RGB and optical
flow. In this paper, we propose a novel multi-source fusion network for zero-shot
video object segmentation. With the help of an interoceptive spatial attention module
(ISAM), the spatial importance of each source is highlighted. Furthermore, we design a
feature purification module (FPM) to filter the inter-source incompatible features.
By the ISAM and FPM, the multi-source features are effectively fused. In addition,
we put forward an automatic predictor selection network (APS) to select the better
prediction of either the static saliency predictor or the moving object predictor
in order to prevent over-reliance on the failed results caused by low-quality optical
flow maps. Extensive experiments on three challenging public benchmarks (i.e., DAVIS16,
Youtube-Objects and FBMS) show that the proposed model achieves compelling performance
against state-of-the-art methods. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS

Self-supervised Consensus Representation Learning for Attributed Graph

Changshu Liu
Liangjian Wen
Zhao Kang
Guangchun Luo
Ling Tian

Attempting to fully exploit the rich information of topological structure and node
features in attributed graphs, we introduce a self-supervised learning mechanism into
graph representation learning and propose a novel Self-supervised Consensus Representation
Learning (SCRL) framework. In contrast to most existing works that only explore one
graph, our proposed SCRL method treats a graph from two perspectives: the topology graph
and the feature graph. We argue that their embeddings should share some common information,
which could serve as a supervisory signal. Specifically, we construct the feature
graph of node features via the k-nearest neighbour algorithm. Then graph convolutional
network (GCN) encoders extract features from the two graphs respectively. A self-supervised
loss is designed to maximize the agreement of the embeddings of the same node in the
topology graph and the feature graph. Extensive experiments on real citation networks
and social networks demonstrate the superiority of our proposed SCRL over state-of-the-art
methods on the semi-supervised node classification task. Meanwhile, compared with its
main competitors, SCRL is rather efficient.
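
The feature-graph construction step can be sketched as follows. The abstract only states that a k-nearest neighbour algorithm is used; the Euclidean metric, the symmetrisation of edges, and the function name `knn_feature_graph` are assumptions made for this illustration.

```python
import math

def knn_feature_graph(features, k):
    """Return a symmetric adjacency (dict of sets) linking each node to its
    k nearest neighbours by Euclidean distance between feature vectors."""
    n = len(features)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrise so the graph is undirected (assumption)
    return adj

# Four toy nodes forming two tight pairs in feature space.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
graph = knn_feature_graph(features, k=1)
```

The resulting adjacency would then be fed to one encoder while the original topology is fed to the other, so that the two embeddings of each node can be compared.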

Efficient Multi-Modal Fusion with Diversity Analysis

Shuhui Qu
Yan Kang
Janghwan Lee

Multi-modal machine learning has been a prominent multi-disciplinary research area
since its success in complex real-world problems. Empirically, multi-branch fusion
models tend to generate better results when there is high diversity among the branches
of the model. However, such experience alone neither guarantees the fusion model's
best performance nor has sufficient theoretical support. We present a theoretical
estimate of the fusion models' performance by measuring each branch model's performance
and the distance between branches, based on an analysis of several of the most popular fusion
methods. The theorem is validated empirically by numerical experiments. We further
present a branch model selection framework that uses the theorem to identify candidate branches for
fusion models to achieve the optimal multi-modal performance.
The framework's effectiveness is demonstrated on various datasets by showing how effectively
it selects combinations of branch models that attain superior performance.

GCCN: Geometric Constraint Co-attention Network for 6D Object Pose Estimation

Yongming Wen
Yiquan Fang
Junhao Cai
Kimwa Tung
Hui Cheng

In the 6D object pose estimation task, object models are usually available and represented
as point clouds in the canonical object frame, which are important references for
estimating object poses relative to the camera frame. However, directly introducing object
models as prior knowledge (i.e., the object model point cloud) can cause
perturbations and even degrade pose estimation performance. To make the most of
object model priors and eliminate the problem, we present an end-to-end deep learning
approach called the Geometric Constraint Co-attention Network (GCCN) for 6D object
pose estimation. GCCN is designed to explicitly leverage the object model priors effectively
with the co-attention mechanism. We add explicit geometric constraints to a co-attention
module to inform the geometric correspondence relationships between points in the
scene and object model priors and develop a novel geometric constraint loss to guide
the training. In this manner, our method effectively eliminates the side effect of
directly introducing the object model priors into the network. Experiments on the
YCB-Video and LineMOD datasets demonstrate that our GCCN substantially improves the
performance of pose estimation and is robust against heavy occlusions. We also demonstrate
that GCCN is accurate and robust enough to be deployed in real-world robotic tasks.

Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

Paul Pu Liang
Peter Wu
Liu Ziyin
Louis-Philippe Morency
Ruslan Salakhutdinov

How can we generalize to a new prediction task at test time when it also uses a new
modality as input? More importantly, how can we do this with as little annotated data
as possible? This problem of cross-modal generalization is a new research milestone
with concrete impact on real-world applications. For example, can an AI system start
understanding spoken language from mostly written text? Or can it learn the visual
steps of a new recipe from only text descriptions? In this work, we formalize cross-modal
generalization as a learning paradigm to train a model that can (1) quickly perform
new tasks (from new domains) while (2) being originally trained on a different input
modality. Such a learning paradigm is crucial for generalization to low-resource modalities
such as spoken speech in rare languages while utilizing a different high-resource
modality such as text. One key technical challenge that makes it different from other
learning paradigms such as meta-learning and domain adaptation is the presence of
different source and target modalities which will require different encoders. We propose
an effective solution based on meta-alignment, a novel method to align representation
spaces using strongly and weakly paired cross-modal data while ensuring quick generalization
to new tasks across different modalities. This approach uses key ideas from cross-modal
learning and meta-learning, and presents strong results on the cross-modal generalization
problem. We benchmark several approaches on 3 real-world classification tasks: few-shot
recipe classification from text to images of recipes, object classification from images
to audio of objects, and language classification from text to spoken speech across
100 languages spanning many rare languages. Our results demonstrate strong performance
even when the new target modality has only a few (1-10) labeled samples and in the
presence of noisy labels, a scenario particularly prevalent in low-resource modalities.

Elastic Tactile Simulation Towards Tactile-Visual Perception

Yikai Wang
Wenbing Huang
Bin Fang
Fuchun Sun
Chang Li

Tactile sensing plays an important role in robotic perception and manipulation tasks.
To overcome the real-world limitations of data collection, simulating tactile response
in a virtual environment comes as a desirable direction of robotic research. In this
paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation, which
is capable of reflecting the elastic property of the tactile sensor as well as characterizing
the fine-grained physical interaction during contact. Specifically, EIP models the
tactile sensor as a group of coordinated particles, and the elastic property is applied
to regulate the deformation of particles during contact. With the tactile simulation
by EIP, we further propose a tactile-visual perception network that enables information
fusion between tactile data and visual images. The perception network is based on
a global-to-local fusion mechanism where multi-scale tactile features are aggregated
to the corresponding local region of the visual modality with the guidance of tactile
positions and directions. The fusion method exhibits superiority regarding the 3D
geometric reconstruction task. Our code for EIP is available at https://github.com/yikaiw/EIP.

SESSION: Session 18: Multimodal Fusion and Embedding-II

A Novel Patch Convolutional Neural Network for View-based 3D Model Retrieval

Zan Gao
Yuxiang Shao
Weili Guan
Meng Liu
Zhiyong Cheng
Shengyong Chen

In industrial enterprises, effective retrieval of three-dimensional (3D) computer-aided
design (CAD) models can greatly save time and cost in new product development and
manufacturing; thus, many researchers have focused on it. Recently, many view-based
3D model retrieval methods have been proposed and have achieved state-of-the-art performance.
However, most of these methods focus on extracting more discriminative view-level
features and effectively aggregating the multi-view images of a 3D model, and the
latent relationship among these multi-view images is not fully explored. Thus, we
tackle this problem from the perspective of exploiting the relationships between patch
features to capture long-range associations among multi-view images. To capture associations
among views, in this work, we propose a novel patch convolutional neural network (PCNN)
for view-based 3D model retrieval. Specifically, we first employ a CNN to extract
patch features of each view image separately. Second, a novel neural network module
named PatchConv is designed to exploit intrinsic relationships between neighboring
patches in the feature space to capture long-range associations among multi-view images.
Then, an adaptive weighted view layer is further embedded into PCNN to automatically
assign a weight to each view according to the similarity between each view feature
and the view-pooling feature. Finally, a discrimination loss function is employed
to extract the discriminative 3D model feature, which consists of softmax loss values
generated by the fusion classifier and the specific classifier. Extensive experimental
results on two public 3D model retrieval benchmarks, namely ModelNet40 and ModelNet10,
demonstrate that our proposed PCNN can outperform state-of-the-art approaches, with
mAP values of 93.67% and 96.23%, respectively.
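
The adaptive weighted view layer can be illustrated with a minimal sketch. The abstract states only that each view is weighted by the similarity between its feature and the view-pooling feature; element-wise max pooling, cosine similarity, softmax normalization, and the function name `adaptive_view_fusion` are assumptions here and may differ from the paper's design.

```python
import math

def adaptive_view_fusion(view_features):
    """Fuse per-view feature vectors, weighting each view by its similarity
    to the view-pooling feature."""
    dim = len(view_features[0])
    # View pooling: element-wise max across views (assumption).
    pooled = [max(v[d] for v in view_features) for d in range(dim)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    sims = [cosine(v, pooled) for v in view_features]
    # Softmax turns similarities into weights that sum to 1.
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * v[d] for w, v in zip(weights, view_features)) for d in range(dim)]
    return weights, fused

# Three toy views with two-dimensional features.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, fused = adaptive_view_fusion(views)
```

In this toy input the middle view is closest in direction to the pooled feature, so it receives the largest weight.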

Semi-Autoregressive Image Captioning

Xu Yan
Zhengcong Fei
Zekang Li
Shuhui Wang
Qingming Huang
Qi Tian

Current state-of-the-art approaches for image captioning typically adopt an autoregressive
manner, i.e., generating descriptions word by word, which suffers from slow decoding
and becomes a bottleneck in real-time applications. Non-autoregressive image
captioning with continuous iterative refinement, which eliminates the sequential dependence
in sentence generation, can achieve performance comparable to its autoregressive
counterparts with considerable acceleration. Nevertheless, based on a well-designed
experiment, we empirically show that the number of iterations can be effectively reduced
when sufficient prior knowledge is provided to the language decoder. Towards that end,
we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning
(SAIC), to make a better trade-off between performance and speed. The proposed SAIC
model maintains the autoregressive property globally but relaxes it locally. Specifically,
the SAIC model first generates an intermittent sequence in an autoregressive manner,
that is, it predicts the first word in every word group in order. Then, with the help
of the partially deterministic prior information and image features, the SAIC model non-autoregressively
fills in all the skipped words in one iteration. Experimental results on the MS COCO
benchmark demonstrate that our SAIC model outperforms the preceding non-autoregressive
image captioning models while obtaining a competitive inference speedup.
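
The two-stage decoding order described above can be sketched as a small control-flow skeleton. This is only an illustration: the two predictor callables, the fixed group size, and the toy reference caption are hypothetical stand-ins for the trained model, and the image features are omitted.

```python
from typing import Callable, List, Optional

def semi_autoregressive_decode(
    length: int,
    group_size: int,
    next_anchor: Callable[[List[str]], str],
    fill_all: Callable[[List[Optional[str]]], List[str]],
) -> List[str]:
    """Stage 1 predicts the first word of each group in order;
    stage 2 fills every skipped position in a single pass."""
    # Stage 1: autoregressive over group-leading words only.
    anchors: List[str] = []
    for _ in range(0, length, group_size):
        anchors.append(next_anchor(anchors))

    # Place the anchors at the start of each group; the rest stay empty.
    draft: List[Optional[str]] = [None] * length
    for group_index, word in enumerate(anchors):
        draft[group_index * group_size] = word

    # Stage 2: one non-autoregressive pass fills all remaining slots.
    return fill_all(draft)

# Toy stand-ins that read from a fixed caption instead of a real model.
REFERENCE = "a man rides a horse on beach".split()
GROUP_SIZE = 3
calls = {"anchor": 0, "fill": 0}

def toy_next_anchor(prefix: List[str]) -> str:
    calls["anchor"] += 1
    return REFERENCE[len(prefix) * GROUP_SIZE]

def toy_fill_all(draft: List[Optional[str]]) -> List[str]:
    calls["fill"] += 1
    return [REFERENCE[i] if w is None else w for i, w in enumerate(draft)]

caption = semi_autoregressive_decode(
    len(REFERENCE), GROUP_SIZE, toy_next_anchor, toy_fill_all
)
```

For a seven-word caption with groups of three, the sequential stage runs three times instead of seven, and the fill stage runs once, which is where the speedup over fully autoregressive decoding would come from.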

One-Stage Incomplete Multi-view Clustering via Late Fusion

Yi Zhang
Xinwang Liu
Siwei Wang
Jiyuan Liu
Sisi Dai
En Zhu

As a representative of multi-view clustering (MVC), the late fusion MVC (LF-MVC) algorithm
has attracted intensive attention due to its superior clustering accuracy and high
computational efficiency. One common assumption adopted by existing LF-MVC algorithms
is that all views of each sample are available. However, it is widely observed that
some samples have incomplete views in practice. In this paper, we propose
One-Stage Late Fusion Incomplete Multi-view Clustering (OS-LF-IMVC) to address this
issue. Specifically, we propose to unify the imputation of incomplete views and the
clustering task into a single optimization procedure, so that the learning of the
consensus partition matrix can directly assist the final clustering task. To solve
the resulting optimization problem, we develop a five-step alternating strategy with
theoretically proven convergence. Comprehensive experiments on multiple benchmark
datasets are conducted to demonstrate the efficiency and effectiveness of the proposed
OS-LF-IMVC algorithm.

Self-Representation Subspace Clustering for Incomplete Multi-view Data

Jiyuan Liu
Xinwang Liu
Yi Zhang
Pei Zhang
Wenxuan Tu
Siwei Wang
Sihang Zhou
Weixuan Liang
Siqi Wang
Yuexiang Yang

Incomplete multi-view clustering is an important research topic in multimedia where
partial data entries of one or more views are missing. Current subspace clustering
approaches mostly employ matrix factorization on the observed feature matrices to
address this issue. Meanwhile, the self-representation technique is left unexplored, since
it explicitly relies on full data entries to construct the coefficient matrix, which
is contradictory to the incomplete data setting. However, it is widely observed that
the self-representation subspace method enjoys better clustering performance than the
factorization-based one. Therefore, we adapt it to incomplete data by jointly performing
data imputation and self-representation learning. To the best of our knowledge, this
is the first such attempt in the incomplete multi-view clustering literature. Besides, the
proposed method is carefully compared with current advances in experiments with respect
to different missing ratios, verifying its effectiveness.

Is Visual Context Really Helpful for Knowledge Graph? A Representation Learning Perspective

Meng Wang
Sen Wang
Han Yang
Zheng Zhang
Xi Chen
Guilin Qi

The visual modality has recently attracted extensive attention in the fields of knowledge
graph and multimedia because a lot of real-world knowledge is multi-modal in nature.
However, it is currently unclear to what extent the visual modality can improve the
performance of knowledge graph tasks over unimodal models, and equally treating structural
and visual features may encode too much irrelevant information from images. In this
paper, we probe the utility of the auxiliary visual context from a knowledge graph representation
learning perspective by designing a Relation Sensitive Multi-modal Embedding model,
RSME for short. RSME can automatically encourage or filter the influence of visual
context during the representation learning. We also examine the effect of different
visual feature encoders. Experimental results validate the superiority of our approach
compared to the state-of-the-art methods. On the basis of in-depth analysis, we conclude
that under appropriate circumstances models are capable of leveraging the visual input
to generate better knowledge graph embeddings and vice versa.

Knowledge Perceived Multi-modal Pretraining in E-commerce

Yushan Zhu
Huaixiao Zhao
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen

In this paper, we address multi-modal pretraining of product data in the field of
E-commerce. Current multi-modal pretraining methods proposed for image and text modalities
lack robustness in the face of modality-missing and modality-noise, which are two
pervasive problems of multi-modal product data in real E-commerce scenarios. To this
end, we propose a novel method, K3M, which introduces a knowledge modality into multi-modal
pretraining to correct noise in the image and text modalities and compensate for their absence.
The modal-encoding layer extracts the features of each modality. The modal-interaction
layer is capable of effectively modeling the interaction of multiple modalities, where
an initial-interactive feature fusion model is designed to maintain the independence
of image modality and text modality, and a structure aggregation module is designed
to fuse the information of image, text, and knowledge modalities. We pretrain K3M
with three pretraining tasks, including masked object modeling (MOM), masked language
modeling (MLM), and link prediction modeling (LPM). Experimental results on a real-world
E-commerce dataset and a series of product-based downstream tasks demonstrate that
K3M achieves significant performance improvements over the baseline and state-of-the-art
methods when modality-noise or modality-missing exists.

SESSION: Session 19: Video Program and Demo Session

Text2Video: Automatic Video Generation Based on Text Scripts

Yipeng Yu
Zirui Tu
Longyu Lu
Xiao Chen
Hui Zhan
Zixun Sun

To make video creation simpler, in this paper we present Text2Video, a novel system
to automatically produce videos using only text-editing for novice users. Given an
input text script, the director-like system can generate game-related engaging videos
which illustrate the given narrative, provide diverse multi-modal content, and follow
video editing guidelines. The system involves five modules: (1) A material manager
extracts highlights from raw live game videos, and tags each video highlight, image
and audio with labels. (2) A natural language processor extracts entities and semantics
from the input text scripts. (3) A refined cross-modal retrieval searches for matching
candidate shots from the material manager. (4) A text to speech speaker reads the
processed text scripts with synthesized human voice. (5) The selected material shots
and synthesized speech are assembled artistically through appropriate video editing
techniques.

A System for Interactive and Intelligent AD Auxiliary Screening

Sen Yang
Qike Zhao
Lanxin Miao
Min Chen
Lianli Gao
Jingkuan Song
Weidong Le

The Montreal Cognitive Assessment (MoCA) test is an auxiliary medical screening method
for Alzheimer's disease (AD). In the traditional process, a testee is required
to complete several test items on a paper questionnaire under the guidance of
medical staff. This process is inefficient and depends largely on the doctor's subjective
judgment and experience level. Therefore, we propose an Interactive and Intelligent
AD Auxiliary Screening (IAS) system consisting of a speech-based Interactive Unit Testing
Module (IUTM) and a truth-based Intelligent Analysis Module (IAM), both of which are
developed with deep learning techniques. Following the guidance of voice commands, the
testee can complete the MoCA test independently in IUTM using just a mobile device,
and the testing data is then analyzed accurately and objectively by IAM. Moreover,
compared to the traditional method, the electronic system makes it easier to collect
and analyze clinical data for further research. The system was deployed in the Department
of Neurology, Sichuan Provincial People's Hospital in June 2021 and has been used
in the clinical screening of Alzheimer's disease.

Move As You Like: Image Animation in E-Commerce Scenario

Borun Xu
Biao Wang
Jiale Tao
Tiezheng Ge
Yuning Jiang
Wen Li
Lixin Duan

Creative image animations are attractive in e-commerce applications, where motion
transfer is one of the important ways to generate animations from static images. However,
existing methods rarely transfer motion to objects other than the human body or human
face, and even fewer apply motion transfer in practical scenarios. In this work, we
apply motion transfer to Taobao product images in a real e-commerce scenario to
generate creative animations, which are more attractive than static images and
will bring more benefits. We animate Taobao products, namely dolls, copper running
horses and toy dinosaurs, with a motion transfer method for demonstration.

MDMS: Music Data Matching System for Query Variant Retrieval

Rinita Roy
Ruben Mayer
Hans-Arno Jacobsen

The distribution of royalty fees to music right holders is slow and inefficient due
to the lack of automation in music recognition and music licensing processes. The
challenge for an improved system is to recognise different versions of a music work, such
as remix or cover versions, leading to clear assessment and unique identification
of each music work. Through our music data matching system called MDMS, we query many
indexed and stored music pieces with a small part of a music piece. The system retrieves
the closest stored variant of the input query by using music fingerprints of the underlying
melody together with signal processing techniques. Tailored indices based on fingerprint
hashes accelerate processing across a large corpus of stored music. Results are found
even if the stored versions vary from the query song in terms of one or more music
features --- tempo, key/mode, presence of instruments/vocals, and singer --- and the
differences are highlighted in the output.

Community Generated VR Painting using Eye Gaze

Mu Mu
Murtada Dohan

The social experience is an important part of art exhibitions. This demo introduces
an eye-gaze based generative art prototype for virtual reality (VR) art exhibitions.
Our work extends the visitors' experience from individual art exploration to
content co-creation. The design generates live community artworks based on all visitors'
visual interactions with VR paintings. During our VR exhibition at a public gallery,
over 100 visitors participated in the new creative process for community-generated
artworks.

Sync Glass: Virtual Pouring and Toasting Experience with Multimodal Presentation

Yuki Tajima
Toshiharu Horiuchi
Gen Hattori

One of the challenges of non-face-to-face communication is the absence of the haptic
dimension. To solve this, a haptic communication system via the Internet has been
proposed. The system has to be designed in such a way that it does not create discomfort
during general use. The "Sync Glass" that we have developed transmits and presents
the feeling of pouring a drink and making a toast accompanied by haptic, sound and
visual effects. The device is designed to resemble a glass cup and, moreover, each
action, including drinking and making a toast is performed in the customary way, making
its use more acceptable to users. In the internal user demonstrations we performed,
participants reviewed the experience with statements such as "the feeling of pouring
is so realistic", "so enjoyable!", and similar affirmative remarks.

VideoDiscovery: An Automatic Short-Video Generation System for E-commerce Live-streaming

Yanhao Zhang
Qiang Wang
Yun Zheng
Pan Pan
Yinghui Xu

We demonstrate an end-to-end intelligent system of short-video generation for live-streaming,
namely "VideoDiscovery", which aims to automatically produce batches of high-value
short-videos by discovering and organizing highlight content for commodity delivery.
Traditionally, production of high-value short-videos for live-streaming is costly
and time-consuming, and also demands experienced editing skills. To this end, we
construct this system with three modules: 1) Semantic segment structuring first decodes
live-streaming into a series of semantic candidates including commodity, Q&A, action,
multi-modal, etc. 2) Hierarchical search engine automatically searches for
semantically matching candidate shots from scripts. 3) Script-aware shot assembly is
formulated as a combination problem over a graph of shots, considering temporal constraints
and candidate idioms. Specifically, given an input live-stream, the recommended
video results illustrate diverse visual-semantic content, and follow script guidelines.
Currently, our system has been launched online for Taobao stores, enabling them to
generate appealing videos in minutes for advertising and recommendation. The entry
point to our system is available at https://discovery.aliyun.com/index.

SmartSales: An AI-Powered Telemarketing Coaching System in FinTech

Yuanfeng Song
Xuefang Zhao
Di Jiang
Xiaoling Huang
Weiwei Zhao
Qian Xu
Raymond Chi-Wing Wong
Qiang Yang

Telemarketing is a primary and mature method for enterprises to solicit prospective
customers to buy products or services. However, training telesales representatives
is always a pain point for enterprises since it is usually conducted manually and
costs great effort and time. In this demonstration, we propose a telemarketing coaching
system named SmartSales to help enterprises develop better salespeople. Powered by
artificial intelligence (AI), SmartSales aims to accumulate experienced sales
pitches from customer-sales dialogues and use them to coach junior salespersons. To the
best of our knowledge, this is the first practice of an AI telemarketing coaching
system in the domain of Chinese FinTech in the literature. SmartSales has been successfully
deployed in WeBank's telemarketing team. We expect that SmartSales will inspire
more research on AI assistant systems.

SmartMeeting: Automatic Meeting Transcription and Summarization for In-Person Conversations

Yuanfeng Song
Di Jiang
Xuefang Zhao
Xiaoling Huang
Qian Xu
Raymond Chi-Wing Wong
Qiang Yang

Meetings are a necessary part of the operations of any institution, whether they are
held online or in-person. However, meeting transcription and summarization are always
painful requirements since they involve tedious human effort. This drives the need
for automatic meeting transcription and summarization (AMTS) systems. A successful
AMTS system relies on systematic integration of multiple natural language processing
(NLP) techniques, such as automatic speech recognition, speaker identification, and
meeting summarization, which are traditionally developed separately and validated
offline with standard datasets. In this demonstration, we provide a novel productive
meeting tool named SmartMeeting, which enables users to automatically record, transcribe,
summarize, and manage the information in an in-person meeting. SmartMeeting transcribes
every word on the fly, enriches the transcript with speaker identification and voice
separation, and extracts essential decisions and crucial insights automatically. In
our demonstration, the audience can experience the great potential of the state-of-the-art
NLP techniques in this real-life application.

Aesthetic Evaluation and Guidance for Mobile Photography

Hao Lou
Heng Huang
Chaoen Xiao
Xin Jin

Nowadays, almost everyone can shoot photos using smartphones. However, not everyone
can take good photos. We propose to use computational aesthetics to automatically
teach people without photography training to take excellent photos. We present Aesthetic
Dashboard: a system of rich aesthetic evaluation and guidance for mobile photography.
We consider the two most common types of photos: landscapes and portraits.
When people take photos in the preview mode, for landscapes, we show the overall aesthetic
score and the scores of three basic attributes: light, composition and color usage. Meanwhile,
the matching scores between the three basic attributes of the current preview and typical templates
are shown, which can help users adjust these attributes accordingly. For portraits,
besides the above basic attributes, the facial appearance, guidance on face light,
body pose and garment color are also shown to the users. This is the first system
that can teach mobile users to shoot good photos in the form of an aesthetic dashboard,
through which users can adjust several aesthetic attributes to take good photos easily.

A Question Answering System for Unstructured Table Images

Wenyuan Xue
Siqi Cai
Wen Wang
Qingyong Li
Baosheng Yu
Yibing Zhan
Dacheng Tao

Question answering over tables is a very popular semantic parsing task in natural
language processing (NLP). However, few existing methods focus on table images, even
though there are usually large-scale unstructured tables in practice (e.g., table
images). Table parsing from images is nontrivial since it relies not
only on NLP but also on computer vision (CV) to parse the tabular structure from an image.
In this demo, we present a question answering system for unstructured table images.
The proposed system mainly consists of 1) a table recognizer to recognize the tabular
structure from an image and 2) a table parser to generate the answer to a natural
language question over the table. In addition, to train the model, we further provide
table images and structure annotations for two widely used semantic parsing datasets.
Specifically, the test set is used for this demo, from which users can either
choose from default questions or enter a new custom question.

Post2Story: Automatically Generating Storylines from Microblogging Platforms

Xujian Zhao
Chongwei Wang
Peiquan Jin
Hui Zhang
Chunming Yang
Bo Li

In this paper, we demonstrate Post2Story, which aims to detect events and generate
storylines on microblog posts. Post2Story has several new features: (1) It proposes
to employ social influence to extract events from microblogs. (2) It presents a new
Event Graph Convolutional Network (E-GCN) model to learn the latent relationships
among events, which can help predict the story branch of an event and link events.
(3) It offers a user-friendly interface to extract and visualize the development of
events. After an introduction to the system architecture and key technologies of Post2Story,
we demonstrate the functionalities of Post2Story on a real dataset.

ViDA-MAN: Visual Dialog with Digital Humans

Tong Shen
Jiawei Zuo
Fan Shi
Jin Zhang
Liqin Jiang
Meng Chen
Zhengchen Zhang
Wei Zhang
Xiaodong He
Tao Mei

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which
offers realtime audio-visual responses to instant speech inquiries. Compared to traditional
text or voice-based systems, ViDA-MAN offers human-like interactions (e.g., vivid voice,
natural facial expressions and body gestures). Given a speech request, the demonstration
is able to respond with high-quality videos at sub-second latency. To deliver an immersive
user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Automatic
Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), and talking-head video
generation. Backed by a large knowledge base, ViDA-MAN is able to chat with users
on a number of topics including chit-chat, weather, device control, news recommendations,
and hotel booking, as well as answer questions via structured knowledge.

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

Yupan Huang
Bei Liu
Jianlong Fu
Yutong Lu

A creative image-and-text generative AI system mimics humans' extraordinary abilities
to provide users with diverse and comprehensive caption suggestions, as well as rich
image creations. In this work, we demonstrate such an AI creation system to produce
both diverse captions and rich images. When users imagine an image and associate it
with multiple captions, our system paints a rich image to reflect all captions faithfully.
Likewise, when users upload an image, our system depicts it with multiple diverse
captions. We propose a unified multi-modal framework to achieve this goal. Specifically,
our framework jointly models image-and-text representations with a Transformer network,
which supports rich image creation by accepting multiple captions as input. We consider
the relations among input captions to encourage diversity in training and adopt a
non-autoregressive decoding strategy to enable real-time inference. Based on these,
our system supports the generation of both diverse captions and rich images. Our code is
available online.

Softly: Simulated Empathic Touch between an Agent and a Human

Maxime Grandidier
Fabien Boucaud
Indira Thouvenin
Catherine Pelachaud

RecipeLog: Recipe Authoring App for Accurate Food Recording

Akihisa Ishino
Yoko Yamakata
Hiroaki Karasawa
Kiyoharu Aizawa

Diet management is usually conducted by recording the names of foods eaten, but in
fact, the nutritional value of foods with the same name varies greatly from recipe to
recipe. To know the accurate nutritional values of foods, recording personal recipes
is effective but time-consuming. Therefore, we are developing a mobile application,
"RecipeLog", that helps users write their own recipes by modifying prepared ones.
In our experiments, we show that with RecipeLog users create personal recipes with
45% less edit distance compared to writing from scratch.
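
For readers unfamiliar with the metric, edit distance can be computed with the standard Levenshtein dynamic program. The sketch below is a generic implementation; whether the authors measure distance over characters, words, or recipe steps is not stated in the abstract, so the token-level usage and the example recipes are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y, or keep it
            ))
        prev = curr
    return prev[-1]

# Hypothetical example: adapting a prepared recipe versus typing from scratch.
prepared = "add 2 eggs".split()
personal = "add 3 eggs and milk".split()
from_prepared = edit_distance(prepared, personal)
from_scratch = edit_distance([], personal)
```

Starting from an empty recipe costs one insertion per token, so any prepared recipe that shares tokens with the target lowers the number of edits.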

iART: A Search Engine for Art-Historical Images to Support Research in the Humanities

Matthias Springstein
Stefanie Schneider
Javad Rahnama
Eyke Hüllermeier
Hubertus Kohle
Ralph Ewerth

In this paper, we introduce iART: an open Web platform for art-historical research
that facilitates the process of comparative vision. The system integrates various
machine learning techniques for keyword- and content-based image retrieval as well
as category formation via clustering. An intuitive GUI helps users define queries
and explore results. By using a state-of-the-art cross-modal deep learning approach,
it is possible to search for concepts that were not previously detected by trained
classification models. Art-historical objects from large, openly licensed collections
such as the Amsterdam Rijksmuseum and Wikidata are made available to users.

ArtiVisual: A Platform to Generate and Compare Art

Jardenna Mohazzab
Abe Vos
Jonathan van Westendorp
Lucas Lageweg
Dylan Prins
Aritra Bhowmik

ArtiVisual is a platform for generating new art pieces based on an existing art style
and comparing commonalities between paintings from different eras. We combine an image
generative network with established state-of-the-art visualisation techniques to deepen
the users' understanding of art in general. With ArtiVisual we can generate images
based on art styles via an interactive timeline. Common features between art styles
are reflected in the generated art piece produced by the network after learning the
subspace of each artist's specific features. Visualisations are presented to provide
insight into commonalities between existing and generated images. The combination
of a trained network and our visualisation techniques provides a rigid framework for
thorough exploration and understanding of art datasets.

GCNIllustrator: Illustrating the Effect of Hyperparameters on Graph Convolutional Networks

Ivona Najdenkoska
Jeroen den Boef
Thomas Schneider
Justo van der Werf
Reinier de Ridder
Fajar Fathurrahman
Marcel Worring

An increasing number of real-world applications are using graph-structured datasets,
imposing challenges to existing machine learning algorithms. Graph Convolutional Networks
(GCNs) are deep learning models, specifically designed to operate on graphs. One of
the most tedious steps in training GCNs is the choice of the hyperparameters, especially
since they exhibit unique properties compared to other neural models. Not only machine
learning beginners, but also experienced practitioners often have difficulty
properly tuning their models. We hypothesize that having a tool that visualizes the
effect of hyperparameter choices on the performance can accelerate model development
and improve the understanding of these black-box models. Additionally, observing clusters
of certain nodes helps to empirically understand how a given prediction was made due
to the feature propagation step of GCNs. Therefore, this demo introduces GCNIllustrator,
a web-based visual analytics tool for illustrating the effect of hyperparameters
on the predictions in a citation graph.

On-demand Action Detection System using Pose Information

Noboru Yoshida
Jianquan Liu

Human action detection is a very important yet difficult task for various multimedia
applications such as safety surveillance, sports video analysis and video editing
in the media industry. Most existing methods proposed for action detection are machine
learning based approaches, for which preparing annotated training data is highly time-consuming
and costly. Thus, it is still very difficult to apply these methods to
industrial applications where the actions of interest might happen rarely in real
scenarios, such as criminal or suspicious behaviors, because it is impossible to collect
a large amount of such training data for target actions. In this paper, we
abandon these conventional methods and instead adopt an on-demand retrieval
approach using pose information to handle the action detection task. We introduce
a demo system that can detect similar actions immediately, given a sample video of a few
seconds, without any training process. The system demonstrates the usability and
efficacy of our on-demand approach for human action detection. The experimental results
show that our approach outperforms the state-of-the-art method in
precision and recall, with improvements of up to 11% and 6.1%, respectively.

APF: An Adversarial Privacy-preserving Filter to Protect Portrait Information

Xian Zhao
Jiaming Zhang
Xiaowen Huang

While widely adopted in practical applications, face recognition has been disputed
over the malicious use of face images and potential privacy issues. Online photo sharing
services unintentionally serve as the main channel through which malicious crawlers exploit
face recognition to access portrait privacy. In this demo, we propose an adversarial
privacy-preserving filter, which can protect face images from malicious face recognition
algorithms. This filter is generated by an end-cloud collaborated adversarial attack
framework consisting of three modules: (1) Image-specific gradient generation module,
to extract image-specific gradient in the user end; (2) Adversarial gradient transfer
module, to fine-tune the image-specific gradient in the server; and (3) Universal
adversarial perturbation enhancement module, to append image-independent perturbation
to derive the final adversarial perturbation. A short video about our system is available
at https://github.com/Anonymity-for-submission/3247.

Text-driven 3D Avatar Animation with Emotional and Expressive Behaviors

Li Hu
Jinwei Qi
Bang Zhang
Pan Pan
Yinghui Xu

Text-driven 3D avatar animation has been an essential part of virtual human techniques,
which has a wide range of applications in movies, digital games and video streaming.
In this work, we introduce a practical system which drives both facial and body movements
of a 3D avatar from text input. Our proposed system first converts text input to a speech
signal and simultaneously conducts text analysis to extract semantic tags. Then we
generate the lip movements from the synthetic speech, while facial expressions
and body movements are generated by the joint modeling of speech and textual information,
which can drive our virtual 3D avatar to talk and act like a real human.

Text to Scene: A System of Configurable 3D Indoor Scene Synthesis

Xinyan Yang
Fei Hu
Long Ye

In this work, we show the Text to Scene system, which can configure a 3D indoor scene
from natural language. Given a text, the system organizes the semantic information it contains
into a graph template, completes the graph with a novel graph-based contextual completion
method, Contextual ConvE (CConvE), and visualizes the graph by arranging 3D models under
an object location protocol. In the experiments, qualitative results obtained by the
Text to Scene (T2S) system and a quantitative evaluation of CConvE compared with other
state-of-the-art approaches are reported.

MovieREP: A New Movie Reproduction Framework for Film Soundtrack

Ruiqi Wang
Long Ye
Qin Zhang

Film sound reproduction is the process of converting the image-form film soundtrack
to wave-form movie sound. In this paper, a novel optical imaging based reproduction
framework is proposed with the basic idea of restoring film audio damage in the
image domain. In the traditional reproduction method, the scanning light emitted by the film
projector causes irreversible physical damage to the flammable film soundtrack (made
of nitrate compounds). By using an optical imaging method for film soundtrack capturing,
our framework can avoid the damage and the self-ignition problem. Experimental results
show that our framework can double the reproduction speed while maintaining
equal sound quality. Also, the sound sampling rate can be enhanced to 162.08%.

SESSION: Session 20: Multimodal Fusion and Embedding-III

DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation

Li Gao
Jing Zhang
Lefei Zhang
Dacheng Tao

Unsupervised domain adaptation (UDA) for semantic segmentation aims to adapt a segmentation
model trained on the labeled source domain to the unlabeled target domain. Existing
methods try to learn domain invariant features while suffering from large domain gaps
that make it difficult to correctly align discrepant features, especially in the initial
training phase. To address this issue, we propose a novel Dual Soft-Paste (DSP) method
in this paper. Specifically, DSP selects some classes from a source domain image using
a long-tail class first sampling strategy and softly pastes the corresponding image
patch on both the source and target training images with a fusion weight. Technically,
we adopt the mean teacher framework for domain adaptation, where the pasted source
and target images go through the student network while the original target image goes
through the teacher network. Output-level alignment is carried out by aligning the
probability maps of the target fused image from both networks using a weighted cross-entropy
loss. In addition, feature-level alignment is carried out by aligning the feature
maps of the source and target images from student network using a weighted maximum
mean discrepancy loss. DSP helps the model learn domain-invariant features
from the intermediate domains, leading to faster convergence and better performance.
Experiments on two challenging benchmarks demonstrate the superiority of DSP over
state-of-the-art methods. Code is available at https://github.com/GaoLii/DSP.
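
The soft-paste step can be pictured as a masked, weighted blend of a source patch into another image. The following minimal sketch assumes a per-pixel binary class mask and a linear blend with a scalar fusion weight; the exact formulation, the sampling strategy, and the function name are assumptions rather than the authors' released implementation.

```python
def soft_paste(source, target, mask, weight):
    """Blend source pixels into target wherever mask is 1.

    source, target: 2-D lists of pixel intensities with the same shape.
    mask: 2-D list of 0/1 flags marking the selected source classes.
    weight: fusion weight in [0, 1]; 1.0 reduces to a hard paste.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = []
    for s_row, t_row, m_row in zip(source, target, mask):
        fused.append([
            weight * m * s + (1.0 - weight * m) * t
            for s, t, m in zip(s_row, t_row, m_row)
        ])
    return fused

# Toy 2x2 images: paste the diagonal of the source with weight 0.8.
source = [[10.0, 10.0], [10.0, 10.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 1]]
fused = soft_paste(source, target, mask, 0.8)
```

Pixels outside the mask keep the target value unchanged, while masked pixels become a mixture of the two images, which is one way to form an intermediate-domain image of the kind the abstract mentions.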

Generating Point Cloud from Single Image in The Few Shot Scenario

Yu Lin
Jinghui Guo
Yang Gao
Yi-fan Li
Zhuoyi Wang
Latifur Khan

Reconstructing point clouds from images would greatly benefit many practical CV
applications, such as robotics, automated vehicles, and augmented reality. Fueled
by advances in deep neural networks, many deep learning frameworks have recently been
proposed to address this problem. However, these frameworks generally rely on a large
amount of labeled training data (e.g., image and point cloud pairs). Although we usually
have numerous 2D images, corresponding 3D shapes are insufficient in practice. In
addition, most available 3D data covers only a limited number of classes, which further
restricts the models' generalization ability to novel classes. To mitigate these issues,
we propose a novel few-shot single-view point cloud generation framework by considering
both class-specific and class-agnostic 3D shape priors. Specifically, we abstract
each class by a prototype vector that embeds class-specific shape priors. Class-agnostic
shape priors are modeled by a set of learnable shape primitives that encode universal
3D shape information shared across classes. Later, we combine the input image with
class-specific prototypes and class-agnostic shape primitives to guide the point cloud
generation process. Experiments on the popular ModelNet and ShapeNet datasets demonstrate
that our method outperforms state-of-the-art methods in the few-shot setting.

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang

Translating e-commerce product descriptions, a.k.a. product-oriented machine translation
(PMT), is essential to serve e-shoppers all over the world. However, due to the domain
specialty, the PMT task is more challenging than traditional machine translation problems.
Firstly, there is much specialized jargon in product descriptions, which is
ambiguous to translate without the product image. Secondly, product descriptions are
related to the image in more complicated ways than standard image descriptions, involving
various visual aspects such as objects, shapes, colors or even subjective styles.
Moreover, existing PMT datasets are too small in scale to support the research. In this
paper, we first construct a large-scale bilingual product description dataset called
Fashion-MMT, which contains over 114k noisy and 40k manually cleaned description translations
with multiple product images. To effectively learn semantic alignments among product
images and bilingual texts in translation, we design a unified product-oriented cross-modal
cross-lingual model for pre-training and fine-tuning. Experiments on the Fashion-MMT
and Multi30k datasets show that our model significantly outperforms the state-of-the-art
models even when pre-trained on the same dataset. It is also shown to benefit more from
large-scale noisy data to improve the translation quality. We will release the dataset
and codes at https://github.com/syuqings/Fashion-MMT.

Pre-training Graph Transformer with Multimodal Side Information for Recommendation

Yong Liu
Susen Yang
Chenyi Lei
Guoxin Wang
Haihong Tang
Juyong Zhang
Aixin Sun
Chunyan Miao

Side information of items, e.g., images and text descriptions, has been shown to be effective
in contributing to accurate recommendations. Inspired by the recent success of pre-training
models on natural language and images, we propose a pre-training strategy to learn
item representations by considering both item side information and their relationships.
We relate items by common user activities, e.g., co-purchase, and construct a homogeneous
item graph. This graph provides a unified view of item relations and their associated
side information in multimodality. We develop a novel sampling algorithm named MCNSampling
to select contextual neighbors for each item. The proposed Pre-trained Multimodal
Graph Transformer (PMGT) learns item representations with two objectives: 1) graph
structure reconstruction, and 2) masked node feature reconstruction. Experimental
results on real datasets demonstrate that the proposed PMGT model effectively exploits
the multimodality side information to achieve better accuracies in downstream tasks
including item recommendation and click-through ratio prediction. In addition, we
also report a case study of testing PMGT in an online setting with 600 thousand users.

Learning Disentangled Factors from Paired Data in Cross-Modal Retrieval: An Implicit
Identifiable VAE Approach

Minyoung Kim
Ricardo Guerrero
Vladimir Pavlovic

We tackle the problem of learning the underlying disentangled latent factors that
are shared between the paired bi-modal data in cross-modal retrieval. Typically the
data in both modalities are complex, structured, and high dimensional (e.g., image
and text), for which the conventional deep auto-encoding latent variable models such
as the Variational Autoencoder (VAE) often suffer from difficulty of accurate decoder
training or realistic synthesis. In this paper we propose a novel idea of the implicit
decoder, which completely removes the ambient data decoding module from a latent variable
model, via implicit encoder inversion that is achieved by Jacobian regularization
of the low-dimensional embedding function. Motivated by the recent Identifiable-VAE
(IVAE) model, we modify it to incorporate the query modality data as conditioning
auxiliary input, which allows us to prove that the true parameters of the model are
identifiable under some regularity conditions. Tested on various datasets where
the true factors are fully/partially available, our model is shown to identify the
factors accurately, significantly outperforming conventional latent variable models.

Progressive Graph Attention Network for Video Question Answering

Liang Peng
Shuangji Yang
Yi Bin
Guoqing Wang

Video question answering (Video-QA) is a task of answering a natural language question
related to the content of a video. Existing methods generally explore the single interactions
between objects or between frames, which are insufficient to deal with the sophisticated
scenes in videos. To tackle this problem, we propose a novel model, termed Progressive
Graph Attention Network (PGAT), which can jointly explore the multiple visual relations
on object-level, frame-level and clip-level. Specifically, in the object-level relation
encoding, we design two kinds of complementary graphs, one for learning the spatial
and semantic relations between objects from the same frame, the other for modeling
the temporal relations between the same object from different frames. The frame-level
graph explores the interactions between diverse frames to record the fine-grained
appearance change, while the clip-level graph models the temporal and semantic relations
between various actions from clips. These different-level graphs are concatenated
in a progressive manner to learn the visual relations from low-level to high-level.
Furthermore, we identify for the first time that there are serious answer biases
in TGIF-QA, a very large Video-QA dataset, and construct a new dataset based
on it, called TGIF-QA-R, to overcome these biases. We evaluate the proposed model on
three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate
that our model significantly outperforms other state-of-the-art models. Our codes
and dataset are available at https://github.com/PengLiang-cn/PGAT.

SESSION: Session 21: Media Interpretation-I

Mix-order Attention Networks for Image Restoration

Tao Dai
Yalei Lv
Bin Chen
Zhi Wang
Zexuan Zhu
Shu-Tao Xia

Convolutional neural networks (CNNs) have obtained great success in image restoration
tasks, like single image denoising, demosaicing, and super-resolution. However, most
existing CNN-based methods neglect the diversity of image contents and degradations
in the corrupted images and treat channel-wise features equally, thus hindering the
representation ability of CNNs. To address this issue, we propose deep mix-order attention
networks (MAN) to extract features that capture rich feature statistics within networks.
Our MAN is mainly built on simple residual blocks and our mix-order channel attention
(MOCA) module, which further consists of feature gating and feature pooling blocks
to capture different types of semantic information. With our MOCA, our MAN can be
flexible to handle various types of image contents and degradations. Besides, our
MAN can be generalized to different image restoration tasks, like image denoising,
super-resolution, and demosaicing. Extensive experiments demonstrate that our method
performs favorably against state-of-the-art methods in terms of quantitative and qualitative
metrics.

Vehicle Counting Network with Attention-based Mask Refinement and Spatial-awareness
Block Loss

Ji Zhang
Jian-Jun Qiao
Xiao Wu
Wei Li

Vehicle counting aims to calculate the number of vehicles in congested traffic scenes.
Although object detection and crowd counting have made tremendous progress with the
development of deep learning, vehicle counting remains a challenging task, due to
scale variations, viewpoint changes, inconsistent location distributions, diverse
visual appearances and severe occlusions. In this paper, we propose a Vehicle Counting
Network (VCNet) to alleviate the problems of scale variation and
inconsistent spatial distribution in congested traffic scenes. Specifically, VCNet
is composed of two major components: (i) To capture multi-scale vehicles across different
types and camera viewpoints, an effective multi-scale density map estimation structure
is designed by building an attention-based mask refinement module. The multi-branch
structure with hybrid dilated convolution blocks is proposed to assign receptive fields
to generate multi-scale density maps. To efficiently aggregate multi-scale density
maps, the attention-based mask refinement is designed to highlight the vehicle
regions, which enables each branch to suppress the scale interference from other branches.
(ii) In order to capture the inconsistent spatial distributions, a spatial-awareness
block loss (SBL) based on the region-weighted reward strategy is proposed to calculate
the loss of different spatial regions including sparse, congested and occluded regions
independently by dividing the density map into different regions. Extensive experiments
conducted on three benchmark datasets, TRANCOS, VisDrone2019 Vehicle and CVCSet demonstrate
that the proposed VCNet outperforms the state-of-the-art approaches in vehicle counting.
Moreover, the proposed idea is also applicable to crowd counting, where it produces competitive
results on the ShanghaiTech crowd counting dataset.
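
To illustrate why a loss computed over separate spatial regions of a density map can help, the following minimal sketch compares a region-wise counting error with a whole-image one. It is an assumption-laden illustration rather than the SBL formulation: the grid partition, the helper names `block_counts` and `block_loss`, and the optional `weights` argument are all hypothetical, and the region-weighted reward strategy mentioned in the abstract is not reproduced here.

```python
def block_counts(density, block):
    """Sum a 2D density map over non-overlapping block x block regions."""
    rows, cols = len(density), len(density[0])
    counts = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            total = 0.0
            for i in range(r, min(r + block, rows)):
                for j in range(c, min(c + block, cols)):
                    total += density[i][j]
            counts.append(total)
    return counts


def block_loss(pred, target, block, weights=None):
    """Weighted mean absolute count error over regions.

    weights: one hypothetical weight per region; uniform if omitted.
    """
    pred_counts = block_counts(pred, block)
    target_counts = block_counts(target, block)
    if weights is None:
        weights = [1.0] * len(target_counts)
    weighted = sum(
        w * abs(p - t) for w, p, t in zip(weights, pred_counts, target_counts)
    )
    return weighted / sum(weights)


# Same total count, but the vehicles are predicted in the wrong places.
predicted = [[1.0, 0.0], [0.0, 1.0]]
ground_truth = [[0.0, 1.0], [1.0, 0.0]]
whole_image_error = block_loss(predicted, ground_truth, block=2)
per_region_error = block_loss(predicted, ground_truth, block=1)
```

With a single region covering the whole map the error vanishes because the totals agree, whereas the per-region error exposes the misplaced density.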

DPT: Deformable Patch-based Transformer for Visual Recognition

Zhiyang Chen
Yousong Zhu
Chaoyang Zhao
Guosheng Hu
Wei Zeng
Jinqiao Wang
Ming Tang

Transformers have achieved great success in computer vision, but how to split an image
into patches remains a problem. Existing methods usually use a fixed-size patch embedding
which might destroy the semantics of objects. To address this problem, we propose
a new Deformable Patch (DePatch) module which learns to adaptively split the images
into patches with different positions and scales in a data-driven way rather than
using predefined fixed patches. In this way, our method can well preserve the semantics
in patches. The DePatch module can work as a plug-and-play module, which can easily
be incorporated into different transformers to achieve end-to-end training. We
term this DePatch-embedded transformer as Deformable Patch-based Transformer (DPT)
and conduct extensive evaluations of DPT on image classification and object detection.
Results show DPT can achieve 81.8% top-1 accuracy on ImageNet classification, and
43.7% box AP with RetinaNet, 44.3% with Mask R-CNN on MSCOCO object detection. Code
has been made available at: https://github.com/CASIA-IVA-Lab/DPT.

Scene Text Image Super-Resolution via Parallelly Contextual Attention Network

Cairong Zhao
Shuyang Feng
Brian Nlong Zhao
Zhijun Ding
Jun Wu
Fumin Shen
Heng Tao Shen

Optical degradation blurs text shapes and edges, so existing scene text recognition
methods have difficulties in achieving desirable results on low-resolution (LR) scene
text images acquired in real-world environments. The above problem can be solved by
efficiently extracting sequential information to reconstruct super-resolution (SR)
text images, which remains a challenging task. In this paper, we propose a Parallelly
Contextual Attention Network (PCAN), which effectively learns sequence-dependent features
and focuses more on high-frequency information of the reconstruction in text images.
Firstly, we explore the importance of sequence-dependent features in horizontal and
vertical directions parallelly for text SR, and then design a parallelly contextual
attention block to adaptively select the key information in the text sequence that
contributes to image super-resolution. Secondly, we propose a hierarchically orthogonal
texture-aware attention module and an edge guidance loss function, which can help
to reconstruct high-frequency information in text images. Finally, we conduct extensive
experiments on TextZoom dataset, and the results can be easily incorporated into mainstream
text recognition algorithms to further improve their performance in LR image recognition.
Besides, our approach exhibits great robustness in defending against adversarial attacks
on seven mainstream scene text recognition datasets, which means it can also improve
the security of the text recognition pipeline. Compared with directly recognizing
LR images, our method can respectively improve the recognition accuracy of ASTER,
MORAN, and CRNN by 14.9%, 14.0%, and 20.1%. Our method outperforms eleven state-of-the-art
(SOTA) SR methods in terms of boosting text recognition performance. Most importantly,
it outperforms the current optimal text-orient SR method TSRN by 3.2%, 3.7%, and 6.0%
on the recognition accuracy of ASTER, MORAN, and CRNN respectively.

Improving Pedestrian Detection from a Long-tailed Domain Perspective

Mengyuan Ding
Shanshan Zhang
Jian Yang

Although pedestrian detection has advanced considerably in recent years, there still exist some
challenging scenarios, such as small-scale, occlusion and low-light. Current works
usually focus on one of these scenarios independently and propose specific methods.
However, different challenges may occur simultaneously and change over
time, making a specific method infeasible in practice. Therefore we are motivated
to design a method which is able to handle various challenges and to obtain reasonable
performance across different scenarios. In this paper, we first propose Instance Domain
Compactness (IDC) to measure the difference of each instance in the feature space
and handle hard cases from a novel long-tailed domain perspective. Specifically, we
first propose a Feature Augmentation Module (FAM) to augment the tail instances in
the feature space, thereby increasing the number and diversity of tail samples. In addition,
an IDC-guided loss weighting module (IDCW) is formulated to adaptively re-weight the
loss of each sample so as to balance the optimization procedure. Extensive analysis
and experiments illustrate that our method improves the generalization of the model
without any extra parameters and achieves comparable results across different challenging
scenarios on both CityPersons and Caltech datasets.

Robust Shadow Detection by Exploring Effective Shadow Contexts

Xianyong Fang
Xiaohao He
Linbo Wang
Jianbing Shen

Effective contexts for separating shadows from non-shadow objects can appear in different
scales due to different object sizes. This paper introduces a new module, Effective-Context
Augmentation (ECA), to utilize these contexts for robust shadow detection with deep
structures. Taking regular deep features as global references, ECA enhances the discriminative
features from the parallelly computed fine-scale features and, therefore, obtains
robust features embedded with effective object contexts by boosting them. We further
propose a novel encoder-decoder style of shadow detection method where ECA acts as
the main building block of the encoder to extract strong feature representations and
the guidance to the classification process of the decoder. Moreover, the networks
are optimized with only one loss, which is easy to train and does not have the instability
caused by extra losses superimposed on the intermediate features among existing popular
studies. Experimental results show that the proposed method can effectively eliminate
fake detections. In particular, our method outperforms state-of-the-art methods and
improves by over 13.97% and 34.67% on the challenging SBU and UCF datasets, respectively,
in balance error rate.
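
For readers unfamiliar with the metric, the balance error rate (BER) averages the error rates on shadow and non-shadow pixels so that the larger class does not dominate. The sketch below uses the definition that is common in the shadow detection literature; the abstract does not spell out the formula, so treat this as an assumed definition, and the function name is illustrative.

```python
def balance_error_rate(tp, tn, fp, fn):
    """Return BER in percent from pixel-level confusion counts.

    tp, fn: shadow pixels detected / missed.
    tn, fp: non-shadow pixels correctly rejected / wrongly flagged.
    """
    shadow_recall = tp / (tp + fn)
    non_shadow_recall = tn / (tn + fp)
    return 100.0 * (1.0 - 0.5 * (shadow_recall + non_shadow_recall))


# Toy counts: 90% of shadow pixels and 80% of non-shadow pixels are correct.
ber = balance_error_rate(tp=90, tn=80, fp=20, fn=10)
```

Lower values are better, and a detector that labels every pixel as non-shadow scores 50 regardless of how rare shadows are.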

SESSION: Session 22: Doctoral Symposium

End-to-end Quality of Experience Evaluation for HTTP Adaptive Streaming

Babak Taraghi

Exponential growth in multimedia streaming traffic over the Internet motivates the
research and further investigation of the user's perceived quality of such services.
Enhancement of experienced quality by the users becomes more substantial when service
providers compete on establishing superiority by gaining more subscribers or customers.
Quality of Experience (QoE) enhancement would not be possible without an authentic
and accurate assessment of the streaming sessions. HTTP Adaptive Streaming (HAS) is
today's prevailing technique to deliver the highest possible audio and video content
quality to the users. An end-to-end evaluation of QoE in HAS covers the precise measurement
of the metrics that affect the perceived quality, e.g., startup delay, stall events,
and delivered media quality. Improving these metrics could limit the service's
scalability, which is an important factor in real-world scenarios. In this study,
we will investigate the stated metrics, best practices and evaluation methods, and
available techniques with an aim to (i) design and develop practical and scalable
measurement tools and prototypes, (ii) provide a better understanding of current technologies
and techniques (e.g., Adaptive Bitrate algorithms), (iii) conduct in-depth research
on the significant metrics in a way that improvements of QoE with scalability in mind
would be feasible, and finally (iv) provide a comprehensive QoE model which outperforms
state-of-the-art models.
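
As a rough illustration of how two of the metrics named above, startup delay and stall events, could be measured from a player log, consider the following sketch. The event names, the tuple layout, and the function `summarize_session` are assumptions made for this example; they do not describe the author's measurement tools.

```python
def summarize_session(events):
    """Compute simple QoE inputs from (timestamp_seconds, event_name) tuples.

    Assumed event names: "request", "play", "stall_start", "stall_end".
    The log is assumed to be time-ordered and to contain at least one
    "request" and one "play" event.
    """
    request_time = None
    first_play = None
    stall_started = None
    stall_count = 0
    stall_seconds = 0.0
    for timestamp, name in events:
        if name == "request" and request_time is None:
            request_time = timestamp
        elif name == "play" and first_play is None:
            first_play = timestamp
        elif name == "stall_start":
            stall_started = timestamp
            stall_count += 1
        elif name == "stall_end" and stall_started is not None:
            stall_seconds += timestamp - stall_started
            stall_started = None
    return {
        "startup_delay": first_play - request_time,
        "stall_count": stall_count,
        "stall_seconds": stall_seconds,
    }


session = [
    (0.0, "request"),
    (1.5, "play"),
    (10.0, "stall_start"),
    (12.0, "stall_end"),
    (30.0, "stall_start"),
    (30.5, "stall_end"),
]
summary = summarize_session(session)
```

A complete QoE model would also account for the delivered media quality, which this sketch leaves out.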

Generative Adversarial Network for Text-to-Face Synthesis and Manipulation

Yutong Zhou

Over the past few years, several studies have been conducted on text-to-image synthesis
techniques, which transfer input textual descriptions into realistic images. However,
facial image synthesis and manipulation from input sentences have not been widely
explored due to the lack of datasets. My research interests center around the development
of multi-modality technology and facial image generation with Generative Adversarial
Networks. Towards that end, we propose an approach for facial image generation and
manipulation from text descriptions. We also introduce the first Text-to-Face synthesis
dataset with large-scale facial attributes. In this extended abstract, we first present
the existing condition and further direction of my Ph.D. research that I have followed
during the first year. Then, we introduce the proposed method (accepted by IEEE FG2021),
annotated novel dataset and experimental results. Finally, the future outlook on other
challenges, proposed dataset and expected impact are discussed. Codes and paper lists
studied in text-to-image synthesis are summarized on https://github.com/Yutong-Zhou-cv/Awesome-Text-to-Image.

GAN-aided Serial Dependence Study in Medical Image Perception

Zhihang Ren

Medical imaging has been critically important for the health and well-being of millions
of patients. Although deep learning has been widely studied in the medical imaging area
and the performance of deep learning has exceeded human performance in certain medical
diagnostic tasks, detecting and diagnosing lesions still depends on the visual system
of human observers (radiologists), who completed years of training to scrutinize anomalies.
Routinely, radiologists sequentially read batches of medical images one after the
other. A basic underlying assumption of radiologists' precise diagnosis is that their
perceptions and decisions on a current medical image are completely independent from
the previous reading history of medical images. However, recent research proposed
that the human visual system has visual serial dependencies (VSDs) at many levels.
VSD means that what was seen in the past influences (and captures) what is seen and
reported at this moment. Our pilot data via naive artificial stimuli has shown that
VSD has a disruptive effect in radiologic searches that impairs accurate detection
and recognition of tumors or other structures. However, the naive artificial stimuli
have been noted by both untrained observers and expert radiologists to be less authentic.
In this project, we will generate authentic medical images via Generative Adversarial
Networks (GANs) in order to replace the simple stimuli in future experiments. The
rationale for the proposed research project is that once it is known how serial dependence
arises and how it impacts visual search, we can understand how to control for it.
Hence, the accuracy of diagnosis via medical imaging can significantly improve. The
specific goals of this project are to establish, identify and mitigate the impact
of VSD on visual search tasks in clinical settings.

Image Style Transfer with Generative Adversarial Networks

Ru Li

Image style transfer is a recently popular research field, which aims to learn the
mapping between different domains and involves different computer vision techniques.
Recently, Generative Adversarial Networks (GAN) have demonstrated their potential
for translating images from source domain X to target domain Y in the absence of paired
examples. However, such a translation is not guaranteed to generate results of high
perceptual quality. Although existing style transfer methods work well with relatively uniform
content, they often fail to capture geometric or structural patterns that reflect
the quality of generated images. The goal of this doctoral research is to investigate
the image style transfer approaches, and design advanced and useful methods to solve
existing problems. Through preliminary experiments conducted so far, we demonstrate
our insights on the image style translation approaches, and present the directions
to be pursued in the future.

Annotation-Efficient Semantic Segmentation with Shape Prior Knowledge

Yuhang Lu

Deep learning methods have achieved great success on semantic segmentation in recent
years. But the training typically relies on large-scale fully-annotated ground truth
masks, which are difficult to obtain in practice. In this research, we study the problem
of reducing the annotation cost of segmentation network training with a focus on exploring
the shape prior knowledge of objects. Under the context of three applications, we
study three types of shape priors. Specifically, we first exploit the implicit shape
prior of curve structures to propose a weakly supervised curve structure segmentation
method, and then explicitly formulate the shape prior of anatomical structures as
loss functions to propose a one-shot anatomical structures segmentation network. Last,
we try to generalize the shape constraint to arbitrary objects to propose a class-agnostic
few-shot segmentation framework. Experiment results show that our methods could achieve
comparable or better performance than fully supervised segmentation methods with less
annotation costs on the studied applications.

Neural-based Rendering and Application

Peng Dai

Rendering plays an important role in many fields such as virtual reality and film,
but the high dependence on computing resources and human experience hinders its application.
With the development of deep learning, neural rendering has attracted much attention
due to its impressive performance and higher efficiency compared with traditional rendering. In this
paper, we mainly introduce two neural rendering works: one on rendering simulation
and the other on image-based novel view rendering. Moreover, we also discuss the potential
applications (i.e., data augmentation) based on the results of neural rendering, which
have received little attention.

Towards Bridging Video and Language by Caption Generation and Sentence Localization

Shaoxiang Chen

Various video understanding tasks (classification, tracking, action detection, etc.)
have been extensively studied in the multimedia and computer vision communities over
the recent years. While these tasks are important, we think that bridging video and
language is a more natural and intuitive way to interact with videos. Caption generation
and sentence localization are two representative tasks for connecting video and language,
and my research is focused on these two tasks. In this extended abstract, I present
approaches for tackling each of these tasks by exploiting fine-grained information
in videos, together with ideas about how these two tasks can be connected. So far,
my work has demonstrated that these two tasks share a common foundation, and by connecting
them to form a cycle, video and language can be more closely bridged. Finally, several
challenges and future directions will be discussed.

Situational Anomaly Detection in Multimedia Data under Concept Drift

Pratibha Kumari

Anomaly detection has been a very challenging and active area of research for decades,
particularly for video surveillance. However, most of the works detect predefined
anomaly classes using static models. These frameworks have limited applicability for
real-life surveillance where the data have concept drift. Under concept drift, the
distribution of both normal and anomaly classes changes over time. An event may change
its class from anomaly to normal or vice-versa. The non-adaptive frameworks do not
handle this drift. Additionally, the focus has been on detecting local anomalies,
such as a region of an image. In contrast, in CCTV-based monitoring, flagging unseen
anomalous situations can be of greater interest. Utilizing multiple sensory information
for anomaly detection has also received less attention. This extended abstract discusses
these gaps and possible solutions.

Dynamic Knowledge Distillation with Cross-Modality Knowledge Transfer

Guangzhi Wang

Supervised learning for vision tasks has achieved great success because of the advances
of deep learning research in many areas, such as high quality datasets, network architectures
and regularization methods. In the vanilla deep learning paradigm, training a model
for visual tasks is mainly based on the provided training images and annotations.
Inspired by human learning with knowledge transfer where information from multiple
modalities is considered, we propose to improve visual tasks' performance by introducing
explicit knowledge extracted from other modalities. As the first step, we propose
to improve image classification performance by introducing linguistic knowledge as
additional constraints in model learning. This knowledge is represented as a set of
constraints to be jointly utilized with visual knowledge. To coordinate the training
dynamics, we propose to imbue our model with the ability to dynamically distill from multiple
knowledge sources. This is done via a model agnostic knowledge weighting module which
guides the learning process and updates via meta-steps during training. Preliminary
experiments on various benchmark datasets validate the efficacy of our method. Our
code will be made publicly available to ensure reproducibility.

SESSION: Session 23: Media Interpretation-II

WeClick: Weakly-Supervised Video Semantic Segmentation with Click Annotations

Peidong Liu
Zibin He
Xiyu Yan
Yong Jiang
Shu-Tao Xia
Feng Zheng
Hu Maowei

Compared with tedious per-pixel mask annotating, it is much easier to annotate data
by clicks, which costs only several seconds for an image. However, applying clicks
to learn video semantic segmentation models has not been explored before. In this work,
we propose an effective weakly-supervised video semantic segmentation pipeline with
click annotations, called WeClick, for saving laborious annotating effort by segmenting
an instance of the semantic class with only a single click. Since detailed semantic
information is not captured by clicks, directly training with click labels leads to
poor segmentation predictions. To mitigate this problem, we design a novel memory
flow knowledge distillation strategy to exploit temporal information (named memory
flow) in abundant unlabeled video frames, by distilling the neighboring predictions
to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation
for model compression. In this case, WeClick learns compact video semantic segmentation
models with the low-cost click annotations during the training phase yet achieves
real-time and accurate models during the inference period. Experimental results on
Cityscapes and Camvid show that WeClick outperforms the state-of-the-art methods,
improves performance by 10.24% mIoU over the baseline, and achieves real-time execution.
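
The "vanilla knowledge distillation" mentioned above usually refers to training a compact student to match the temperature-softened output distribution of a larger teacher. The sketch below shows that generic loss for a single pixel's class logits; it is a textbook formulation under assumed names (`softmax`, `distillation_loss`, `temperature`) and does not reproduce WeClick's memory flow distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return temperature ** 2 * kl


teacher = [1.0, 2.0, 3.0]
matching_loss = distillation_loss([1.0, 2.0, 3.0], teacher)
mismatched_loss = distillation_loss([3.0, 2.0, 1.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.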

Towards Cross-Granularity Few-Shot Learning: Coarse-to-Fine Pseudo-Labeling with Visual-Semantic
Meta-Embedding

Jinhai Yang
Hua Yang
Lin Chen

Few-shot learning aims at rapidly adapting to novel categories with only a handful
of samples at test time, which has been predominantly tackled with the idea of meta-learning.
However, meta-learning approaches essentially learn across a variety of few-shot tasks
and thus still require large-scale training data with fine-grained supervision to
derive a generalized model, thereby involving prohibitive annotation cost. In this
paper, we advance the few-shot classification paradigm towards a more challenging
scenario, i.e., cross-granularity few-shot classification, where the model observes
only coarse labels during training but is expected to perform fine-grained classification
during testing. This task largely relieves the annotation cost since fine-grained
labeling usually requires strong domain-specific expertise. To bridge the cross-granularity
gap, we approximate the fine-grained data distribution by greedy clustering of each
coarse-class into pseudo-fine-classes according to the similarity of image embeddings.
We then propose a meta-embedder that jointly optimizes the visual- and semantic-discrimination,
at both the instance level and the coarse-class level, to obtain a good feature space for this
coarse-to-fine pseudo-labeling process. Extensive experiments and ablation studies
are conducted to demonstrate the effectiveness and robustness of our approach on three
representative datasets.
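
The coarse-to-fine pseudo-labeling idea can be sketched as a greedy clustering of the embeddings that share one coarse label. The procedure below is only one possible greedy scheme, chosen for brevity: the cosine-similarity threshold, the running-mean centroids, and the function names are assumptions, and the abstract does not specify the authors' exact clustering rule.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_pseudo_labels(embeddings, threshold=0.9):
    """Assign each embedding of one coarse class to a pseudo-fine cluster.

    Each vector joins the most similar existing cluster if the similarity
    to its centroid reaches the threshold; otherwise it opens a new cluster.
    """
    centroids = []  # running mean embedding of each cluster
    counts = []     # number of members in each cluster
    labels = []
    for vec in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1
            n = counts[best]
            centroids[best] = [
                c + (x - c) / n for c, x in zip(centroids[best], vec)
            ]
            labels.append(best)
    return labels


# Four toy embeddings from a single coarse class, forming two tight groups.
coarse_class_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.98]]
pseudo_fine = greedy_pseudo_labels(coarse_class_embeddings)
```

Running this per coarse class and pairing each cluster index with its coarse label would give pseudo-fine class identifiers for training.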

Disentangled Representation Learning and Enhancement Network for Single Image De-Raining

Guoqing Wang
Changming Sun
Xing Xu
Jingjing Li
Zheng Wang
Zeyu Ma

In this paper, we present a disentangled representation learning and enhancement network
(DRLE-Net) to address the challenging single image de-raining problems, i.e., raindrop
and rain streak removal. Specifically, the DRLE-Net is formulated as a multi-task
learning framework, and an elegant knowledge transfer strategy is designed to train
the encoder of DRLE-Net to embed a rainy image into two separated latent spaces representing
the task (clean image reconstruction in this paper) relevant and irrelevant variations
respectively, such that only the essential task-relevant factors will be used by the
decoder of DRLE-Net to generate high-quality de-raining results. Furthermore, visual
attention information is modeled and fed into the disentangled representation learning
network to enhance the task-relevant factor learning. To facilitate the optimization
of the hierarchical network, a new adversarial loss formulation is proposed and used
together with the reconstruction loss to train the proposed DRLE-Net. Extensive experiments
are carried out for removing raindrops or rain streaks from both synthetic and real
rainy images, and DRLE-Net is demonstrated to produce significantly better results
than state-of-the-art models.

Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal

Lei Zhu
Zhaojing Luo
Wei Wang
Meihui Zhang
Gang Chen
Kaiping Zheng

Deep learning has made a tremendous impact on various applications in multimedia,
such as media interpretation and multimodal retrieval. However, deep learning models
usually require a large amount of labeled data to achieve satisfactory performance.
In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge
transfer from a label-rich source domain to a label-scarce target domain, thus potentially
alleviating the annotation requirement for deep learning models. However, we find that
contemporary domain adaptation methods for cross-domain image understanding perform
poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies
the domain adaptation problem under the scenario where source data can be noisy. Prior
methods on WSDA remove noisy source data and align the marginal distribution across
domains without considering the fine-grained semantic structure in the embedding space,
which leads to class misalignment, e.g., features of cats in the target
domain might be mapped near features of dogs in the source domain. In this paper,
we propose a novel method, termed Noise Tolerant Domain Adaptation (NTDA), for WSDA.
Specifically, we adopt the cluster assumption and learn clusters discriminatively with
class prototypes (centroids) in the embedding space. We propose to leverage the location
information of the data points in the embedding space and model the location information
with a Gaussian mixture model to identify noisy source data. We then design a network
which incorporates the Gaussian mixture noise model as a sub-module for unsupervised
noise removal and propose a novel cluster-level adversarial adaptation method based
on the Generative Adversarial Network (GAN) framework which aligns unlabeled target
data with the less noisy class prototypes for mapping the semantic structure across
domains. Finally, we devise a simple and effective algorithm to train the network
from end to end. We conduct extensive experiments to evaluate the effectiveness of
our method on both general images and medical images from COVID-19 and e-commerce
datasets. The results show that our method significantly outperforms state-of-the-art
WSDA methods.

Exploiting BERT for Multimodal Target Sentiment Classification through Input Space
Translation

Zaid Khan
Yun Fu

Multimodal target/aspect sentiment classification combines multimodal sentiment analysis
and aspect/target sentiment classification. The goal of the task is to combine vision
and language to understand the sentiment towards a target entity in a sentence. Twitter
is an ideal setting for the task because it is inherently multimodal, highly emotional,
and affects real-world events. However, multimodal tweets are short and accompanied
by complex, possibly irrelevant images. We introduce a two-stream model that translates
images in input space using an object-aware transformer followed by a single-pass
non-autoregressive text generation approach. We then leverage the translation to construct
an auxiliary sentence that provides multimodal information to a language model. Our
approach increases the amount of text available to the language model and distills
the object-level information in complex images. We achieve state-of-the-art performance
on two multimodal Twitter datasets without modifying the internals of the language
model to accept multimodal data, demonstrating the effectiveness of our translation.
In addition, we explain a failure mode of a popular approach for aspect sentiment
analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.

Video Representation Learning with Graph Contrastive Augmentation

Jingran Zhang
Xing Xu
Fumin Shen
Yazhou Yao
Jie Shao
Xiaofeng Zhu

Contrastive-based self-supervised learning for image representations has significantly
closed the gap with supervised learning. A natural extension of image-based contrastive
learning methods to the video domain is to fully exploit the temporal structure present
in videos. We propose a novel contrastive self-supervised video representation learning
framework, termed Graph Contrastive Augmentation (GCA), by constructing a video temporal
graph and devising a graph augmentation that is designed to enhance the correlation
across video frames and to develop a new view for exploring temporal structure
in videos. Specifically, we construct the temporal graph in the video by leveraging
the relational knowledge behind the correlated sequence video features. Afterwards,
we apply the proposed graph augmentation to generate another graph view by incorporating
random corruption of the original graph to enhance the diversity of the intrinsic
structure of the temporal graph. To this end, we provide two different kinds of contrastive
learning methods to train our framework using temporal relationships concealed in
videos as self-supervised signals. We perform empirical experiments on downstream
tasks, action recognition and video retrieval, using the learned video representation,
and the results demonstrate that, with the graph view of temporal structure, our proposed
GCA performs remarkably better than, or on par with, recent methods.

SESSION: Poster Session 4

An EM Framework for Online Incremental Learning of Semantic Segmentation

Shipeng Yan
Jiale Zhou
Jiangwei Xie
Songyang Zhang
Xuming He

Incremental learning of semantic segmentation has emerged as a promising strategy
for visual scene interpretation in the open-world setting. However, it remains challenging
to acquire novel classes in an online fashion for the segmentation task, mainly due
to its continuously-evolving semantic label space, partial pixelwise ground-truth
annotations, and constrained data availability. To address this, we propose an incremental
learning strategy that can quickly adapt deep segmentation models without catastrophic
forgetting, using streaming input data with pixel annotations on the novel classes
only. To this end, we develop a unified learning strategy based on the Expectation-Maximization
(EM) framework, which integrates an iterative relabeling strategy that fills in the
missing labels and a rehearsal-based incremental learning step that balances the stability-plasticity
of the model. Moreover, our EM algorithm adopts an adaptive sampling method to select
informative training data and a class-balancing training strategy in the incremental
model updates, both improving the efficacy of model learning. We validate our approach
on the PASCAL VOC 2012 and ADE20K datasets, and the results demonstrate its superior
performance over the existing incremental methods.

I2V-GAN: Unpaired Infrared-to-Visible Video Translation

Shuang Li
Bingfeng Han
Zhenjie Yu
Chi Harold Liu
Kai Chen
Shuigen Wang

Human vision is often adversely affected by complex environmental factors, especially
in night vision scenarios. Thus, infrared cameras are often leveraged to help enhance
the visual effects via detecting infrared radiation in the surrounding environment,
but the infrared videos are undesirable due to the lack of detailed semantic information.
In such cases, an effective video-to-video translation method from the infrared domain
to its visible light counterpart is strongly needed to overcome the large intrinsic
gap between the infrared and visible domains. To address this challenging problem, we propose
an infrared-to-visible (I2V) video translation method, I2V-GAN, which generates fine-grained
and spatio-temporally consistent visible light videos from unpaired infrared videos.
Technically, our model capitalizes on three types of constraints: 1) adversarial constraint
to generate synthetic frames that are similar to the real ones, 2) cyclic consistency
with the introduced perceptual loss for effective content conversion as well as style
preservation, and 3) similarity constraints across and within domains to enhance the
content and motion consistency in both spatial and temporal spaces at a fine-grained
level. Furthermore, the currently publicly available infrared and visible light datasets
are mainly used for object detection or tracking, and some are composed of discontinuous
images, which are not suitable for video tasks. Thus, we provide a new dataset for
infrared-to-visible video translation, named IRVI. Specifically, it has 12
consecutive video clips of vehicle and monitoring scenes, and both the infrared and visible
light videos can be split into 24352 frames. Comprehensive experiments on IRVI validate
that I2V-GAN is superior to the compared state-of-the-art methods in the translation
of infrared-to-visible videos with higher fluency and finer semantic details. Moreover,
additional experimental results on the flower-to-flower dataset indicate I2V-GAN is
also applicable to other video translation tasks. The code and IRVI dataset are available
at https://github.com/BIT-DA/I2V-GAN.

Implicit Feedbacks are Not Always Favorable: Iterative Relabeled One-Class Collaborative
Filtering against Noisy Interactions

Zitai Wang
Qianqian Xu
Zhiyong Yang
Xiaochun Cao
Qingming Huang

Due to privacy concerns, the recommender systems community increasingly favors
the One-class Collaborative Filtering (OCCF) framework, which predicts user preferences
based only on binary implicit feedback (e.g., click or not-click, rated or unrated).
The major challenge in the OCCF problem stems from the inherent noise in implicit interactions.
Previous approaches have taken into account the noise in unobserved interactions (i.e.,
not-click only means a missing value, rather than negative feedback). However, they
generally ignore the noise in observed interactions (i.e., click does not necessarily
represent positive feedback), which might induce performance degradation. To address
this issue, we propose a novel iterative relabeling framework to jointly mitigate
the noise in both observed and unobserved interactions. As the core of the framework,
the iterative relabeling module exploits the self-training principle to dynamically
generate pseudo labels for user preferences. The downstream module for a recommendation
task is then trained with the refreshed labels where the noisy patterns are largely
alleviated. Finally, extensive experiments on three real-world datasets demonstrate
the effectiveness of our proposed methods.

InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation

Dahu Shi
Xing Wei
Xiaodong Yu
Wenming Tan
Ye Ren
Shiliang Pu

Multi-person pose estimation is an attractive and challenging task. Existing methods
are mostly based on two-stage frameworks, which include top-down and bottom-up methods.
Two-stage methods either suffer from high computational redundancy due to additional
person detectors or need to group keypoints heuristically after predicting all
the instance-agnostic keypoints. The single-stage paradigm aims to simplify the multi-person
pose estimation pipeline and has received a lot of attention. However, recent single-stage
methods have the limitation of low performance due to the difficulty of regressing
various full-body poses from a single feature vector. Different from previous solutions
that involve complex heuristic designs, we present a simple yet effective solution
by employing instance-aware dynamic networks. Specifically, we propose an instance-aware
module to adaptively adjust (part of) the network parameters for each instance. Our
solution can significantly increase the capacity and adaptability of the network
for recognizing various poses, while maintaining a compact end-to-end trainable pipeline.
Extensive experiments on the MS-COCO dataset demonstrate that our method achieves
significant improvement over existing single-stage methods, and strikes a better balance
between accuracy and efficiency than state-of-the-art two-stage approaches.

Implicit Feature Refinement for Instance Segmentation

Lufan Ma
Tiancai Wang
Bin Dong
Jiangpeng Yan
Xiu Li
Xiangyu Zhang

We propose a novel implicit feature refinement module for high-quality instance segmentation.
Existing image/video instance segmentation methods rely on explicitly stacked convolutions
to refine instance features before the final prediction. In this paper, we first give
an empirical comparison of different refinement strategies, which reveals that the
widely used four consecutive convolutions are not necessary. As an alternative, weight-sharing
convolution blocks provide competitive performance. When such a block is iterated
infinitely many times, the block output eventually converges to an equilibrium state.
Based on this observation, the implicit feature refinement (IFR) is developed by constructing
an implicit function. The equilibrium state of instance features can be obtained by
fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several
advantages: 1) it simulates an infinite-depth refinement network while requiring only
the parameters of a single residual block; 2) it produces high-level equilibrium instance features
with a global receptive field; 3) it serves as a plug-and-play general module that is easily extended
to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks
show that our IFR achieves improved performance on state-of-the-art image/video instance
segmentation frameworks, while reducing the parameter burden (e.g., 1% AP improvement
on Mask R-CNN with only 30.0% of the parameters in the mask head). Code will be made available
at https://github.com/lufanma/IFR.git.
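
The equilibrium idea in this abstract can be illustrated with a toy fixed-point iteration. The sketch below is not the authors' IFR module: it assumes a hypothetical update z <- tanh(W z + x) on a two-dimensional vector as a stand-in for the weight-sharing block, with weights chosen small enough that the update is a contraction and therefore converges.

```python
import math

def shared_block(z, x, weights):
    """One application of a toy weight-sharing block: tanh(W z + x)."""
    return [
        math.tanh(sum(w * zj for w, zj in zip(row, z)) + xi)
        for row, xi in zip(weights, x)
    ]

def solve_equilibrium(x, weights, tol=1e-9, max_iter=1000):
    """Apply the same block repeatedly until the state stops changing."""
    z = [0.0] * len(x)
    for step in range(1, max_iter + 1):
        z_next = shared_block(z, x, weights)
        change = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if change < tol:
            return z, step
    return z, max_iter

# Hypothetical weights: absolute row sums are below 1, so the map contracts.
W = [[0.2, -0.1], [0.05, 0.3]]
x = [0.5, -0.25]
z_star, steps = solve_equilibrium(x, W)
```

Applying `shared_block` once more to `z_star` leaves it essentially unchanged, which is what "equilibrium state" means here; only one set of block parameters is ever stored, however many iterations are run.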

Question-controlled Text-aware Image Captioning

Anwen Hu
Shizhe Chen
Qin Jin

For an image with multiple scene texts, different people may be interested in different
text information. Current text-aware image captioning models are not able to generate
distinctive captions according to various information needs. To explore how to generate
personalized text-aware captions, we define a new challenging task, namely Question-controlled
Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this
task requires models to understand questions, find related scene texts and describe
them together with objects fluently in human language. Based on two existing text-aware
captioning datasets, we automatically construct two datasets, ControlTextCaps and
ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware
Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level
object features and region-level scene text features while considering spatial relationships.
Then, we design a Question-guided Encoder to select the most relevant visual features
for each question. Finally, GQAM generates a personalized text-aware caption with
a Multimodal Decoder. Our model achieves better captioning performance and question
answering ability than carefully designed baselines on both datasets. With questions
as control signals, our model generates more informative and diverse captions than
the state-of-the-art text-aware captioning model. Our code and datasets are publicly
available at https://github.com/HAWLYQ/Qc-TextCap.

Style-Aware Image Recommendation for Social Media Marketing

Yiwei Zhang
Toshihiko Yamasaki

Social media have become a popular platform for brands to allocate marketing budget
and build their relationship with customers. Posting images with a consistent concept
on social media helps customers recognize, remember, and consider brands. This strategy
is known as brand concept consistency in marketing literature. Consequently, brands
spend immense manpower and financial resources in choosing which images to post or
repost. Therefore, automatically recommending images with a consistent brand concept
is a necessary task for social media marketing. In this paper, we propose a content-based
recommendation system that learns the concept of brands and recommends images that
are coherent with the brand. Specifically, the brand representation is learned from
the brand's posts on social media. Existing methods rely on visual features extracted
by pre-trained neural networks, which can represent objects in the image but not the
style of the image. To bridge this gap, a framework using both object and style vectors
as input is proposed to learn the brand representation. In addition, we show that
the proposed method can be applied not only to brands but also to influencers.
We collected a new Instagram influencer dataset, consisting of 616 influencers and
about 1 million images, which can greatly benefit future research in this area. The
experimental results on two large-scale Instagram datasets show the superiority of
the proposed method over state-of-the-art methods.

WePerson: Learning a Generalized Re-identification Model from All-weather Virtual
Data

He Li
Mang Ye
Bo Du

The aim of person re-identification (Re-ID) is to retrieve a person of interest across
multiple non-overlapping cameras. Re-ID has advanced significantly
in recent years. However, real data annotation is costly and model generalization
ability is hindered by the lack of large-scale and diverse data. To address this problem,
we propose a Weather Person pipeline that can generate a synthesized Re-ID dataset
with different weather, scenes, and natural lighting conditions automatically. The
pipeline is built on top of a game engine, which contains a digital city, a weather
and lighting simulation system, and various character models with diverse clothing.
To train a generalizable Re-ID model from the large-scale virtual WePerson dataset,
we design an adaptive sample selection strategy to close the domain gap and avoid
redundancy. We also design an informative sampling method for a mini-batch sampler
to accelerate the learning process. In addition, an efficient training method is introduced
by adopting instance normalization to capture identity invariant components from various
appearances. We evaluate our pipeline using direct transfer on 3 widely-used real-world
benchmarks, achieving competitive performance without training on any real-world images.
This dataset is a first attempt to evaluate diverse environmental factors in a controllable
virtual engine, which provides important guidance for future generalizable Re-ID model
design. Notably, we improve the current state-of-the-art accuracy from 38.5% to 46.4%
on the challenging MSMT17 dataset. Dataset and code are available at https://github.com/lihe404/WePerson.
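
Instance normalization, mentioned above as the means of capturing identity-invariant components, rescales each channel of each image by that channel's own statistics. The sketch below shows only this standard operation on a toy single-channel image; it is not the authors' training method, and the pixel values are hypothetical.

```python
import math

def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one image by its own mean and variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]

# Hypothetical pixel values: the same pattern under bright and dark lighting.
bright = [[10.0, 20.0], [30.0, 40.0]]
dark = [[1.0, 2.0], [3.0, 4.0]]

norm_bright = instance_norm(bright)
norm_dark = instance_norm(dark)
```

After normalization the two versions are nearly identical, which illustrates why this operation can suppress appearance changes such as lighting while keeping the underlying pattern.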

Polar Ray: A Single-stage Angle-free Detector for Oriented Object Detection in Aerial
Images

Shuai Liu
Lu Zhang
Shuai Hao
Huchuan Lu
You He

Oriented bounding boxes are widely used for object detection in aerial images. Existing
oriented object detection methods typically follow the general object detection paradigm
by adding an extra rotation angle to the horizontal bounding boxes. However, angular
periodicity makes angle regression difficult and makes the bounding boxes sensitive to
rotation. In this paper, we propose a new anchor-free oriented object detector,
Polar Ray Network (PRNet), where object keypoints are represented by polar coordinates
without angle regression. Our PRNet learns a set of polar rays from the object center
to the boundary at predefined, equally spaced angles. We introduce a dynamic PointConv
module to optimize the regression of the polar rays by incorporating object corner features.
Furthermore, a classification feature guidance module is presented to improve the
classification accuracy by incorporating more spatial contents from polar rays. Experimental
results on two public datasets, i.e., DOTA and HRSC2016, demonstrate that the proposed
PRNet significantly outperforms existing anchor-free detectors, and is highly competitive
with state-of-the-art two-stage anchor-based methods.
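
To make the polar-ray representation concrete, the sketch below converts a list of ray lengths into boundary points around an object center. It is an illustration, not PRNet's decoding code: it assumes the predefined angles are uniformly spaced over a full circle starting at zero, which the abstract does not state, and the center and ray lengths are hypothetical.

```python
import math

def decode_polar_rays(center, ray_lengths):
    """Convert ray lengths at equally spaced angles into boundary points."""
    cx, cy = center
    n = len(ray_lengths)
    points = []
    for k, r in enumerate(ray_lengths):
        theta = 2.0 * math.pi * k / n  # fixed angle for ray k, never regressed
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

# Hypothetical prediction: four rays from the center (10, 20).
boundary = decode_polar_rays(center=(10.0, 20.0), ray_lengths=[4.0, 2.0, 4.0, 2.0])
```

Because the angles are fixed in advance, only the ray lengths need to be predicted, which is how such a representation avoids regressing a periodic angle.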

Self-Contrastive Learning with Hard Negative Sampling for Self-supervised Point Cloud
Learning

Bi'an Du
Xiang Gao
Wei Hu
Xin Li

Point clouds have attracted increasing attention. Significant progress has been made
in methods for point cloud analysis, which often requires costly human annotation
as supervision. To address this issue, we propose a novel self-contrastive learning method
for self-supervised point cloud representation learning, aiming to capture both local
geometric patterns and nonlocal semantic primitives based on the nonlocal self-similarity
of point clouds. The contributions are two-fold: on the one hand, instead of contrasting
among different point clouds as commonly employed in contrastive learning, we exploit
self-similar point cloud patches within a single point cloud as positive samples and
the remaining patches as negative ones to facilitate the task of contrastive learning. On the other
hand, we actively learn hard negative samples that are close to positive samples for
discriminative feature learning; these are sampled conditionally on each anchor patch
by leveraging the degree of self-similarity. Experimental results show that the proposed
method achieves state-of-the-art performance on widely used benchmark datasets for
self-supervised point cloud segmentation and transfer learning for classification.

Generally Boosting Few-Shot Learning with HandCrafted Features

Yi Zhang
Sheng Huang
Fengtao Zhou

Existing Few-Shot Learning (FSL) methods predominantly focus on developing different
types of sophisticated models to extract the transferable prior knowledge for recognizing
novel classes, while paying little attention to the feature learning part of
FSL, which often simply uses a well-known CNN as the feature learner. However,
features are the core medium for encoding such transferable knowledge. Feature learning
easily over-fits, particularly when training data are scarce,
and thereby degrades the performance of FSL. Handcrafted features, such
as Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP), place no requirement
on the amount of training data and used to perform quite well in many small-scale
data scenarios, since their extraction involves no learning process and is mainly
based on empirically observed and summarized prior feature engineering knowledge.
In this paper, we intend to develop a general and simple approach for generally boosting
FSL via exploiting such prior knowledge in the feature learning phase. To this end,
we introduce two novel handcrafted feature regression modules, namely HOG and LBP
regression, to the feature learning parts of deep learning-based FSL models. These
two modules are separately plugged into different convolutional layers of the backbone,
based on the characteristics of the corresponding handcrafted features, to guide the
backbone optimization at different feature granularities and to ensure that the
learned features encode the handcrafted feature knowledge, which improves the generalization
ability of the features and alleviates over-fitting of the models. Three recent state-of-the-art
FSL approaches are leveraged for examining the effectiveness of our method. Extensive
experiments on the miniImageNet, CIFAR-FS and FC100 datasets show that applying our method
boosts the performance of all these FSL approaches on all three
datasets. Our code and models have been released.

ROECS: A Robust Semi-direct Pipeline Towards Online Extrinsics Correction of the Surround-view
System

Tianjun Zhang
Nlong Zhao
Ying Shen
Xuan Shao
Lin Zhang
Yicong Zhou

Generally, a surround-view system (SVS), which is an indispensable component of advanced
driver assistance systems (ADAS), consists of four to six wide-angle fisheye cameras.
As long as both intrinsics and extrinsics of all cameras have been calibrated, a top-down
surround-view with the real scale can be synthesized at runtime from fisheye images
captured by these cameras. However, when the vehicle is driving on the road, relative
poses between cameras in the SVS may change from the initial calibrated states due
to bumps or collisions. If the extrinsics' representations are not adjusted
accordingly, obvious geometric misalignment will appear in the surround-view. Currently,
research on correcting the extrinsics of the SVS in an online manner is quite
sporadic, and a mature and robust pipeline is still lacking. As an attempt to fill
this research gap to some extent, in this work, we present a novel extrinsics correction
pipeline designed specifically for the SVS, namely ROECS (Robust Online Extrinsics Correction
of the Surround-view system). Specifically, a "refined bi-camera error" model is first
designed. Then, by minimizing the overall "bi-camera error" within a sparse and semi-direct
framework, the SVS's extrinsics can be iteratively optimized and eventually become accurate.
Besides, an innovative three-step pixel selection strategy is also proposed. The superior
robustness and the generalization capability of ROECS are validated by both quantitative
and qualitative experimental results. To make the results reproducible, the collected
data and the source code have been released at https://cslinzhang.github.io/ROECS/.

Pseudo Graph Convolutional Network for Vehicle ReID

Wen Qian
Zhiqun He
Silong Peng
Chen Chen
Wei Wu

Image-based Vehicle ReID methods have suffered from limited information caused by
viewpoints, illumination, and occlusion as they usually use a single image as input.
Graph convolutional network (GCN) methods can alleviate the aforementioned problem by aggregating
neighbor samples' information to enhance the feature representation. However, the
inference process of GCN-based methods is computationally expensive, since
they need to iterate over all samples to search for the neighbor nodes. In this paper,
we propose the first Pseudo-GCN Vehicle ReID method (PGVR), which enables a CNN-based
module to perform competitively with GCN-based methods while having a faster and more lightweight
inference process. To enable the Pseudo-GCN mechanism, a two-branch network and a
graph-based knowledge distillation are proposed. The two-branch network consists of
a CNN-based student branch and a GCN-based teacher branch. The GCN-based teacher branch
adopts a ReID-based GCN to learn the topological optimization ability under the supervision
of ReID tasks during training time. Moreover, the graph-based knowledge distillation
explicitly transfers the topological optimization ability from the teacher branch
to the student branch which acknowledges all nodes. We evaluate our proposed method
PGVR on three mainstream Vehicle ReID benchmarks and demonstrate that PGVR achieves
state-of-the-art performance.

Towards Fast and High-Quality Sign Language Production

Wencan Huang
Wenwen Pan
Zhou Zhao
Qi Tian

Sign Language Production (SLP) aims to automatically translate a spoken language description
to its corresponding sign language video. The core procedure of SLP is to transform
sign gloss intermediaries into sign pose sequences (G2P). Most existing methods for
G2P are based on sequential autoregression or sequence-to-sequence encoder-decoder
learning. However, by generating target pose frames conditioned on the previously
generated ones, these models are prone to issues such as error accumulation
and high inference latency. In this paper, we argue that such issues are mainly caused
by the autoregressive manner of generation. Hence, we propose a novel Non-AuToregressive (NAT)
model with a parallel decoding scheme, as well as an External Aligner for sequence
alignment learning. Specifically, we extract alignments from the external aligner
by monotonic alignment search for gloss duration prediction, which is used by a length
regulator to expand the source gloss sequence to match the length of the target sign
pose sequence for parallel sign pose generation. Furthermore, we devise a spatial-temporal
graph convolutional pose generator in the NAT model to generate smoother and more
natural sign pose sequences. Extensive experiments conducted on the PHOENIX14T dataset
show that our proposed model outperforms state-of-the-art autoregressive models in
terms of speed and quality.
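
The length regulator described above can be sketched in a few lines. This is an illustrative stand-in rather than the authors' implementation: it assumes each gloss comes with an integer duration in frames, and the gloss tokens and durations shown are hypothetical.

```python
def length_regulate(gloss_features, durations):
    """Repeat each gloss feature by its predicted duration in frames."""
    if len(gloss_features) != len(durations):
        raise ValueError("one duration is required per gloss")
    expanded = []
    for feature, duration in zip(gloss_features, durations):
        # A non-positive duration contributes no frames.
        expanded.extend([feature] * max(0, int(duration)))
    return expanded

glosses = ["HELLO", "WEATHER", "TOMORROW"]  # hypothetical gloss tokens
durations = [2, 3, 1]                       # hypothetical predicted frame counts
frames = length_regulate(glosses, durations)
```

The expanded sequence has one entry per target frame, so every pose frame can be generated in parallel instead of being conditioned on previously generated frames.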

Effective De-identification Generative Adversarial Network for Face Anonymization

Zhenzhong Kuang
Huigui Liu
Jun Yu
Aikui Tian
Lei Wang
Jianping Fan
Noboru Babaguchi

The growing application of face images and modern AI technology has raised another
important concern in privacy protection. In many real scenarios like scientific research,
social sharing and commercial application, lots of images are released without privacy
processing to protect people's identity. In this paper, we develop a novel effective
de-identification generative adversarial network (DeIdGAN) for face anonymization
by seamlessly replacing a given face image with a different synthesized yet realistic
one. Our approach consists of two steps. First, we anonymize the input face to obfuscate
its original identity. Then, we use our designed de-identification generator to synthesize
an anonymized face. During the training process, we leverage a pair of identity-adversarial
discriminators to explicitly constrain identity protection by pushing the synthesized
face away from the predefined sensitive faces to resist re-identification and identity
invasion. Finally, we validate the effectiveness of our approach on public datasets.
Compared with existing methods, our approach not only achieves better identity
protection rates but also preserves superior image quality and data reusability, which
indicates state-of-the-art performance.

Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace
Learning

Ricardo Guerrero
Hai X. Pham
Vladimir Pavlovic

Computational food analysis (CFA) naturally requires multi-modal evidence of a particular
food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal
shared representation learning, which aims to create a joint representation of the
multiple views (text and image) of the data. In this work we propose a method for
food domain cross-modal shared representation learning that preserves the vast semantic
richness present in the food data. Our proposed method employs an effective transformer-based
multilingual recipe encoder coupled with a traditional image embedding architecture.
Here, we propose the use of imperfect multilingual translations to effectively regularize
the model while at the same time adding functionality across multiple languages and
alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation
learned via the proposed method significantly outperforms the current state of the art
(SOTA) on retrieval tasks. Furthermore, the representational power of the learned
representation is demonstrated through a generative food image synthesis model conditioned
on recipe embeddings. Synthesized images can effectively reproduce the visual appearance
of paired samples, indicating that the learned representation captures the joint semantics
of both the textual recipe and its visual content, thus narrowing the modality gap.

When Face Completion Meets Irregular Holes: An Attributes Guided Deep Inpainting Network

Jie Xiao
Dandan Zhan
Haoran Qi
Zhi Jin

Many convolutional neural network (CNN)-based methods have been proposed for
face completion with regular holes. However, in practical applications, irregular
holes are more common. Moreover, due to the distinct attributes and large variation
in appearance of human faces, it is more challenging to fill irregular holes in face
images while keeping the content consistent with the remaining region. Since facial attributes
(e.g., gender, smiling, pointy nose, etc.) allow for a more understandable description
of one face, they can provide some hints that benefit the face completion task. In
this work, we propose a novel attributes-guided face completion network (AttrFaceNet),
which comprises a facial attribute prediction subnet and a face completion subnet.
The attribute prediction subnet predicts facial attributes from the remaining parts of
the corrupted images and guides the face completion subnet to fill the missing regions.
The proposed AttrFaceNet is evaluated in an end-to-end way on commonly used datasets
CelebA and Helen. Extensive experimental results show that our method outperforms
state-of-the-art methods both qualitatively and quantitatively, especially for large
masks. Code is available at https://github.com/FVL2020/AttrFaceNet.

Non-Linear Fusion for Self-Paced Multi-View Clustering

Zongmo Huang
Yazhou Ren
Xiaorong Pu
Lifang He

With the growth of multimedia and multi-modal data, multi-view clustering (MVC)
has drawn increasing attention recently. In this field, one of the most crucial challenges
is that the characteristics and qualities of different views usually vary extensively.
Therefore, it is essential for MVC methods to find an effective approach that handles
the diversity of multiple views appropriately. To this end, a series of MVC methods
focusing on how to integrate the loss from each view have been proposed in the past
few years. Among these methods, the mainstream idea is assigning weights to each view
and then combining them linearly. In this paper, inspired by the effectiveness of
non-linear combination in instance learning and the auto-weighted approaches, we propose
Non-Linear Fusion for Self-Paced Multi-View Clustering (NSMVC), which differs fundamentally
from the conventional linear-weighting algorithms. In NSMVC, we directly assign
different exponents to different views according to their qualities. In this way,
the negative impact of corrupt views can be significantly reduced. Meanwhile,
to address the non-convex issue of the MVC model, we further define a novel regularizer-free
modality of Self-Paced Learning (SPL), which fits the proposed non-linear model perfectly.
Experimental results on various real-world data sets demonstrate the effectiveness
of the proposed method.

Counterfactual Debiasing Inference for Compositional Action Recognition

Pengzhan Sun
Bo Wu
Xunsong Li
Wen Li
Lixin Duan
Chuang Gan

Compositional action recognition is a novel challenge in the computer vision community
and focuses on revealing the different combinations of verbs and nouns instead of
treating subject-object interactions in videos as individual instances only. Existing
methods tackle this challenging task by simply ignoring appearance information or
fusing object appearances with dynamic instance tracklets. However, those strategies
usually do not perform well for unseen action instances. To this end, in this work we
propose a novel learning framework called Counterfactual Debiasing Network (CDN) to
improve the model generalization ability by removing the interference introduced by
visual appearances of objects/subjects. It explicitly learns the appearance information
in action representations and later removes the effect of such information in a causal
inference manner. Specifically, we use tracklets and video content to model the factual
inference by considering both appearance information and structure information. In
contrast, only video content with appearance information is leveraged in the counterfactual
inference. With the two inferences, we construct a causal graph which captures and removes
the bias introduced by the appearance information by subtracting the result of the
counterfactual inference from that of the factual inference. By doing that, our proposed
CDN method can better recognize unseen action instances by debiasing the effect of
appearances. Extensive experiments on the Something-Else dataset clearly show the
effectiveness of our proposed CDN over existing state-of-the-art methods.
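
The subtraction step described in this abstract can be illustrated with a minimal sketch. It assumes that the factual branch (tracklets plus video content) and the counterfactual branch (video appearance only) each output one score per action class. The function names and the numeric scores are hypothetical and are not taken from the paper.

```python
def debias(factual_scores, counterfactual_scores):
    """Subtract appearance-only scores from the full (factual) scores, class by class."""
    if len(factual_scores) != len(counterfactual_scores):
        raise ValueError("both branches must score the same set of classes")
    return [f - c for f, c in zip(factual_scores, counterfactual_scores)]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for three action classes.
factual = [2.0, 1.5, 0.2]          # structure + appearance
counterfactual = [1.8, 0.3, 0.1]   # appearance only

debiased = debias(factual, counterfactual)
biased_choice = predict(factual)      # class 0, driven mostly by appearance
debiased_choice = predict(debiased)   # class 1 once the appearance effect is removed
```

In this toy case the appearance-only branch explains almost all of the score of class 0, so removing it changes the prediction to class 1.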

STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition

Yuhan Zhang
Bo Wu
Wen Li
Lixin Duan
Chuang Gan

Skeleton-based action recognition has been widely investigated owing to its strong
adaptability to dynamic circumstances and complicated backgrounds. To recognize different
actions from skeleton sequences, it is essential to model the posture
of the human represented by the skeleton and its changes in the temporal dimension.
However, most existing works treat skeleton sequences in the temporal and spatial
dimensions in the same way, ignoring the difference between the two dimensions
in skeleton data, which is not an optimal way to model skeleton sequences.
We propose to model the posture represented by the skeleton in each frame individually,
while also capturing the movement of the entire skeleton in the temporal dimension.
To this end, we design a Spatial Transformer Block and a Directional Temporal Transformer
Block for modeling skeleton sequences in the spatial and temporal dimensions, respectively.
Owing to occlusion, sensors, raw video, etc., there is noise in both the temporal and spatial
dimensions of the extracted skeleton data, which reduces the recognition capabilities of
models. To adapt to this imperfect information, we propose a multi-task
self-supervised learning method that provides confusing samples in different situations
to improve the robustness of our model. Combining the above designs, we propose our
Spatial-Temporal Specialized Transformer (STST) and conduct experiments with our model
on the SHREC, NTU-RGB+D, and Kinetics-Skeleton datasets. Extensive experimental results
and analysis demonstrate the improved performance of the proposed method.

Exploring Gradient Flow Based Saliency for DNN Model Compression

Xinyu Liu
Baopu Li
Zhen Chen
Yixuan Yuan

Model pruning aims to reduce the deep neural network (DNN) model size or computational
overhead. Traditional model pruning methods that evaluate channel significance in
a DNN, such as L1 pruning, pay too much attention to the local analysis of each
channel and rely on the magnitude of the entire feature while ignoring its relevance
to the batch normalization (BN) and ReLU layers after each convolutional operation.
To overcome these problems, we propose a model pruning method from the new perspective
of gradient flow. Specifically, we first theoretically analyze each channel's
influence based on a Taylor expansion that integrates the effects of the BN layer and the ReLU
activation function. Then, we suggest using the first-order Taylor polynomial
of the scaling parameter and the shifting parameter in the BN layer to
effectively indicate the significance of a channel in a DNN. Comprehensive experiments
on both image classification and image denoising tasks demonstrate the superiority
of the proposed novel theory and scheme. Code is available at https://github.com/CityU-AIM-Group/GFBS.
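
One way to read the first-order Taylor criterion is sketched below. It assumes that a channel's score is the absolute value of the first-order term in the BN scaling parameter (gamma) and shifting parameter (beta), with gradients of the loss supplied by the caller. The exact form, the function names, and the numbers are assumptions for illustration; the released code at the URL above is the authoritative reference.

```python
def channel_scores(gamma, beta, grad_gamma, grad_beta):
    """First-order Taylor score per channel: |gamma * dL/dgamma + beta * dL/dbeta|."""
    return [abs(g * dg + b * db)
            for g, b, dg, db in zip(gamma, beta, grad_gamma, grad_beta)]


def keep_mask(scores, prune_ratio):
    """Mark the lowest-scoring fraction of channels for removal (False = pruned)."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(scores))]


# Hypothetical BN parameters and gradients for a layer with four channels.
gamma = [1.0, 0.5, 2.0, 0.25]
beta = [0.0, 0.5, -1.0, 0.0]
grad_gamma = [0.5, 0.0, 0.25, 0.5]
grad_beta = [0.25, 0.5, 0.5, 1.0]

scores = channel_scores(gamma, beta, grad_gamma, grad_beta)
mask = keep_mask(scores, prune_ratio=0.5)
```

In this toy layer, channel 2 has the largest scaling parameter yet receives the lowest score, which is the kind of case where a purely magnitude-based criterion and a gradient-based one disagree.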

An Adaptive Iterative Inpainting Method with More Information Exploration

Shengjie Chen
Zhenhua Guo
Bo Yuan

CNN-based image inpainting methods have achieved promising performance because
of their outstanding semantic understanding and reasoning capabilities. However, previous
works fail to obtain satisfactory results in some situations because information is not
fully explored. In this paper, we propose a new method by combining three innovative
ideas. First, to increase the diversity of the semantic information obtained by the
network in image synthesis, we propose a multiple hidden space perceptual (MHSP) loss,
which extracts high-level features from multiple pre-trained autoencoders. Second,
we adopt an adaptive iterative reasoning (AIR) strategy to reduce the calculations
under small-hole circumstances while ensuring the performance in large-hole circumstances.
Third, we find that color inconsistencies occasionally occur in the final image
merging process, so we add a novel interval maximum saturation (IMS) loss to the final
loss function. Experiments on the benchmark datasets show our method performs favorably
against state-of-the-art approaches. Code is made publicly available at: https://github.com/IC-LAB/adaptive_iterative_inpainting.

Assisting News Media Editors with Cohesive Visual Storylines

Gonçalo Marcelino
David Semedo
André Mourão
Saverio Blasi
João Magalhães
Marta Mrak

Creating a cohesive, high-quality, relevant media story is a challenge that news
media editors face on a daily basis. This challenge is aggravated by the flood of
highly relevant information that is constantly pouring into the newsroom. To assist
news media editors in this daunting task, this paper proposes a framework to organize
news content into cohesive, high-quality, relevant visual storylines. First, we formalize,
in a nonsubjective manner, the concept of visual story transition. Leveraging it,
we propose four graph based methods of storyline creation, aiming for global story
cohesiveness. These were created and implemented to take full advantage of existing
graph algorithms, ensuring their correctness and good computational performance. They
leverage a strong ensemble-based estimator which was trained to predict story transition
quality based on both the semantic and visual features present in the pair of images
under scrutiny. A user study covered a total of 28 curated stories about sports and
cultural events. Experiments showed that (i) visual transitions in storylines can
be learned with a quality above 90%, and (ii) the proposed graph methods can produce
cohesive storylines with a quality in the range of 88% to 96%.

MM-Flow: Multi-modal Flow Network for Point Cloud Completion

Yiqiang Zhao
Yiyao Zhou
Rui Chen
Bin Hu
Xiding Ai

Point clouds are often noisy and incomplete. Existing completion methods usually generate
the complete shapes for missing regions of 3D objects based on deterministic learning
frameworks, which only predict a single reconstruction output. However, these methods
ignore the ill-posed nature of the completion problem and do not fully account for
multiple possible completion predictions corresponding to one incomplete input. To
address this problem, we propose a flow-based network together with a multi-modal
mapping strategy for 3D point cloud completion. Specifically, an encoder is first introduced
to encode the input point cloud data into a rich latent representation suitable for
conditioning in all flow-layers. Then we design a conditional normalizing flow architecture
to learn the exact distribution of the plausible completion shapes over the multi-modal
latent space. Finally, in order to fully utilize additional shape information, we
propose a tree-structured decoder to perform the inverse mapping for complete shape
generation with high fidelity. The proposed flow network is trained using a single
negative log-likelihood loss to capture the distribution variations between
input and output, without complex reconstruction or adversarial losses. Extensive
experiments on the ShapeNet and KITTI datasets and on measured data demonstrate that
our method outperforms state-of-the-art point cloud completion methods in both
qualitative and quantitative analysis.

Long-tailed Distribution Adaptation

Zhiliang Peng
Wei Huang
Zonghao Guo
Xiaosong Zhang
Jianbin Jiao
Qixiang Ye

Recognizing images with long-tailed distributions remains a challenging problem, and
an interpretable mechanism for solving it is still lacking. In this study, we formulate
Long-tailed recognition as Domain Adaptation (LDA), by modeling the long-tailed distribution
as an unbalanced domain and the general distribution as a balanced domain. Within
the balanced domain, we propose to slack the generalization error bound, which is
defined upon the empirical risks of unbalanced and balanced domains and the divergence
between them. We propose to jointly optimize empirical risks of the unbalanced and
balanced domains and approximate their domain divergence by intra-class and inter-class
distances, with the aim to adapt models trained on the long-tailed distribution to
general distributions in an interpretable way. Experiments on benchmark datasets for
image recognition, object detection, and instance segmentation validate that our LDA
approach, beyond its interpretability, achieves state-of-the-art performance.

Lesion-Inspired Denoising Network: Connecting Medical Image Denoising and Lesion Detection

Kecheng Chen
Kun Long
Yazhou Ren
Jiayu Sun
Xiaorong Pu

Deep learning has achieved notable performance both in denoising low-quality
medical images and in detecting lesions. However, existing
low-quality medical image denoising approaches are disconnected from the detection
task of lesions. Intuitively, the quality of denoised images influences the lesion
detection accuracy, which in turn can be used to affect the denoising performance. To
this end, we propose a play-and-plug medical image denoising framework, namely Lesion-Inspired
Denoising Network (LIDnet), to collaboratively improve both denoising performance
and detection accuracy of denoised medical images. Specifically, we propose to insert
the feedback of the downstream detection task into existing denoising frameworks by jointly
learning a multi-loss objective. Instead of using a perceptual loss calculated on the
entire feature map, a novel region-of-interest (ROI) perceptual loss induced by the
lesion detection task is proposed to further connect these two tasks. To achieve better
optimization of the overall framework, we propose a customized collaborative training
strategy for LIDnet. In consideration of clinical usability and imaging characteristics,
three low-dose CT image datasets are used to evaluate the effectiveness of the proposed
LIDnet. Experiments show that, when equipped with LIDnet, both the denoising and
lesion detection performance of baseline methods can be significantly improved.

Domain Adaptive Semantic Segmentation without Source Data

Fuming You
Jingjing Li
Lei Zhu
Zhi Chen
Zi Huang

Domain adaptive semantic segmentation is recognized as a promising technique to alleviate
the domain shift between the labeled source domain and the unlabeled target domain
in many real-world applications, such as automatic pilot. However, large amounts of
source domain data often introduce significant costs in storage and training, and
sometimes the source data is inaccessible due to privacy policies. To address these
problems, we investigate domain adaptive semantic segmentation without source data,
which assumes that the model is pre-trained on the source domain and then adapted
to the target domain without accessing the source data. Since there is no supervision
from the source domain data, many self-training methods tend to fall into the winner-takes-all
dilemma, where the majority classes totally dominate the segmentation networks and
the networks fail to classify the minority classes. Consequently, we propose an effective
framework for this challenging problem with two components: positive learning and
negative learning. In positive learning, we select the class-balanced pseudo-labeled
pixels with an intra-class threshold, while in negative learning, for each pixel, we
investigate which category the pixel does not belong to with the proposed heuristic
complementary label selection. Notably, our framework can be easily implemented and
combined with other methods to further enhance the performance. Extensive experiments
on two widely-used synthetic-to-real benchmarks demonstrate our claims and the effectiveness
of our framework, which outperforms the baseline by a large margin. Code is available
at https://github.com/fumyou13/LDBE.
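
The two selection steps can be sketched roughly as follows. This is an illustrative reading rather than the released implementation: it assumes the intra-class threshold keeps the most confident fraction of pixels within each predicted class, and that complementary labels are classes whose predicted probability falls below a small cutoff. The function names, `keep_ratio`, and `low` are hypothetical.

```python
from collections import defaultdict


def select_positive(pred_classes, confidences, keep_ratio=0.5):
    """Keep the most confident pixels separately within each predicted class."""
    by_class = defaultdict(list)
    for idx, (cls, conf) in enumerate(zip(pred_classes, confidences)):
        by_class[cls].append((conf, idx))
    selected = []
    for items in by_class.values():
        items.sort(reverse=True)                      # highest confidence first
        k = max(1, int(len(items) * keep_ratio))      # at least one pixel per class
        selected.extend(idx for _, idx in items[:k])
    return sorted(selected)


def complementary_labels(probs, low=0.05):
    """Classes a pixel most likely does NOT belong to (probability below `low`)."""
    return [cls for cls, p in enumerate(probs) if p < low]


# Hypothetical predictions for six pixels: class 0 is the majority class.
pred_classes = [0, 0, 0, 0, 1, 1]
confidences = [0.99, 0.95, 0.90, 0.85, 0.60, 0.55]

kept = select_positive(pred_classes, confidences, keep_ratio=0.5)
negatives = complementary_labels([0.90, 0.07, 0.02, 0.01])
```

A single global threshold of, say, 0.8 would keep only majority-class pixels in this example, whereas the per-class rule still retains one pixel of the minority class.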

Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval

Yuchen Yang
Min Wang
Wengang Zhou
Houqiang Li

In this paper, we focus on the composed query image retrieval task, namely retrieving
the target images that are similar to a composed query, in which a modification text
is combined with a query image to describe a user's accurate search intention. Previous
methods usually focus on learning the joint image-text representations, but rarely
consider the intrinsic relationship among the query image, the target image and the
modification text. To address this problem, we propose a new cross-modal joint prediction
and alignment framework for composed query image retrieval. In our framework, the
modification text is regarded as an implicit transformation between the query image
and the target image. Motivated by this, not only should the combination of the query image
and modification text be similar to the target image, but the modification
text should also be predictable from the query image and the target image. We
align this relationship with a novel Joint Prediction Module (JPM). Our proposed
framework can seamlessly incorporate the JPM into the existing methods to effectively
improve the discrimination and robustness of visual and textual representations. The
experiments on three public datasets demonstrate the effectiveness of our proposed
framework, showing that our proposed JPM can be easily incorporated into existing
methods while effectively improving performance.

JDMAN: Joint Discriminative and Mutual Adaptation Networks for Cross-Domain Facial
Expression Recognition

Yingjian Li
Yingnan Gao
Bingzhi Chen
Zheng Zhang
Lei Zhu
Guangming Lu

Cross-domain Facial Expression Recognition (FER) is challenging due to the difficulty
of concurrently handling the domain shift and semantic gap during domain adaptation.
Existing methods mainly focus on reducing the domain discrepancy for transferable
features but fail to decrease the semantic one, which may result in negative transfer.
To this end, we propose Joint Discriminative and Mutual Adaptation Networks (JDMAN),
which collaboratively bridge the domain shift and semantic gap by domain- and category-level
co-adaptation based on mutual information and discriminative metric learning techniques.
Specifically, we design a mutual information minimization module for domain-level
adaptation, which narrows the domain shift by simultaneously distilling the domain-invariant
components and eliminating the untransferable ones lying in different domains. Moreover,
we propose a semantic metric learning module for category-level adaptation, which
can close the semantic discrepancy during discriminative intra-domain representation
learning and transferable inter-domain knowledge discovery. These two modules are
jointly leveraged in our JDMAN to safely transfer the source knowledge to target data
in an end-to-end manner. Extensive experimental results on six databases show that
our method achieves state-of-the-art performance. The code of our JDMAN is available
at https://github.com/YingjianLi/JDMAN.

Improving Weakly Supervised Object Localization via Causal Intervention

Feifei Shao
Yawei Luo
Li Zhang
Lu Ye
Siliang Tang
Yi Yang
Jun Xiao

The recently emerged weakly-supervised object localization (WSOL) methods can learn
to localize an object in the image using only image-level labels. Previous works endeavor
to perceive the integral objects from the small and sparse discriminative attention
map, yet ignore the co-occurrence confounder (e.g., duck and water), which makes
it hard for model inspection (e.g., CAM) to distinguish between the object and its context.
In this paper, we make an early attempt to tackle this challenge via causal intervention
(CI). Our proposed method, dubbed CI-CAM, explores the causalities among image features,
contexts, and categories to eliminate the biased object-context entanglement in the
class activation maps, thus improving the accuracy of object localization. Extensive
experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning
the clear object boundary from confounding contexts. In particular, on CUB-200-2011,
which suffers severely from the co-occurrence confounder, CI-CAM significantly outperforms
the traditional CAM-based baseline (58.39% vs 52.4% in Top-1 localization accuracy).
In more general scenarios such as ILSVRC 2016, CI-CAM also performs on par
with the state of the art.

Imbalanced Source-free Domain Adaptation

Xinhao Li
Jingjing Li
Lei Zhu
Guoqing Wang
Zi Huang

Conventional Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from
a well-labeled source domain to an unlabeled target domain only when data from both
domains is simultaneously accessible, which is challenged by the recent Source-free
Domain Adaptation (SFDA). However, we notice that the performance of existing SFDA
methods would be dramatically degraded by intra-domain class imbalance and inter-domain
label shift. Unfortunately, class-imbalance is a common phenomenon in real-world domain
adaptation applications. To address this issue, we present Imbalanced Source-free
Domain Adaptation (ISFDA) in this paper. Specifically, we first train a uniformed
model from the source domain, and then propose secondary label correction, curriculum
sampling, plus intra-class tightening and inter-class separation to overcome the joint
presence of covariate shift and label shift. Extensive experiments on three imbalanced
benchmarks verify that ISFDA performs favorably against existing UDA and SFDA
methods under various conditions of class-imbalance, and outperforms existing SFDA
methods by over 15% in terms of per-class average accuracy on a large-scale long-tailed
imbalanced dataset.

Learning Transferrable and Interpretable Representations for Domain Generalization

Zhekai Du
Jingjing Li
Ke Lu
Lei Zhu
Zi Huang

Conventional machine learning models are often vulnerable to samples whose
distribution differs from that of the training samples, which is known as domain shift. Domain
Generalization (DG) addresses this issue by training a model on multiple source
domains and generalizing it to arbitrary unseen target domains. In spite of the remarkable
results achieved in DG, a majority of existing works lack a deep understanding of the
feature representations learned in DG models, resulting in limited generalization
ability when facing out-of-distribution domains. In this paper, we aim to learn a domain
transformation space via a domain transformer network (DTN) which explicitly mines
the relationship among multiple domains and constructs transferable feature representations
for down-stream tasks by interpreting each feature as a semantically weighted combination
of multiple domain-specific features. Our DTN is encouraged to meta-learn the properties
and characteristics of domains during the training process based on multiple seen
domains, making transformed feature representations more semantic, thus generalizing
better to unseen domains. Once the model is constructed, the feature representations
of unseen target domains can also be inferred adaptively by selectively combining
the feature representations from the diverse set of seen domains. We conduct extensive
experiments on five DG benchmarks and the results strongly demonstrate the effectiveness
of our approach.

WAS-VTON: Warping Architecture Search for Virtual Try-on Network

Zhenyu Xie
Xujie Zhang
Fuwei Zhao
Haoye Dong
Michael C. Kampffmeyer
Haonan Yan
Xiaodan Liang

Despite recent progress on image-based virtual try-on, current methods are constrained
by shared warping networks and thus fail to synthesize natural try-on results when
faced with clothing categories that require different warping operations. In this
paper, we address this problem by finding clothing category-specific warping networks
for the virtual try-on task via Neural Architecture Search (NAS). We introduce a NAS-Warping
Module and elaborately design a bilevel hierarchical search space to identify the
optimal network-level and operation-level flow estimation architecture. Given the
network-level search space, containing different numbers of warping blocks, and the
operation-level search space with different convolution operations, we jointly learn
a combination of repeatable warping cells and convolution operations specifically
for the clothing-person alignment. Moreover, a NAS-Fusion Module is proposed to synthesize
more natural final try-on results, which is realized by leveraging particular skip
connections to produce better-fused features that are required for seamlessly fusing
the warped clothing and the unchanged person part. We adopt an efficient and stable
one-shot searching strategy to search the above two modules. Extensive experiments
demonstrate that our WAS-VTON significantly outperforms the previous fixed-architecture
try-on methods with more natural warping results and virtual try-on results.

DFR-Net: A Novel Multi-Task Learning Network for Real-Time Multi-Instrument Segmentation

Yan-Jie Zhou
Shi-Qi Liu
Xiao-Liang Xie
Zeng-Guang Hou

In computer-assisted vascular surgery, real-time multi-instrument segmentation serves
as a pre-requisite step. However, to this day, a large amount of effort in computer-assisted
intervention research has been dedicated to single-instrument rather than multi-instrument segmentation.
To fill the overlooked gap, this study introduces a Light-Weight Deep Feature Refinement
Network (DFR-Net) based on multi-task learning for real-time multi-instrument segmentation.
In this network, the proposed feature refinement module (FRM) can capture long-term
dependencies while retaining precise positional information, which helps the model locate
the foreground objects of interest. The designed channel calibration module (CCM)
can re-calibrate fusion weights of multi-level features, which helps the model balance
the importance of semantic information and appearance information. Besides, the connectivity
loss function is developed to address fractures in the wire-like structure segmentation
results. Extensive experiments on two different types of datasets consistently demonstrate
that DFR-Net can achieve state-of-the-art segmentation performance while meeting the
real-time requirements.

From Superficial to Deep: Language Bias driven Curriculum Learning for Visual Question
Answering

Mingrui Lao
Yanming Guo
Yu Liu
Wei Chen
Nan Pu
Michael S. Lew

Most Visual Question Answering (VQA) models are faced with language bias when learning
to answer a given question, thereby failing to understand multimodal knowledge simultaneously.
Based on the fact that VQA samples with different levels of language bias contribute
differently to answer prediction, in this paper, we overcome the language prior problem
by proposing a novel Language Bias driven Curriculum Learning (LBCL) approach, which
employs an easy-to-hard learning strategy with a novel difficulty metric Visual Sensitive
Coefficient (VSC). Specifically, in the initial training stage, the VQA model mainly
learns the superficial textual correlations between questions and answers (easy concept)
from more-biased examples, and then progressively focuses on learning the multimodal
reasoning (hard concept) from less-biased examples in the following stages. The examples
for each stage are selected according to our proposed difficulty
metric VSC, which evaluates the difficulty driven by the language bias of each
VQA sample. Furthermore, to avoid the catastrophic forgetting of the learned concept
during the multi-stage learning procedure, we propose to integrate knowledge distillation
into the curriculum learning framework. Extensive experiments show that our LBCL can
be generally applied to common VQA baseline models, and achieves remarkably better
performance on the VQA-CP v1 and v2 datasets, with an overall 20% accuracy boost over
baseline models.

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai

Recognizing the emotional state of people is a basic but challenging task in video
understanding. In this paper, we propose a new task in this field, named Pairwise
Emotional Relationship Recognition (PERR). This task aims to recognize the emotional
relationship between the two interactive characters in a given video clip. It is different
from the traditional emotion and social relation recognition tasks. Various kinds of information,
including character appearance, behaviors, facial emotions, dialogues, background
music, and subtitles, contribute differently to the final results, which makes
the task more challenging but meaningful in developing more advanced multi-modal models.
To facilitate the task, we develop a new dataset called Emotional RelAtionship of
inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal
dataset for the PERR task, which has 31,182 video clips, lasting about 203 video hours.
Unlike existing datasets, ERATO contains interaction-centric videos with
multiple shots, varied video lengths, and multiple modalities including visual, audio
and text. As a minor contribution, we propose a baseline model composed of Synchronous
Modal-Temporal Attention (SMTA) units to fuse the multi-modal information for the PERR
task. In contrast to other prevailing attention mechanisms, our proposed SMTA can
steadily improve the performance by about 1%. We expect ERATO and our proposed
SMTA to open up a new direction for the PERR task in video understanding and to further advance
research on multi-modal fusion methodology.

Block Popularity Prediction for Multimedia Storage Systems Using Spatial-Temporal-Sequential
Neural Networks

Yingying Cheng
Fan Zhang
Gang Hu
Yiwen Wang
Hanhui Yang
Gong Zhang
Zhuo Cheng

Predicting block popularity is of crucial importance for data placement in multi-tiered
multimedia storage systems. Traditional methods, such as least recently used and exponential
smoothing, are commonly employed to predict future block access frequencies but fail
to achieve good performance for complex and changing access patterns. Recently, deep
neural networks have brought great success to pattern recognition and prediction,
which motivates us to introduce deep learning to solve the problem of block popularity
prediction. In this paper, we first analyze and verify the temporal and spatial correlations
among the multimedia I/O traces. Then, we design a multi-dimension feature to capture
such correlations, which serves as the input of the designed deep neural network.
A spatial-temporal-sequential neural network (STSNN) and its variants that capture
the locality information, time dependency information, and block sequential information
are proposed to predict the block popularity. We systematically evaluate our STSNN
models against six baseline models from three different categories, i.e., heuristic
methods, regression methods and neural network-based methods. Experimental results show
that our proposed STSNN models are very promising for predicting block access frequencies
on some of the Huawei and Microsoft datasets and, in particular, achieve 2-6 times better
performance than the baselines in terms of the I/O hit ratio, I/O recall
rate and I/O prediction ratio on the Microsoft 64 MB-block dataset.
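
For context, the exponential smoothing baseline named in this abstract can be sketched in a few lines. This is the textbook recurrence (forecast = alpha * latest count + (1 - alpha) * previous forecast), not the proposed STSNN; the block names, access counts, and the choice of alpha are hypothetical.

```python
def exponential_smoothing(counts, alpha=0.5):
    """Forecast the next access count of one block from its per-interval history."""
    if not counts:
        raise ValueError("history must contain at least one interval")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    forecast = counts[0]
    for x in counts[1:]:
        forecast = alpha * x + (1.0 - alpha) * forecast
    return forecast


# Hypothetical per-interval access counts for two storage blocks.
history = {
    "block_a": [4, 8, 0, 0],   # was hot, now cooling down
    "block_b": [0, 0, 4, 8],   # recently became hot
}

forecasts = {name: exponential_smoothing(c) for name, c in history.items()}
# Blocks ordered from most to least popular according to the forecast.
ranking = sorted(forecasts, key=forecasts.get, reverse=True)
```

Both blocks have the same total number of accesses, but the smoothed forecast ranks the recently active block first, which is what a tiered placement policy would act on.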

Transferrable Contrastive Learning for Visual Domain Adaptation

Yang Chen
Yingwei Pan
Yu Wang
Ting Yao
Xinmei Tian
Tao Mei

Self-supervised learning (SSL) has recently become the favorite among feature learning
methodologies. It is therefore appealing for domain adaptation approaches to consider
incorporating SSL. The intuition is to enforce instance-level feature consistency
such that the predictor becomes somehow invariant across domains. However, most existing
SSL methods in the regime of domain adaptation usually are treated as standalone auxiliary
components, leaving the signatures of domain adaptation unattended. Actually, the
optimal region where the domain gap vanishes and the instance level constraint that
SSL pursues may not coincide at all. From this point, we present a particular paradigm
of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive
Learning (TCL), which links the SSL and the desired cross-domain transferability congruently.
We find contrastive learning intrinsically a suitable candidate for domain adaptation,
as its instance invariance assumption can be conveniently promoted to cross-domain
class-level invariance favored by domain adaptation tasks. Based on particular memory
bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class
domain discrepancy between source and target through a clean and novel contrastive
loss. The free lunch is, thanks to the incorporation of contrastive learning, TCL
relies on a moving-averaged key encoder that naturally achieves a temporally ensembled
version of pseudo labels for target data, which avoids pseudo label error propagation
at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive
experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet)
for both single-source and multi-source domain adaptation tasks, TCL has demonstrated
state-of-the-art performances.
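
The moving-averaged key encoder mentioned above is commonly realized as a momentum
(exponential moving average) update of the key parameters toward the query parameters.
The sketch below assumes that standard form; the abstract does not spell out the
update rule, and the flat parameter lists and the `momentum` value are hypothetical.

```python
def momentum_update(key_params, query_params, momentum=0.999):
    """Move each key-encoder parameter slightly toward the query encoder.

    key_params, query_params: flat lists of floats (hypothetical stand-ins
    for real network weights). momentum: assumed smoothing coefficient.
    """
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(key_params, query_params)
    ]


key = [0.0, 1.0]
query = [1.0, 1.0]
for _ in range(3):
    key = momentum_update(key, query, momentum=0.9)
print(key)  # the first entry drifts toward 1.0, the second stays near 1.0
```

Because the key encoder changes slowly, its predictions average over recent training
steps, which is the sense in which the pseudo labels are temporally ensembled.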

Weighted Gaussian Loss based Hamming Hashing

Rong-Cheng Tu
Xian-Ling Mao
Cihang Kong
Zihang Shao
Ze-Lin Li
Wei Wei
Heyan Huang

Recently, deep Hamming hashing methods have been proposed for Hamming space retrieval
which enables constant-time search by hash table lookups instead of linear scan. When
carrying out Hamming space retrieval, for each query datapoint, there is a Hamming
ball centered on the query datapoint, and only the datapoints within the Hamming ball
are returned as the relevant ones, while those beyond are discarded directly. Thus,
to further enhance the retrieval performance, it is key for Hamming hashing
methods to reduce the number of dissimilar datapoints within the Hamming ball. However, nearly
all existing Hamming hashing methods cannot effectively penalize the dissimilar pairs
within the Hamming ball to push them out. To tackle this problem, in this paper, we
propose a novel Weighted Gaussian Loss based Hamming Hashing, called WGLHH, which
introduces a weighted Gaussian loss to optimize the hashing model. Specifically, the weighted
Gaussian loss consists of three parts: a novel Gaussian-distribution based loss, a
novel badly-trained-pair attention mechanism and a quantization loss. The Gaussian-distribution
based loss is proposed to effectively penalize the dissimilar pairs within the Hamming
ball. The badly-trained-pair attention mechanism is proposed to assign a weight for
each data pair, which puts more weight on data pairs whose corresponding hash codes
cannot preserve the original similarity well, and less on those already handled
well. The quantization loss is used to reduce the quantization error. By incorporating
the three parts, the proposed weighted Gaussian loss heavily penalizes
the dissimilar pairs within the Hamming ball to generate more compact hashing codes.
Extensive experiments on two benchmark datasets show that the proposed method outperforms
the state-of-the-art baselines in the image retrieval task.
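
Hamming space retrieval by hash table lookup, as described at the start of this
abstract, can be sketched as follows. This is a generic illustration, not the WGLHH
training method: the function names, the integer encoding of hash codes and the
default radius are assumptions.

```python
from itertools import combinations


def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide integer code within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip one bit of the query code
            yield flipped


def build_table(codes):
    """Map each hash code to the ids of the database items that carry it."""
    table = {}
    for item_id, code in enumerate(codes):
        table.setdefault(code, []).append(item_id)
    return table


def query(table, code, n_bits, radius=2):
    """Return ids of all items whose code lies inside the query's Hamming ball."""
    hits = []
    for candidate in hamming_ball(code, n_bits, radius):
        hits.extend(table.get(candidate, []))
    return sorted(hits)


codes = [0b0000, 0b0001, 0b0111, 0b1111]
table = build_table(codes)
print(query(table, 0b0000, n_bits=4, radius=1))
```

The number of lookups depends only on the code length and the radius, not on the
database size. Every item inside the ball is returned regardless of its true
similarity, which is why dissimilar pairs need to be pushed outside the ball.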

Domain-Aware SE Network for Sketch-based Image Retrieval with Multiplicative Euclidean
Margin Softmax

Peng Lu
Gao Huang
Hangyu Lin
Wenming Yang
Guodong Guo
Yanwei Fu

This paper proposes a novel approach for Sketch-Based Image Retrieval (SBIR), for
which the key is to bridge the gap between sketches and photos in terms of the data
representation. Inspired by channel-wise attention explored in recent years, we present
a Domain-Aware Squeeze-and-Excitation (DASE) network, which seamlessly incorporates
prior knowledge of whether a sample is a sketch or a photo into the SE module and makes the SE module
capable of emphasizing appropriate channels according to the domain signal. Accordingly,
the proposed network can switch its mode to achieve a better domain feature with lower
intra-class discrepancy. Moreover, while previous works simply focus on minimizing
intra-class distance and maximizing inter-class distance, we introduce a loss function,
named Multiplicative Euclidean Margin Softmax (MEMS), which introduces a multiplicative
Euclidean margin into the feature space and ensures that the maximum intra-class distance
is smaller than the minimum inter-class distance. This facilitates learning a highly
discriminative feature space and ensures a more accurate image retrieval result. Extensive
experiments are conducted on two widely used SBIR benchmark datasets. Our approach
achieves better results on both datasets, surpassing the state-of-the-art methods
by a large margin.

FTAFace: Context-enhanced Face Detector with Fine-grained Task Attention

Deyu Wang
Dongchao Wen
Wei Tao
Lingxiao Yin
Tse-Wei Chen
Tadayuki Ito
Kinya Osa
Masami Kato

In face detection, it is a common strategy to treat samples differently according
to their difficulty for balancing training data distribution. However, we observe
that widely used sampling strategies, such as OHEM and Focal loss, can lead to the
performance imbalance between different tasks (e.g., classification and localization).
Through analysis, we point out that, because they are driven by classification information,
these sample-based strategies struggle to coordinate the attention of different
tasks during training, thus leading to the above imbalance. Accordingly, we first
confirm this by shifting the attention from the sample level to the task level. Then,
we propose a fine-grained task attention method, a.k.a. FTA, including inter-task importance
and intra-task importance, which adaptively adjusts the attention of each item in
the task from both global and local perspectives, so as to achieve finer optimization.
In addition, we introduce a transformer as a feature enhancer to assist our convolution
network, and propose a context enhancement transformer, a.k.a. CET, to mine the spatial
relationship in the features towards more robust feature representation. Extensive
experiments on WiderFace and FDDB benchmarks demonstrate that our method significantly
boosts the baseline performance by 2.7%, 2.3% and 4.9% on easy, medium and hard validation
sets respectively. Furthermore, the proposed FTAFace-light achieves higher accuracy
than the state-of-the-art and reduces the amount of computation by 28.9%.

Identity-aware Graph Memory Network for Action Detection

Jingcheng Ni
Jie Qin
Di Huang

Action detection plays an important role in high-level video understanding and media
interpretation. Many existing studies fulfill this spatio-temporal localization by
modeling the context, capturing the relationship of actors, objects, and scenes conveyed
in the video. However, they often universally treat all the actors without considering
the consistency and distinctness between individuals, leaving much room for improvement.
In this paper, we explicitly highlight the identity information of the actors in terms
of both long-term and short-term context through a graph memory network, namely identity-aware
graph memory network (IGMN). Specifically, we propose the hierarchical graph neural
network (HGNN) to comprehensively conduct long-term relation modeling within the same
identity as well as between different ones. Regarding short-term context, we develop
a dual attention module (DAM) to generate identity-aware constraint to reduce the
influence of interference by the actors of different identities. Extensive experiments
on the challenging AVA dataset demonstrate the effectiveness of our method, which
achieves state-of-the-art results on AVA v2.1 and v2.2.

Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose
Estimation

Wenkang Shan
Haopeng Lu
Shanshe Wang
Xinfeng Zhang
Wen Gao

Most of the existing 3D human pose estimation approaches mainly focus on predicting
3D positional relationships between the root joint and other human joints (local motion)
instead of the overall trajectory of the human body (global motion). Despite the great
progress achieved by these approaches, they are not robust to global motion, and lack
the ability to accurately predict local motion with a small movement range. To alleviate
these two problems, we propose a relative information encoding method that yields
positional and temporal enhanced representations. First, we encode positional information
by utilizing relative coordinates of 2D poses to enhance the consistency between the
input and output distribution. The same posture with different absolute 2D positions
can be mapped to a common representation. This helps resist the interference
of global motion on the prediction results. Second, we encode temporal information
by establishing the connection between the current pose and other poses of the same
person within a period of time. More attention will be paid to the movement changes
before and after the current pose, resulting in better prediction performance on local
motion with a small movement range. The ablation studies validate the effectiveness
of the proposed relative information encoding method. Besides, we introduce a multi-stage
optimization method to the whole framework to further exploit the positional and temporal
enhanced representations. Our method outperforms state-of-the-art methods on two public
datasets. Code is available at https://github.com/paTRICK-swk/Pose3D-RIE.
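
The positional idea above, that the same posture at different absolute 2D positions
maps to a common representation, can be sketched with root-relative coordinates.
This is an assumed, simplified encoding; the released code may define the relative
coordinates differently, and the function name and `root_index` are hypothetical.

```python
def to_root_relative(pose_2d, root_index=0):
    """Express every 2D joint relative to the root joint.

    pose_2d: list of (x, y) joint positions (hypothetical input format).
    """
    root_x, root_y = pose_2d[root_index]
    return [(x - root_x, y - root_y) for x, y in pose_2d]


pose = [(10.0, 20.0), (12.0, 25.0), (8.0, 30.0)]
# The same posture translated across the image (global motion).
shifted = [(x + 100.0, y - 50.0) for x, y in pose]

print(to_root_relative(pose) == to_root_relative(shifted))
```

Subtracting the root removes the translation, so the downstream network sees the
same input for both poses.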

Deep Neural Network Retrieval

Nan Zhong
Zhenxing Qian
Xinpeng Zhang

With the rapid development of deep learning-based techniques, the general public can
use a lot of "machine learning as a service" (MLaaS), which provides end-to-end machine
learning solutions. Taking the image classification task as an example, users only
need to upload their dataset and labels to MLaaS without requiring specific knowledge
of machine learning or a concrete structure of the classifier. Afterward, MLaaS returns
a well-trained classifier to them. In this paper, we explore a potential novel task
named "deep neural network retrieval" and its application which helps MLaaS to save
computation resources. MLaaS usually owns a huge amount of well-trained models for
various tasks and datasets. If a user requires a task that is similar to one that has
been completed previously, MLaaS can quickly retrieve a model rather than training
from scratch. We propose a pragmatic solution and two different approaches to extract
the semantic feature of DNN representing the function of DNN, which is analogous to
the usage of word2vec in natural language processing. The semantic feature of DNN
can be expressed as a vector by feeding some well-designed litmus images into the
DNN or as a matrix by reversely constructing the most desired input of DNN. Both methods
can consider the topological information and parameters of the DNN simultaneously.
Extensive experiments, including multiple datasets and networks, also demonstrate
the efficiency of our method and show the high accuracy of deep neural network retrieval.

Adversarial Learning with Mask Reconstruction for Text-Guided Image Inpainting

Xingcai Wu
Yucheng Xie
Jiaqi Zeng
Zhenguo Yang
Yi Yu
Qing Li
Wenyin Liu

Text-guided image inpainting aims to complete the corrupted patches coherent with
both visual and textual context. On one hand, existing works focus on surrounding
pixels of the corrupted patches without considering the objects in the image, resulting
in the characteristics of objects described in text being painted on non-object regions.
On the other hand, the redundant information in text may distract the generation of
objects of interest in the restored image. In this paper, we propose an adversarial
learning framework with mask reconstruction (ALMR) for image inpainting with textual
guidance, which consists of a two-stage generator and dual discriminators. The two-stage
generator aims to restore coarse-grained and fine-grained images, respectively. In
particular, we devise a dual-attention module (DAM) to incorporate the word-level
and sentence-level textual features as guidance on generating the coarse-grained and
fine-grained details in the two stages. Furthermore, we design a mask reconstruction
module (MRM) to penalize the restoration of the objects of interest with the given
textual descriptions about the objects. For adversarial training, we exploit global
and local discriminators for the whole image and corrupted patches, respectively.
Extensive experiments conducted on CUB-200-2011, Oxford-102 and CelebA-HQ show the
superior performance of the proposed ALMR (e.g., the FID value is reduced from 29.69 to 14.69
compared with the state-of-the-art approach on CUB-200-2011). Codes are available
at https://github.com/GaranWu/ALMR

Spatiotemporal Inconsistency Learning for DeepFake Video Detection

Zhihao Gu
Yang Chen
Taiping Yao
Shouhong Ding
Jilin Li
Feiyue Huang
Lizhuang Ma

The rapid development of facial manipulation techniques has aroused public concerns
in recent years. Following the success of deep learning, existing methods always formulate
DeepFake video detection as a binary classification problem and develop frame-based
and video-based solutions. However, little attention has been paid to capturing the
spatial-temporal inconsistency in forged videos. To address this issue, we term this
task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it
into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a
Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically,
we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference
over adjacent frames along both horizontal and vertical directions. The ISM
simultaneously utilizes the spatial information from SIM and temporal information
from TIM to establish a more comprehensive spatial-temporal representation. Moreover,
our STIL block is flexible and could be plugged into existing 2D CNNs. Extensive experiments
and visualizations are presented to demonstrate the effectiveness of our method against
the state-of-the-art competitors.
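
The temporal difference over adjacent frames that TIM builds on can be illustrated
in its simplest form as a per-pixel subtraction of consecutive frames. This sketch
is an assumption-laden simplification: it omits the horizontal and vertical
directional modeling, and the nested-list frame format is hypothetical.

```python
def temporal_difference(frames):
    """Return per-pixel differences between each pair of adjacent frames.

    frames: list of T frames, each a list of rows of pixel values.
    The result holds T - 1 difference maps.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


frames = [
    [[1, 2], [3, 4]],
    [[1, 5], [3, 4]],
    [[0, 5], [3, 9]],
]
print(temporal_difference(frames))
```

Static regions produce zeros, so only pixels that change between frames contribute
to the temporal signal.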

VeloCity: Using Voice Assistants for Cyclists to Provide Traffic Reports

Gian-Luca Savino
Jessé Moraes Braga
Johannes Schöning

Cycling is on the rise as a relevant alternative to car-based mobility and even though
there are mobile applications specifically designed for cyclists to support this development,
many still face unresolved challenges in terms of safe user interaction with complex
data while riding. We present the design, development, and evaluation of VeloCity
- an application for reporting traffic incidents and structures relevant to cyclists.
In a case study, we compared its three input methods (touch, in-app speech recognition,
the voice assistant of the operating system) to evaluate which attributes make for
safe interaction while cycling. We found that participants prefer to use the voice
assistant over the other modalities as it was least distracting due to its hands-
and eyes-free interaction design. Furthermore, they chose short commands over conversational
phrases. Based on our results, we present five guidelines for designing voice user
interfaces for cyclists and argue for moving away from touch-based interfaces in this
domain, which still make up most of the applied interaction techniques today.

Edit Like A Designer: Modeling Design Workflows for Unaligned Fashion Editing

Qiyu Dai
Shuai Yang
Wenjing Wang
Wei Xiang
Jiaying Liu

Fashion editing has drawn increasing research interest with its extensive application
prospect. Instead of directly manipulating the real fashion item image, it is more
intuitive for designers to modify it via the design draft. In this paper, we model
design workflows for a novel task of unaligned fashion editing, allowing the user
to edit a fashion item through manipulating its corresponding design draft. The challenge
lies in the large misalignment between the real fashion item and the design draft,
which could severely degrade the quality of editing results. To address this issue,
we propose an Unaligned Fashion Editing Network (UFE-Net). A coarsely rendered fashion
item is firstly generated from the edited design draft via a translation module. With
this as guidance, we align and manipulate the original unedited fashion item via a
novel alignment-driven fashion editing module, and then optimize the details and shape
via a reference-guided refinement module. Furthermore, a joint training strategy is
introduced to exploit the synergy between the alignment and editing tasks. Our UFE-Net
enables the edited fashion item to have semantically consistent geometric shape and
realistic details to the edited draft in the edited region, as well as to keep the
unedited region intact. Experiments demonstrate our superiority over the competing
methods on unaligned fashion editing.

Privacy-Preserving Portrait Matting

Jizhizi Li
Sihan Ma
Jing Zhang
Dacheng Tao

Recently, there has been an increasing concern about the privacy issue raised by using
personally identifiable information in machine learning. However, previous portrait
matting methods were all based on identifiable portrait images. To fill the gap, we
present P3M-10k in this paper, which is the first large-scale anonymized benchmark
for Privacy-Preserving Portrait Matting. P3M-10k consists of 10,000 high-resolution
face-blurred portrait images along with high-quality alpha mattes. We systematically
evaluate both trimap-free and trimap-based matting methods on P3M-10k and find that
existing matting methods show different generalization capabilities when following
the Privacy-Preserving Training (PPT) setting, i.e., training on face-blurred images
and testing on arbitrary images. To devise a better trimap-free portrait matting model,
we propose P3M-Net, which leverages the power of a unified framework for both semantic
perception and detail matting, and specifically emphasizes the interaction between
them and the encoder to facilitate the matting process. Extensive experiments on P3M-10k
demonstrate that P3M-Net outperforms the state-of-the-art methods in terms of both
objective metrics and subjective visual quality. Besides, it shows good generalization
capacity under the PPT setting, confirming the value of P3M-10k for facilitating future
research and enabling potential real-world applications. The source code and dataset
are available at https://github.com/JizhiziLi/P3M.

A Transformer based Approach for Image Manipulation Chain Detection

Jiaxiang You
Yuanman Li
Jiantao Zhou
Zhongyun Hua
Weiwei Sun
Xia Li

Image manipulation chain detection aims to identify the existence of involved operations
and also their orders, playing an important role in multimedia forensics and image
analysis. However, all the existing algorithms model the manipulation chain detection
as a classification problem, and can only detect chains containing up to two operations.
Due to the exponentially increased solution space and the complex interactions among
operations, how to reveal a long chain from a processed image remains a long-standing
problem in the multimedia forensic community. To address this challenge, in this paper,
we propose a new direction for manipulation chain detection. Different from previous
works, we treat the manipulation chain detection as a machine translation problem
rather than a classification one, where we model the chains as the sentences of a
target language, and each word serves as one possible image operation. Specifically,
we first transform the manipulated image into a deep feature space, and further model
the traces left by the manipulation chain as a sentence of a latent source language.
Then, we propose to detect the manipulation chain through learning the mapping from
the source language to the target one under a machine translation framework. Our method
can detect manipulation chains consisting of up to five operations, and we obtain
promising results on both the short-chain detection and the long-chain detection.

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Peng Wu
Xiangteng He
Mingqian Tang
Yiliang Lv
Jing Liu

Video-text retrieval is an important yet challenging task in vision-language understanding,
which aims to learn a joint embedding space where related video and text instances
are close to each other. Most current works simply measure the video-text similarity
based on video-level and text-level embeddings. However, the neglect of more fine-grained
or local information causes the problem of insufficient representation. Some works
exploit the local details by disentangling sentences, but overlook the corresponding
videos, causing the asymmetry of video-text representation. To address the above limitations,
we propose a Hierarchical Alignment Network (HANet) to align different level representations
for video-text matching. Specifically, we first decompose video and text into three
semantic levels, namely event (video and text), action (motion and verb), and entity
(appearance and noun). Based on these, we naturally construct hierarchical representations
in the individual-local-global manner, where the individual level focuses on the alignment
between frame and word, local level focuses on the alignment between video clip and
textual context, and global level focuses on the alignment between the whole video
and text. Different level alignments capture fine-to-coarse correlations between video
and text, and take advantage of the complementary information among the three
semantic levels. Besides, our HANet is also richly interpretable by explicitly learning
key semantic concepts. Extensive experiments on two public datasets, namely MSR-VTT
and VATEX, show the proposed HANet outperforms other state-of-the-art methods, which
demonstrates the effectiveness of hierarchical representation and alignment. Our code
is publicly available at https://github.com/Roc-Ng/HANet.

Scalable Multi-view Subspace Clustering with Unified Anchors

Mengjing Sun
Pei Zhang
Siwei Wang
Sihang Zhou
Wenxuan Tu
Xinwang Liu
En Zhu
Changjian Wang

Multi-view subspace clustering has received widespread attention to effectively fuse
multi-view information among multimedia applications. Considering that most existing
approaches' cubic time complexity makes it challenging to apply to realistic large-scale
scenarios, some researchers have addressed this challenge by sampling anchor points
to capture distributions in different views. However, the separation of the heuristic
sampling and clustering processes leads to weakly discriminative anchor points. Moreover,
the complementary multi-view information has not been well utilized since the graphs
are constructed independently by the anchors from the corresponding views. To address
these issues, we propose a Scalable Multi-view Subspace Clustering with Unified Anchors
(SMVSC). To be specific, we combine anchor learning and graph construction into a
unified optimization framework. Therefore, the learned anchors can represent the actual
latent data distribution more accurately, leading to a more discriminative clustering
structure. Most importantly, the linear time complexity of our proposed algorithm
allows the multi-view subspace clustering approach to be applied to large-scale data.
Then, we design a four-step alternating optimization algorithm with proven convergence.
Compared with state-of-the-art multi-view subspace clustering methods and large-scale
oriented methods, the experimental results on several datasets demonstrate that our
SMVSC method achieves comparable or better clustering performance much more efficiently.
The code of SMVSC is available at https://github.com/Jeaninezpp/SMVSC.

PRNet: A Progressive Recovery Network for Revealing Perceptually Encrypted Images

Tao Xiang
Ying Yang
Shangwei Guo
Hangcheng Liu
Hantao Liu

Perceptual encryption is an efficient way of protecting image content by only selectively
encrypting a portion of significant data in plain images. Existing security analysis
of perceptual encryption usually resorts to traditional cryptanalysis techniques,
which require heavy manual work and strict prior knowledge of encryption schemes.
In this paper, we introduce a new end-to-end method of analyzing the visual security
of perceptually encrypted images, without any manual work or prior knowledge
of the encryption scheme. Specifically, by leveraging convolutional neural networks
(CNNs), we propose a progressive recovery network (PRNet) to recover visual content
from perceptually encrypted images. Our PRNet is stacked with several dense attention
recovery blocks (DARBs), where each DARB contains two branches: feature extraction
branch and image recovery branch. These two branches cooperate to rehabilitate more
detailed visual information and generate efficient feature representation via densely
connected structure and dual-saliency mechanism. We conduct extensive experiments
to demonstrate that PRNet works on different perceptual encryption schemes with different
settings, and the results show that PRNet significantly outperforms the state-of-the-art
CNN-based image restoration methods.

FakeTagger: Robust Safeguards against DeepFake Dissemination via Provenance Tracking

Run Wang
Felix Juefei-Xu
Meng Luo
Yang Liu
Lina Wang

In recent years, DeepFakes have become a common threat to our society, due to the remarkable
progress of generative adversarial networks (GANs) in image synthesis. Unfortunately,
existing studies that propose various approaches for fighting DeepFakes and
determining whether a facial image is real or fake are still at an early stage. Obviously,
current DeepFake detection methods struggle to keep up with the rapid progress of GANs,
especially in the adversarial scenarios where attackers can evade the detection intentionally,
such as adding perturbations to fool the DNN-based detectors. While passive detection
simply tells whether the image is fake or real, DeepFake provenance, on the other
hand, provides clues for tracking the sources in DeepFake forensics. Thus, the tracked
fake images could be blocked immediately by administrators, preventing further spread
in social networks.

In this paper, we investigate the potential of image tagging for DeepFake
provenance tracking. Specifically, we devise a deep learning-based approach, named
FakeTagger, with a simple yet effective encoder and decoder design along with channel
coding to embed a message into the facial image, such that the embedded message can be
recovered with high confidence after various drastic GAN-based DeepFake transformations. The
embedded message could be employed to represent the identity of facial images, which
further contributes to DeepFake detection and provenance. Experimental results demonstrate
that our proposed approach could recover the embedded message with an average accuracy
of more than 95% over the four common types of DeepFakes. Our research findings confirm
the effectiveness of privacy-preserving techniques for protecting personal photos from being
DeepFaked.

Discriminative Latent Semantic Graph for Video Captioning

Yang Bai
Junyan Wang
Yang Long
Bingzhang Hu
Yang Song
Maurice Pagnucco
Yu Guan

Video captioning aims to automatically generate natural language sentences that can
describe the visual contents of a given video. Existing generative models like encoder-decoder
frameworks cannot explicitly explore the object-level interactions and frame-level
information from complex spatio-temporal data to generate semantic-rich captions.
Our main contribution is to identify three key problems in a joint framework for future
video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional
Graph that can fuse spatio-temporal information into latent object proposal. 2) Visual
Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words
with higher semantic levels. 3) Sentence Validation: A novel Discriminative Language
Validator is proposed to verify generated captions so that key semantic concepts can
be effectively preserved. Our experiments on two public datasets (MSVD and MSR-VTT)
manifest significant improvements over state-of-the-art approaches on all metrics,
especially for BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.

From Image to Imuge: Immunized Image Generation

Qichao Ying
Zhenxing Qian
Hang Zhou
Haisheng Xu
Xinpeng Zhang
Siyi Li

We introduce Imuge, an image tamper resilient generative scheme for image self-recovery.
The traditional manner of concealing image content within the image is inflexible
and fragile to diverse digital attacks, e.g., image cropping and JPEG compression. To
address this issue, we jointly train a U-Net backboned encoder, a tamper localization
network and a decoder for image recovery. Given an original image, the encoder produces
a visually indistinguishable immunized image. At the recipient's side, the verifying
network localizes the malicious modifications, and the original content can be approximately
recovered by the decoder, despite the presence of the attacks. Several strategies
are proposed to boost the training efficiency. We demonstrate that our method can
recover the details of the tampered regions with a high quality despite the presence
of various kinds of attacks. Comprehensive ablation studies are conducted to validate
our network designs.

Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting

Sravya Vardhani Shivapuja
Mansi Pradeep Khamkar
Divij Bajaj
Ganesh Ramakrishnan
Ravi Kiran Sarvadevabhatla

Datasets for training crowd counting deep networks are typically heavy-tailed in count
distribution and exhibit discontinuities across the count range. As a result, the
de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable
indicators of performance across the count range. To address these concerns in a holistic
manner, we revise processes at various stages of the standard crowd counting pipeline.
To enable principled and balanced minibatch sampling, we propose a novel smoothed
Bayesian sample stratification approach. We propose a novel cost function which can
be readily incorporated into existing crowd counting deep networks to encourage strata-aware
optimization. We analyze the performance of representative crowd counting approaches
across standard datasets at the per-stratum level and in aggregate, and demonstrate that our proposed
modifications noticeably reduce error standard deviation. Our contributions represent
a nuanced, statistically balanced and fine-grained characterization of performance
for crowd counting approaches.
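
Balanced minibatch sampling over count strata can be sketched as below. This is a
plain stratified sampler with hand-picked bin edges, not the smoothed Bayesian
stratification the paper proposes; the function names, bin edges and batch quota
are all assumptions.

```python
import random
from collections import defaultdict


def assign_strata(counts, bin_edges):
    """Group sample indices by the count bin each ground-truth count falls in."""
    strata = defaultdict(list)
    for idx, count in enumerate(counts):
        # The bin id is the number of edges the count has reached.
        bin_id = sum(count >= edge for edge in bin_edges)
        strata[bin_id].append(idx)
    return dict(strata)


def balanced_minibatch(strata, per_stratum, rng):
    """Draw the same number of samples from every stratum."""
    batch = []
    for bin_id in sorted(strata):
        # Sample with replacement so sparse strata can still fill their quota.
        batch.extend(rng.choices(strata[bin_id], k=per_stratum))
    return batch


counts = [3, 5, 8, 40, 900]  # heavy-tailed toy crowd counts
strata = assign_strata(counts, bin_edges=[10, 100])
print(strata)
print(balanced_minibatch(strata, per_stratum=2, rng=random.Random(0)))
```

With uniform sampling, the single image with 900 people would rarely appear in a
batch; drawing per stratum gives the sparse high-count range equal representation.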

Demystifying Commercial Video Conferencing Applications

Insoo Lee
Jinsung Lee
Kyunghan Lee
Dirk Grunwald
Sangtae Ha

Video conferencing applications have seen explosive growth both in the number of available
applications and their use. However, there have been few studies on the detailed analysis
of video conferencing applications with respect to network dynamics, yet understanding
these dynamics is essential for network design and improving these applications. In
this paper, we carry out an in-depth measurement and modeling study on the rate control
algorithms used in six popular commercial video conferencing applications. Based on
macroscopic behaviors commonly observed across these applications in our extensive
measurements, we construct a unified architecture to model the rate control mechanisms
of individual applications. We then reconstruct each application's rate control by
inferring key parameters that closely follow its rate control and quality adaptation
behaviors. To our knowledge, this is the first work that reverse-engineers rate control
algorithms of popular video conferencing applications, which are often unknown or
hidden as they are proprietary software. We confirm our analysis and models using
an end-to-end testbed that can capture the dynamics of each application under a variety
of network conditions. We also show how we can use these models to gain insights into
the particular behaviors of an application in two practical scenarios.

LightFEC: Network Adaptive FEC with a Lightweight Deep-Learning Approach

Han Hu
Sheng Cheng
Xinggong Zhang
Zongming Guo

Interest in real-time video streaming has reached a peak. To deal with packet loss
and optimize users' Quality of Experience (QoE), forward error correction (FEC) has
been studied and applied extensively. The performance of FEC depends on how precisely
the future loss pattern is predicted, yet previous research has not provided a robust
packet loss prediction method. In this work, we propose LightFEC to predict packet
loss patterns accurately and quickly. By applying long short-term memory (LSTM) networks,
clustering algorithms and model compression methods, LightFEC is able to accurately
predict packet loss in various network conditions without consuming too much time.
Our experiments show that LightFEC outperforms other schemes on prediction accuracy,
which improves the packet recovery ratio while keeping the redundancy ratio at a low
level.
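
The trade-off between recovery ratio and redundancy ratio can be sketched with a simple block-level model. The code below does not reproduce LightFEC or its LSTM predictor. It assumes an ideal erasure code in which a block of k data packets plus r redundant packets is fully recoverable whenever at most r packets of the block are lost, and it defines the redundancy ratio as r / (k + r). The function name and the toy loss trace are hypothetical.

```python
def fec_block_stats(loss_trace, data_per_block, redundant_per_block):
    """Evaluate a fixed (k data + r redundant) erasure code on a loss trace.

    loss_trace is a list of booleans, True meaning the packet was lost.
    Assumption: a block is fully recoverable when the number of lost
    packets in it does not exceed redundant_per_block.
    """
    block_size = data_per_block + redundant_per_block
    lost_data = 0
    recovered_data = 0
    # Only complete blocks are evaluated; a trailing partial block is ignored.
    for start in range(0, len(loss_trace) - block_size + 1, block_size):
        block = loss_trace[start:start + block_size]
        data_losses = sum(block[:data_per_block])
        total_losses = sum(block)
        lost_data += data_losses
        if total_losses <= redundant_per_block:
            recovered_data += data_losses
    recovery_ratio = recovered_data / lost_data if lost_data else 1.0
    redundancy_ratio = redundant_per_block / block_size
    return recovery_ratio, redundancy_ratio

# Hypothetical trace: one isolated loss, then a burst of two losses.
trace = [True, False, False, False, False,
         True, True, False, False, False]
print(fec_block_stats(trace, 4, 1))  # low redundancy, burst not recovered
print(fec_block_stats(trace, 3, 2))  # higher redundancy, burst recovered
```

A predictor that anticipates the burst could raise the redundancy only for the second block, which is the motivation for loss prediction stated in the abstract.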

SOGAN: 3D-Aware Shadow and Occlusion Robust GAN for Makeup Transfer

Yueming Lyu
Jing Dong
Bo Peng
Wei Wang
Tieniu Tan

In recent years, virtual makeup applications have become more and more popular. However,
it is still challenging to propose a robust makeup transfer method in the real-world
environment. Current makeup transfer methods mostly work well on well-conditioned
clean makeup images, but give unsatisfactory results when the makeup to be transferred
exhibits shadow and occlusion. To address this, we propose a novel makeup transfer
method, called
3D-Aware Shadow and Occlusion Robust GAN (SOGAN). Given the source and the reference
faces, we first fit a 3D face model and then disentangle the faces into shape and
texture. In the texture branch, we map the texture to the UV space and design a UV
texture generator to transfer the makeup. Since human faces are symmetrical in the
UV space, we can conveniently remove the undesired shadow and occlusion from the reference
image by carefully designing a Flip Attention Module (FAM). After obtaining cleaner
makeup features from the reference image, a Makeup Transfer Module (MTM) is introduced
to perform accurate makeup transfer. The qualitative and quantitative experiments
demonstrate that our SOGAN not only achieves superior results in shadow and occlusion
situations but also performs well in large pose and expression variations.

SESSION: Reproducibility

Reproducibility Companion Paper: Campus3D: A Photogrammetry Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding

Yuqing Liao
Xinke Li
Zekun Tong
Yabang Zhao
Andrew Lim
Zhenzhong Kuang
Cise Midoglu

This companion paper supports the replication of the paper "Campus3D: A Photogrammetry
Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding", which was presented
at ACM Multimedia 2020. The supported paper's main purpose was to provide a photogrammetry
point cloud-based dataset with hierarchical multilabels to facilitate the area of
3D deep learning. Based on this provided dataset and source code, in this work, we
build a complete package to reimplement the proposed methods and experiments (i.e.,
the hierarchical learning framework and the benchmarks of the hierarchical semantic
segmentation task). Specifically, this paper contains the technical details of the
package, including file structure, dataset preparation, installation package, and
how to run the experiments. We also present the replicated experiment results
and indicate our contributions to the original implementation.

Reproducibility Companion Paper: Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality
Assessment

Dingquan Li
Tingting Jiang
Ming Jiang
Vajira Lasantha Thambawita
Haoliang Wang

This companion paper supports the experimental replication of the paper "Norm-in-Norm
Loss with Faster Convergence and Better Performance for Image Quality Assessment"
presented at ACM Multimedia 2020. We provide the software package for replicating
the implementation of the "Norm-in-Norm" loss and the corresponding "LinearityIQA"
model used in the original paper. This paper contains the guidelines to reproduce
all the experimental results of the original paper.

Reproducibility Companion Paper: Kalman Filter-Based Head Motion Prediction for Cloud-Based Mixed Reality

Serhan Gül
Sebastian Bosse
Dimitri Podborski
Thomas Schierl
Cornelius Hellge
Marc A. Kastner
Jan Zahálka

In our MM'20 paper, we presented a Kalman filter-based approach for prediction of
head motion in 6DoF. The proposed approach was employed in our cloud-based volumetric
video streaming system to reduce the interaction latency experienced by the user.
In this companion paper, we present the dataset collected for our experiments and
our simulation framework that reproduces the obtained experimental results. Our implementation
is freely available on GitHub to facilitate further research.
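
As a rough illustration of Kalman filter-based motion prediction, the sketch below tracks a single translational axis with a constant-velocity model and extrapolates the estimate over a look-ahead horizon. It is not the authors' 6DoF implementation: the class name, the noise variances, the diagonal process noise, and the synthetic trajectory are assumptions, and orientation is not handled.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position measurements."""

    def __init__(self, process_var=1e-3, measurement_var=1e-2):
        self.pos, self.vel = 0.0, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = process_var, measurement_var

    def step(self, measurement, dt):
        # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos = self.pos + dt * self.vel
        vel = self.vel
        (a, b), (c, d) = self.P
        p00 = a + dt * (b + c) + dt * dt * d + self.q
        p01 = b + dt * d
        p10 = c + dt * d
        p11 = d + self.q
        # Update with measurement matrix H = [1, 0].
        residual = measurement - pos
        s = p00 + self.r               # innovation variance
        k0, k1 = p00 / s, p10 / s      # Kalman gain
        self.pos = pos + k0 * residual
        self.vel = vel + k1 * residual
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]

    def predict_ahead(self, horizon):
        """Extrapolate the position 'horizon' seconds into the future."""
        return self.pos + horizon * self.vel

# Synthetic, noise-free motion at 2 units per second, sampled every 0.1 s.
kf = ConstantVelocityKalman1D()
dt = 0.1
for i in range(1, 201):
    kf.step(2.0 * i * dt, dt)
print(kf.predict_ahead(0.5))  # position expected 0.5 s after the last sample
```

In a streaming setting the horizon would be chosen to match the interaction latency that the prediction is meant to compensate.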

Reproducibility Companion Paper: Blind Natural Video Quality Prediction via Statistical Temporal Features and Deep
Spatial Features

Jari Korhonen
Yicheng Su
Junyong You
Steven Hicks
Cise Midoglu

Blind natural video quality assessment (BVQA), also known as no-reference video quality
assessment, is a highly active research topic. In our recent contribution titled "Blind
Natural Video Quality Prediction via Statistical Temporal Features and Deep Spatial
Features" published in ACM Multimedia 2020, we proposed a two-level video quality
model employing statistical temporal features and spatial features extracted by a
deep convolutional neural network (CNN) for this purpose. At the time of publishing,
the proposed model (CNN-TLVQM) achieved state-of-the-art results in BVQA. In this
paper, we describe the process of reproducing the published results by using CNN-TLVQM
on two publicly available natural video quality datasets.

Reproducibility Companion Paper: Describing Subjective Experiment Consistency by p-Value P-P Plot

Jakub Nawala
Lucjan Janowski
Bogdan Cmiel
Krzysztof Rusek
Marc A. Kastner
Jan Zahálka

In this paper we reproduce experimental results presented in our earlier work titled
"Describing Subjective Experiment Consistency by p-Value P-P Plot" that was presented
in the course of the 28th ACM International Conference on Multimedia. The paper aims
at verifying the soundness of our prior results and helping others understand our
software framework. We present artifacts that help reproduce tables, figures and all
the data derived from raw subjective responses that were included in our earlier work.
Using the artifacts we show that our results are reproducible. We invite everyone
to use our software framework for analyses of subjective responses that go beyond
reproducibility efforts.
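
For readers unfamiliar with P-P plots of p-values, the following generic sketch computes the plotted points: for each significance level, the fraction of p-values at or below it. It is not the authors' software framework, and the function name and sample values are hypothetical. The underlying fact is that p-values from a continuous test statistic are uniformly distributed on [0, 1] when the null hypothesis holds, so the points should lie close to the diagonal.

```python
def pp_plot_points(p_values, thresholds):
    """Return (alpha, fraction of p-values <= alpha) for each threshold alpha."""
    n = len(p_values)
    return [(alpha, sum(p <= alpha for p in p_values) / n)
            for alpha in thresholds]

# Hypothetical p-values, e.g. one per stimulus in a subjective experiment.
sample_p_values = [0.01, 0.2, 0.5, 0.9]
points = pp_plot_points(sample_p_values, [0.05, 0.5, 1.0])
print(points)
```

Points lying well above the diagonal at small significance levels indicate more small p-values than the null hypothesis would predict.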

Reproducibility Companion Paper: Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

Li Tao
Xueting Wang
Toshihiko Yamasaki
Jingjing Chen
Steven Hicks

In this companion paper, we provide details of the artifacts to support the replication
of "Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework",
which was presented at MM'20. The Inter-intra Contrastive (IIC) framework aims to
extract more discriminative temporal information by extending intra-negative samples
in contrastive self-supervised learning. In this paper, we first summarize our contribution.
Then we explain the file structure of the source code and detailed settings. Since
our proposal is a framework that contains many different settings, we provide
some custom settings to help other researchers use our methods easily. The source
code is available at https://github.com/BestJuly/IIC.
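
The effect of adding negatives to a contrastive objective can be sketched with a generic InfoNCE-style loss over similarity scores. This is an illustration under assumptions, not the IIC implementation: the function name, the temperature, and the similarity values are hypothetical, and the extra negative merely stands in for a sample drawn from the same video as the anchor.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive logit."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

inter_negatives = [0.1, 0.2]      # hypothetical similarities to other videos
intra_negative = 0.8              # hypothetical similarity to a same-video negative
base_loss = info_nce(0.9, inter_negatives)
extended_loss = info_nce(0.9, inter_negatives + [intra_negative])
print(base_loss, extended_loss)
```

Because the added negative is similar to the anchor, it raises the loss and forces the encoder to separate samples that share appearance but differ in other respects.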

Reproducibility Companion Paper: Visual Relation of Interest Detection

Fan Yu
Haonan Wang
Tongwei Ren
Jinhui Tang
Gangshan Wu
Jingjing Chen
Zhenzhong Kuang

In this companion paper, we provide the details of the reproducibility artifacts of
the paper "Visual Relation of Interest Detection" presented at MM'20. Visual Relation
of Interest Detection (VROID) aims to detect visual relations that are important for
conveying the main content of an image. In this paper, we explain the file structure
of the source code and publish the details of our ViROI dataset, which can be used
to retrain the model with custom parameters. We also detail the scripts for component
analysis and comparison with other methods and list the parameters that can be modified
for custom training and inference.

Reproducibility Companion Paper: On Learning Disentangled Representation for Acoustic
Event Detection

Lijian Gao
Qirong Mao
Jingjing Chen
Ming Dong
Ratna Chinnam
Lucile Sassatelli
Miguel Romero Rondon
Ujjwal Sharma

This companion paper is provided to describe the major experiments reported in our
paper "On Learning Disentangled Representation for Acoustic Event Detection" published
in ACM Multimedia 2019. To make the replication of our work easier, we first give
an introduction of the computing environment where all of our experiments are conducted.
Furthermore, we provide an environment configuration file to set up the compiling
environment and other artifacts including the source code, datasets and the files
generated during our experiments. Finally, we summarize the structure and usage of
the source code. For more details, please consult the README file in the archive of
artifacts on GitHub: https://github.com/mastergofujs/SED_PyTorch.

SESSION: Keynote Talk V&VI

AI and the Future of Education

James Lester

It has become clear that AI will profoundly transform society. AI will dramatically
change the socio-technological landscape, produce seismic economic shifts, and fundamentally
reshape the workforce in ways that we are only beginning to grasp. With its imminent
arrival, it is critically important to deeply engage with questions around how we
should design education in the Age of AI. Fortunately, while we must address the significant
challenges posed by AI, we can also leverage AI itself to address these challenges.
In this talk we will consider how (and at what rate) AI technologies for education
will evolve, discuss emerging innovations in AI-augmented learning environments for
formal and informal contexts, and explore what competencies will be elevated in an
AI-pervasive workforce. We will discuss near-future AI technologies that leverage
advances in natural language processing, computer vision, and machine learning to
create narrative-centered learning environments, embodied conversational agents for
learning, and multimodal learning analytics. We will conclude by considering what
all of these developments suggest for K-12 education and the future of human learning.

Digital Human in an Integrated Physical-Digital World (IPhD)

Zhengyou Zhang

With the rapid development of digital technologies such as VR, AR, XR, and more importantly
the almost ubiquitous mobile broadband coverage, we are entering an Integrated Physical-Digital
World (IPhD), the tight integration of the virtual world with the physical world. The
IPhD is characterized by four key technologies: Virtualization of the physical world,
Realization of the virtual world, Holographic internet, and Intelligent Agent. The
Internet will continue its development with faster speed and broader bandwidth, and
will eventually be able to communicate holographic content including 3D shape, appearance,
spatial audio, touch sensing and smell. Intelligent agents, such as digital humans and
digital/physical robots, travel between the digital and physical worlds. In this talk,
we will describe our work on digital humans for this IPhD world. This includes: computer
vision techniques
for building digital humans, multimodal text-to-speech synthesis (voice and lip shapes),
speech-driven face animation, neural-network-based body motion control, human-digital-human
interaction, and an emotional video game anchor.

SESSION: Session 24: Media Interpretation-III

Cross-Camera Feature Prediction for Intra-Camera Supervised Person Re-identification
across Distant Scenes

Wenhang Ge
Chunyan Pan
Ancong Wu
Hongwei Zheng
Wei-Shi Zheng

Person re-identification (Re-ID) aims to match person images across non-overlapping
camera views. The majority of Re-ID methods focus on small-scale surveillance systems
in which each pedestrian is captured in different camera views of adjacent scenes.
However, in large-scale surveillance systems that cover larger areas, it is required
to track a pedestrian of interest across distant scenes (e.g., a criminal suspect
escapes from one city to another). Since most pedestrians appear in limited local
areas, it is difficult to collect training data with cross-camera pairs of the same
person. In this work, we study intra-camera supervised person re-identification across
distant scenes (ICS-DS Re-ID), which uses cross-camera unpaired data with intra-camera
identity labels for training. It is challenging as cross-camera paired data plays
a crucial role for learning camera-invariant features in most existing Re-ID methods.
To learn camera-invariant representation from cross-camera unpaired training data,
we propose a cross-camera feature prediction method that mines cross-camera self supervision
information from camera-specific feature distributions by transforming fake cross-camera
positive feature pairs and minimizing the distances of the fake pairs. Furthermore,
we automatically localize and extract local-level features with a transformer. Joint
learning of global-level and local-level features forms a global-local cross-camera
feature prediction scheme for mining fine-grained cross-camera self supervision information.
Finally, cross-camera self supervision and intra-camera supervision are aggregated
in a framework. The experiments are conducted in the ICS-DS setting on Market-SCT,
Duke-SCT and MSMT17-SCT datasets. The evaluation results demonstrate the superiority
of our method, which gains significant improvements of 15.4 Rank-1 and 22.3 mAP on
Market-SCT as compared to the second best method. Our code is available at https://github.com/g3956/CCFP.

Video Visual Relation Detection via Iterative Inference

Xindi Shang
Yicong Li
Junbin Xiao
Wei Ji
Tat-Seng Chua

The core problem of video visual relation detection (VidVRD) lies in accurately classifying
the relation triplets, which comprise the classes of subject and object entities
and the predicate classes of various relationships between them. Existing VidVRD approaches
classify these three relation components in either an independent or a cascaded manner,
and thus fail to fully exploit the inter-dependency among them. In order to utilize this
inter-dependency in tackling the challenges of visual relation recognition in videos,
we propose a novel iterative relation inference approach for VidVRD. We derive our
model from the viewpoint of joint relation classification which is light-weight yet
effective, and propose a training approach to better learn the dependency knowledge
from the likely correct triplet combinations. As such, the proposed inference approach
is able to gradually refine each component based on its learnt dependency and the
other two's predictions. Our ablation studies show that this iterative relation inference
can empirically converge in a few steps and consistently boost the performance over
baselines. Further, we incorporate it into a newly designed VidVRD architecture, named
VidVRD-II (Iterative Inference), which generalizes well across different datasets.
Experiments show that VidVRD-II achieves state-of-the-art performance on both
the ImageNet-VidVRD and VidOR benchmark datasets.

Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation

Jiahui Li
Kun Kuang
Lin Li
Long Chen
Songyang Zhang
Jian Shao
Jun Xiao

Interpreting model knowledge is an essential topic to improve human understanding
of deep black-box models. Traditional methods provide intuitive instance-wise
explanations that allocate importance scores to low-level features (e.g., pixels
for images). To adapt to the human way of thinking, one strand of recent research
has shifted its spotlight to mining important concepts. However, these concept-based
interpretation methods focus on computing the contribution of each discovered concept
at the class level and cannot give precise instance-wise explanations. Besides,
they consider each concept as an independent unit, and ignore the interactions among
concepts. To this end, in this paper, we propose a novel COncept-based NEighbor Shapley
approach (dubbed as CONE-SHAP) to evaluate the importance of each concept by considering
its physical and semantic neighbors, and interpret model knowledge with both instance-wise
and class-wise explanations. Thanks to this design, the interactions among concepts
in the same image are fully considered. Meanwhile, the computational complexity of
Shapley Value is reduced from exponential to polynomial. Moreover, for a more comprehensive
evaluation, we further propose three criteria to quantify the rationality of the allocated
contributions for the concepts, including coherency, complexity, and faithfulness.
Extensive experiments and ablations have demonstrated that our CONE-SHAP algorithm
outperforms existing concept-based methods and simultaneously provides precise explanations
for each instance and class.
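
To make the exponential cost of the Shapley value concrete, the sketch below computes exact Shapley values for a toy cooperative game by enumerating every coalition. It is not CONE-SHAP and does not implement the neighbor-based approximation; the function names, the three placeholder concepts, and the toy value function (one unit per concept plus a bonus when "a" and "b" appear together) are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, value):
    """Exact Shapley values; 'value' maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size that excludes the player.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {player}) - value(coalition))
        result[player] = total
    return result

def toy_value(coalition):
    """Hypothetical model score: additive, plus an interaction between a and b."""
    bonus = 1.0 if {"a", "b"} <= coalition else 0.0
    return len(coalition) + bonus

phi = exact_shapley(["a", "b", "c"], toy_value)
print(phi)
```

Each player requires 2^(n-1) coalition evaluations here, which is why restricting the coalitions considered, as the abstract describes for neighboring concepts, matters for larger concept sets. The interaction bonus is shared equally between "a" and "b", showing how Shapley values account for interactions that an independent per-concept score would miss.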

Multifocal Attention-Based Cross-Scale Network for Image De-raining

Zheyu Zhang
Yurui Zhu
Xueyang Fu
Zhiwei Xiong
Zheng-Jun Zha
Feng Wu

Although existing deep learning-based image de-raining methods have achieved promising
results, most of them only extract single scale features, and neglect the fact that
similar rain streaks appear repeatedly across different scales. Therefore, this paper
aims to explore the cross-scale cues in a multi-scale fashion. Specifically, we first
introduce an adaptive-kernel pyramid to provide effective multi-scale information.
Then, we design two cross-scale similarity attention blocks (CSSABs) to search spatial
and channel relationships between two scales, respectively. The spatial CSSAB explores
the spatial similarity between pixels of cross-scale features, while the channel CSSAB
emphasizes the interdependencies among cross-scale features. To further improve the
diversity of features, we adopt the wavelet transformation and multi-head mechanism
in CSSABs to generate multifocal features which focus on different areas. Finally,
based on our CSSABs, we construct an effective multifocal attention-based cross-scale
network, which exhaustively utilizes the cross-scale correlations of both rain streaks
and background, to achieve image de-raining. Experiments show the superiority of our
network over state-of-the-art image de-raining approaches both qualitatively and quantitatively.
The source code and pre-trained models are available at https://github.com/zhangzheyu0/Multifocal_derain.

PFFN: Progressive Feature Fusion Network for Lightweight Image Super-Resolution

Dongyang Zhang
Changyu Li
Ning Xie
Guoqing Wang
Jie Shao

Recently, the convolutional neural network (CNN) has been the core ingredient of modern
models, triggering the surge of deep learning in super-resolution (SR). Despite the
great success of these CNN-based methods, which tend to be deeper and heavier,
it is impractical to apply them directly on low-budget devices due
to the excessive computational overhead. To alleviate this problem, a novel lightweight
SR network named progressive feature fusion network (PFFN) is developed to seek a
better balance between performance and running efficiency. Specifically, to fully
exploit the feature maps, a novel progressive attention block (PAB) is proposed as
the main building block of PFFN. The proposed PAB adopts several parallel but connected
paths with pixel attention, which could significantly increase the receptive field
of each layer, distill useful information and finally learn more discriminative feature
representations. In PAB, a powerful dual attention module (DAM) is further incorporated
to provide the channel and spatial attention mechanism in a fairly lightweight manner.
Besides, we construct a concise and effective upsampling module with the help
of multi-scale pixel attention, named MPAU. All of the above modules ensure the network
can benefit from the attention mechanism while still being lightweight enough. Furthermore,
a novel training strategy following the cosine annealing learning scheme is proposed
to maximize the representation ability of the model. Comprehensive experiments show
that our PFFN achieves the best performance against all existing lightweight state-of-the-art
SR methods with fewer parameters and even performs comparably to computationally
expensive networks.
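
For reference, the standard single-cycle cosine annealing schedule can be written in a few lines. This shows only the well-known schedule, not the training strategy proposed in the paper; the function name and the learning-rate values are hypothetical.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical schedule: 100 steps from 1e-3 down to 1e-5.
for step in (0, 50, 100):
    print(step, cosine_annealed_lr(step, 100, 1e-3, 1e-5))
```

The rate changes slowly near the start and the end of training and fastest in the middle, which is the property that usually motivates this schedule.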

InterBN: Channel Fusion for Adversarial Unsupervised Domain Adaptation

Mengzhu Wang
Wei Wang
Baopu Li
Xiang Zhang
Long Lan
Huibin Tan
Tianyi Liang
Wei Yu
Zhigang Luo

A classifier trained on one dataset rarely works on other datasets obtained under
different conditions because of domain shift. Such a problem is usually solved
by domain adaptation methods. In this paper, we propose a novel unsupervised domain
adaptation (UDA) method based on Interchangeable Batch Normalization (InterBN) to
fuse different channels in deep neural networks for adversarial domain adaptation. Specifically,
we first observe that the channels with small batch normalization scaling factors have
less influence on the whole domain adaptation, followed by a theoretical proof that
the scaling factors for some channels will definitely come close to zero when imposing
a sparsity regularization. Then, we replace the channels that have smaller scaling
factors in the source domain with the mean of the channels which have larger scaling
factors in the target domain or vice versa. Such a simple but effective channel fusion
scheme can drastically increase the domain adaptation ability. Extensive experimental
results show that our InterBN outperforms the current adversarial domain
adaptation methods by a large margin on four visual benchmarks. In particular, InterBN
achieves a remarkable improvement of 7.7% over the conditional adversarial adaptation
networks (CDAN) on the VisDA-2017 benchmark.
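
The channel fusion idea described in the abstract can be sketched on plain Python lists. This is an illustration under assumptions rather than the InterBN implementation: the function name is hypothetical, channels are represented as flat lists of activations, and the fixed threshold used to separate small from large scaling factors is an assumption (the paper may select channels differently).

```python
def fuse_channels(own_channels, own_gammas, other_channels, other_gammas, threshold):
    """Replace weak channels of one domain with the mean of the other's strong ones.

    A channel is 'weak' when |gamma| < threshold and 'strong' otherwise,
    where gamma is its batch normalization scaling factor.
    """
    strong = [ch for ch, g in zip(other_channels, other_gammas)
              if abs(g) >= threshold]
    if not strong:
        return [list(ch) for ch in own_channels]  # nothing to fuse with
    # Element-wise mean over the strong channels of the other domain.
    mean_strong = [sum(vals) / len(strong) for vals in zip(*strong)]
    return [list(mean_strong) if abs(g) < threshold else list(ch)
            for ch, g in zip(own_channels, own_gammas)]

# Hypothetical activations: two source channels, three target channels.
source = [[1, 1], [2, 2]]
source_gammas = [0.9, 0.01]
target = [[10, 20], [30, 40], [5, 5]]
target_gammas = [0.8, 0.7, 0.001]
fused = fuse_channels(source, source_gammas, target, target_gammas, 0.1)
print(fused)
```

Swapping the argument order gives the reverse direction mentioned in the abstract, where weak target channels receive the mean of strong source channels.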

SESSION: Session 25: Multimedia Art, Entertainment and Culture

Learning to Compose Stylistic Calligraphy Artwork with Emotions

Shaozu Yuan
Ruixue Liu
Meng Chen
Baoyang Chen
Zhijie Qiu
Xiaodong He

Emotion plays a critical role in calligraphy composition: it makes a calligraphy
artwork impressive and gives it a soul. However, previous research on calligraphy generation
has neglected emotion as a major contributor to the artistry of calligraphy. As a
result, such methods cannot generate aesthetic, stylistic, and diverse calligraphy
artworks, only static handwriting font libraries. To address this problem,
we propose a novel cross-modal approach to generate stylistic and diverse Chinese
calligraphy artwork driven by different emotions automatically. We first detect
the emotions in the text with a classifier, then generate the emotional Chinese character
images via a novel modified Generative Adversarial Network (GAN) structure, and finally
predict the layout for all character images with a recurrent neural network. We
also collect a large-scale stylistic Chinese calligraphy image dataset with rich emotions.
Experimental results demonstrate that our model outperforms all baseline image translation
models significantly for different emotional styles in terms of content accuracy and
style discrepancy. Besides, our layout algorithm can also learn the patterns and habits
of calligraphers, making the generated calligraphy more artistic. To the best of
our knowledge, we are the first to work on emotion-driven discourse-level Chinese
calligraphy artwork composition.

Graph Neural Networks for Knowledge Enhanced Visual Representation of Paintings

Athanasios Efthymiou
Stevan Rudinac
Monika Kackovic
Marcel Worring
Nachoem Wijnberg

We propose ArtSAGENet, a novel multimodal architecture that integrates Graph Neural
Networks (GNNs) and Convolutional Neural Networks (CNNs), to jointly learn visual
and semantic-based artistic representations. First, we illustrate the significant
advantages of multi-task learning for fine art analysis and argue that it is conceptually
a much more appropriate setting in the fine art domain than the single-task alternatives.
We further demonstrate that several GNN architectures can outperform strong CNN baselines
in a range of fine art analysis tasks, such as style classification, artist attribution,
creation period estimation, and tag prediction, while training them requires an order
of magnitude less computational time and only a small amount of labeled data. Finally,
through extensive experimentation we show that our proposed ArtSAGENet captures and
encodes valuable relational dependencies between the artists and the artworks, surpassing
the performance of traditional methods that rely solely on the analysis of visual
content. Our findings underline a great potential of integrating visual content and
semantics for fine art analysis and curation.

ArtScience and the ICECUBE LED Display [ILDm^3]

Mark-David Hosale
Robert Allison
Jim Madsen
Marcus Gordon

ICECUBE LED Display [ILDm^3] is a cubic-meter, 1/1000th scale model of the IceCube
Neutrino Observatory, a novel telescope that looks for nearly invisible cosmic messengers,
neutrinos, using a cubic-kilometer of instrumented ice starting 1450 meters below
the surface at the South Pole. The display uses art methodologies as a means for expressing
imperceptible astrophysical events as sound, light and colour in the domain of the
human sensorium. The experience is as aesthetically critical as it is facilitatory
to an intuitive understanding of subatomic astrophysical data, leading to new ways
of knowing about our Universe and its processes.

The objective of this project was to build a static volumetric display as a model
of IceCube for visualization of spatio-temporal data recorded by the observatory.
While the primary use of the display is as a model for artistic, educational, and
outreach purposes, the display is also being explored as an instrument for the scientific
analysis of IceCube data sets by human observers. The technical approach to designing
the display was to place an emphasis on reproducibility so that it can be readily
built and used by the researchers in the IceCube research community. Evaluation
of the display is being used as a baseline for the development of future exhibits.

PUGCQ: A Large Scale Dataset for Quality Assessment of Professional User-Generated
Content

Guo Li
Baoliang Chen
Lingyu Zhu
Qinwen He
Hongfei Fan
Shiqi Wang

Recent years have witnessed a surge of professional user-generated content (PUGC)
based video services, coinciding with the accelerated proliferation of video acquisition
devices such as mobile phones, wearable cameras, and unmanned aerial vehicles. Unlike
traditional UGC videos, which are shot impromptu, PUGC videos produced by professional
users tend to be carefully designed and edited, and they enjoy high popularity with
relatively satisfactory play counts. In this paper, we conduct a systematic and comprehensive
study of the perceptual quality of PUGC videos and introduce a database consisting
of 10,000 PUGC videos with subjective ratings. In particular, during the subjective
testing, we collect human opinions based not only on the mean opinion score (MOS), but
also on the attributes that could potentially influence the visual quality, including
face, noise, blur, brightness, and color. We analyze the large-scale PUGC database with
a series of video quality assessment (VQA) algorithms and further present a dedicated
baseline model based on a pretrained deep neural network. The cross-dataset experiments
reveal a large domain gap between PUGC and traditional user-generated videos,
which is critical in learning-based VQA. These results shed light on developing next-generation
PUGC quality assessment algorithms with desired properties including promising generalization
capability, high accuracy, and effectiveness in perceptual optimization. The dataset
and the codes are released at https://github.com/wlkdb/pugcq_create.

Combining Attention with Flow for Person Image Synthesis

Yurui Ren
Yubo Wu
Thomas H. Li
Shan Liu
Ge Li

Pose-guided person image synthesis aims to synthesize person images by transforming
reference images into target poses. In this paper, we observe that the commonly used
spatial transformation blocks have complementary advantages. We propose a novel model
by combining the attention operation with the flow-based operation. Our model not
only takes advantage of the attention operation to generate accurate target structures
but also uses the flow-based operation to sample realistic source textures. Both objective
and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive
ablation studies verify our hypotheses and show the efficacy of the proposed modules.
Besides, additional experiments on the portrait image editing task demonstrate the
versatility of the proposed combination.

Dual Learning Music Composition and Dance Choreography

Shuang Wu
Zhenguang Liu
Shijian Lu
Li Cheng

Music and dance have always co-existed as pillars of human activities, contributing
immensely to the cultural, social, and entertainment functions in virtually all societies.
Notwithstanding the gradual systematization of music and dance into two independent
disciplines, their intimate connection is undeniable and one art-form often appears
incomplete without the other. Recent research has studied generative models
for dance sequences conditioned on music. The dual task of composing music for given
dances, however, has been largely overlooked. In this paper, we propose a novel extension,
where we jointly model both tasks in a dual learning approach. To leverage the duality
of the two modalities, we introduce an optimal transport objective to align feature
embeddings, as well as a cycle consistency loss to foster overall consistency. Experimental
results demonstrate that our dual learning framework improves individual task performance,
delivering generated music compositions and dance choreographies that are realistic
and faithful to the conditioned inputs.

SESSION: Session 26: Open Source Competition

MMFashion: An Open-Source Toolbox for Visual Fashion Analysis

Xin Liu
Jiancheng Li
Jiaqi Wang
Ziwei Liu

We present MMFashion, a comprehensive, flexible and user-friendly open-source visual
fashion analysis toolbox based on PyTorch. This toolbox supports a wide spectrum of
fashion analysis tasks, including Fashion Attribute Prediction, Fashion Recognition
and Retrieval, Fashion Landmark Detection, Fashion Parsing and Segmentation and Fashion
Compatibility and Recommendation. It covers almost all the mainstream tasks in the fashion
analysis community. MMFashion has several appealing properties. Firstly, MMFashion
follows the principle of modular design. The framework is decomposed into different
components so that it is easily extensible for diverse customized modules. In addition,
detailed documentation, demo scripts and off-the-shelf models are available, which
make it easier for lay users to leverage the recent advances in deep learning-based
fashion analysis. Our proposed MMFashion is currently the most complete platform for
visual fashion analysis in the deep learning era, with more functionalities to be added.
This toolbox and the benchmark could serve the flourishing research community by providing
a flexible toolkit to deploy existing models and develop new ideas and approaches.
We welcome all contributions to this still-growing efforts towards open science: https://github.com/open-mmlab/mmfashion.

Efficient Reinforcement Learning Development with RLzoo

Zihan Ding
Tianyang Yu
Hongming Zhang
Yanhua Huang
Guo Li
Quancheng Guo
Luo Mai
Hao Dong

Many multimedia developers are exploring the adoption of Deep Reinforcement Learning
(DRL) techniques in their applications. However, they often find such adoption challenging.
Existing DRL libraries provide poor support for prototyping DRL agents (i.e., models),
customising the agents, and comparing the performance of DRL agents. As a result,
the developers often report low efficiency in developing DRL agents. In this paper,
we introduce RLzoo, a new DRL library that aims to make the development of DRL agents
efficient. RLzoo provides developers with (i) high-level yet flexible APIs for prototyping
DRL agents, and further customising the agents for best performance, (ii) a model
zoo where users can import a wide range of DRL agents and easily compare their performance,
and (iii) an algorithm that can automatically construct DRL agents with custom components
(which are critical to improving an agent's performance in custom applications). Evaluation
results show that RLzoo can effectively reduce the development cost of DRL agents,
while achieving comparable performance with existing DRL libraries.

Fast and Flexible Human Pose Estimation with HyperPose

Yixiao Guo
Jiawei Liu
Guo Li
Luo Mai
Hao Dong

Estimating human pose is an important yet challenging task in multimedia applications.
Existing pose estimation libraries target reproducing standard pose estimation algorithms.
When it comes to customising these algorithms for real-world applications, none of
the existing libraries can offer both the flexibility of developing custom pose estimation
algorithms and the high performance of executing these algorithms on commodity devices.
In this paper, we introduce Hyperpose, a novel flexible and high-performance pose
estimation library. Hyperpose provides expressive Python APIs that enable developers
to easily customise pose estimation algorithms for their applications. It further
provides a model inference engine highly optimised for real-time pose estimation.
This engine can dynamically dispatch carefully designed pose estimation tasks to CPUs
and GPUs, thus automatically achieving high utilisation of hardware resources irrespective
of deployment environments. Extensive evaluation results show that Hyperpose can achieve
3.1x to 7.3x higher pose estimation throughput compared to state-of-the-art pose
estimation libraries without compromising estimation accuracy. By 2021, Hyperpose
had received over 1000 stars on GitHub and attracted users from both industry and
academia.

SmartEye: An Open Source Framework for Real-Time Video Analytics with Edge-Cloud Collaboration

Xuezhi Wang
Guanyu Gao

Video analytics with Deep Neural Networks (DNNs) empowers many vision-based applications.
However, deploying DNN models for video analytics services must address the challenges
of computational capacity, service delay, and cost. Leveraging the edge-cloud collaboration
to address these problems has become a growing trend. This paper provides the multimedia
research community with an open source framework named SmartEye for real-time video
analytics by leveraging the edge-cloud collaboration. The system consists of 1) an
edge layer which enables video preprocessing, model selection, on-edge inference,
and task offloading; 2) a request forwarding layer which serves as a gateway of the
cloud and forwards the offloaded tasks to backend workers; and 3) a backend worker
layer that processes the offloaded tasks with specified DNN models. One can easily
customize the policies for preprocessing, offloading, model selection, and request
forwarding. The framework can facilitate research and development in this field. The
project is released as an open source project on GitHub at https://github.com/MSNLAB/SmartEye.

ZoomSense: A Scalable Infrastructure for Augmenting Zoom

Tom Bartindale
Peter Chen
Harrison Marshall
Stanislav Pozdniakov
Dan Richardson

We have seen a dramatic increase in the adoption of teleconferencing systems such
as Zoom for remote teaching and working. Although designed primarily for traditional
video conferencing scenarios, these platforms are actually being deployed in many
diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee
participation and often hinders participant agency. We introduce ZoomSense: an open-source,
scalable infrastructure built upon 'virtual meeting participants', which exposes real-time
metadata, meeting content and host controls through an easy-to-use abstraction,
so that developers can rapidly and sustainably augment Zoom.

Efficient Graph Deep Learning in TensorFlow with tf_geometric

Jun Hu
Shengsheng Qian
Quan Fang
Youze Wang
Quan Zhao
Huaiwen Zhang
Changsheng Xu

We introduce tf_geometric, an efficient and friendly library for graph deep learning,
which is compatible with both TensorFlow 1.x and 2.x. It provides kernel libraries
for building Graph Neural Networks (GNNs) as well as implementations of popular GNNs.
The kernel libraries consist of infrastructures for building efficient GNNs, including
graph data structures, graph map-reduce framework, graph mini-batch strategy, etc.
These infrastructures enable tf_geometric to support single-graph computation, multi-graph
computation, graph mini-batch, distributed training, etc.; therefore, tf_geometric
can be used for a variety of graph deep learning tasks, such as node classification,
link prediction, and graph classification. Based on the kernel libraries, tf_geometric
implements a variety of popular GNN models. To facilitate the implementation of GNNs,
tf_geometric also provides some other libraries for dataset management, graph sampling,
etc. Different from existing popular GNN libraries, tf_geometric provides not only
Object-Oriented Programming (OOP) APIs, but also Functional APIs, which enable tf_geometric
to handle advanced tasks such as graph meta-learning. The APIs are friendly and suitable
for both beginners and experts.

FaceX-Zoo: A PyTorch Toolbox for Face Recognition

Jun Wang
Yinglu Liu
Yibo Hu
Hailin Shi
Tao Mei

Due to the remarkable progress in recent years, deep face recognition is in great
need of public support for practical model production and further exploration. The
demands are threefold: 1) a modular training scheme, 2) standard and
automatic evaluation, and 3) groundwork for deployment. To meet these demands, we present
a novel open-source project, named FaceX-Zoo, which is constructed with modular and
scalable design, and oriented to the academic and industrial community of face-related
analysis. FaceX-Zoo provides 1) the training module with various choices of backbone
and supervisory head; 2) the evaluation module that enables standard and automatic
test on most popular benchmarks; 3) the module of simple yet fully functional face
SDK for the validation and primary application of end-to-end face recognition; 4)
the additional module that integrates a group of useful tools. Based on these easy-to-use
modules, FaceX-Zoo can help the community to easily build state-of-the-art solutions
for deep face recognition, including for the newly emerged challenge of masked face
recognition caused by the worldwide COVID-19 pandemic. Besides, FaceX-Zoo can be easily
upgraded and scaled up along with further exploration in face-related fields. The
source code and models have been released and received over 900 stars at https://github.com/JDAI-CV/FaceX-Zoo.

PyTorchVideo: A Deep Learning Library for Video Understanding

Haoqi Fan
Tullie Murrell
Heng Wang
Kalyan Vasudev Alwala
Yanghao Li
Yilei Li
Bo Xiong
Nikhila Ravi
Meng Li
Haichuan Yang
Jitendra Malik
Ross Girshick
Matt Feiszli
Aaron Adcock
Wan-Yen Lo
Christoph Feichtenhofer

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich
set of modular, efficient, and reproducible components for a variety of video understanding
tasks, including classification, detection, self-supervised learning, and low-level
processing. The library covers a full stack of video understanding tools including
multimodal data loading, transformations, and models that reproduce state-of-the-art
performance. PyTorchVideo further supports hardware acceleration that enables real-time
inference on mobile devices. The library is based on PyTorch and can be used by any
training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo
is available at https://pytorchvideo.org/.

AICoacher: A System Framework for Online Realtime Workout Coach

Haocong Ying
Tie Liu
Mingxin Ai
Jiali Ding
Yuanyuan Shang

There is a growing demand for online fitness due to the impact of the epidemic. This
paper presents a real-time online fitness system framework called AICoacher, which
offers different online coaches. The framework constructs an extensible AI-based architecture
that supports a variety of fitness movements. Firstly, key frames of motion are extracted
automatically, and the feature vectors are calculated with the body pose points. Secondly,
the state transition matrix can effectively identify fitness actions and capture their
time-continuous characteristics. Finally, AICoacher can accurately provide the number
of repetitions and correction tips of fitness movements. Currently, the AICoacher
has a number of fitness courses supported by online coaches and has been tested on
hundreds of fitness movements. The code can be downloaded from https://github.com/liutiel/AICoacher.
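
The repetition-counting step described above can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the AICoacher implementation: the state names ("stand", "squat", "other"), the `ALLOWED` transition table, and the function `count_repetitions` are all hypothetical, and the per-frame states are assumed to have already been derived from pose key points.

```python
# Hypothetical sketch: count repetitions from per-frame pose states.
# State names and the transition table are illustrative assumptions,
# not taken from the AICoacher code base.

# Allowed transitions for a squat-like movement: stand -> squat -> stand.
ALLOWED = {
    ("stand", "squat"),
    ("squat", "stand"),
}


def count_repetitions(states):
    """Count completed stand -> squat -> stand cycles in a state sequence."""
    reps = 0
    current = None
    went_down = False
    for state in states:
        if state == current:
            continue  # the same key state persists across frames
        if current is not None and (current, state) not in ALLOWED:
            went_down = False  # unexpected jump: discard the partial cycle
        elif (current, state) == ("stand", "squat"):
            went_down = True
        elif (current, state) == ("squat", "stand") and went_down:
            reps += 1
            went_down = False
        current = state
    return reps


frames = ["stand", "stand", "squat", "squat", "stand",
          "squat", "stand", "other", "stand"]
print(count_repetitions(frames))
```

Only transitions listed in the table advance the cycle, so a frame sequence that leaves the expected movement (here, the "other" state) resets the partial repetition instead of being counted.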

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

Zhanghui Kuang
Hongbin Sun
Zhizhong Li
Xiaoyu Yue
Tsui Hin Lin
Jianyong Chen
Huaqiang Wei
Yiqin Zhu
Tong Gao
Wenwei Zhang
Kai Chen
Wayne Zhang
Dahua Lin

We present MMOCR, an open-source toolbox which provides a comprehensive pipeline
for text detection and recognition, as well as their downstream tasks such as named
entity recognition and key information extraction. MMOCR implements 14 state-of-the-art
algorithms, which is significantly more than all the existing open-source OCR projects
we are aware of to date. To facilitate future research and industrial applications
of text recognition-related problems, we also provide a large number of trained models
and detailed benchmarks to give insights into the performance of text detection, recognition
and understanding. MMOCR is publicly released at https://github.com/open-mmlab/mmocr.

A Complete End to End Open Source Toolchain for the Versatile Video Coding (VVC) Standard

Adam Wieckowski
Christian Lehmann
Benjamin Bross
Detlev Marpe
Thibaud Biatek
Mikael Raulet
Jean Le Feuvre

Versatile Video Coding (VVC) is the most recent international video coding standard
jointly developed by ITU-T and ISO/IEC, which was finalized in July 2020. VVC
allows for significant bit-rate reductions of around 50% for the same subjective video
quality compared to its predecessor, High Efficiency Video Coding (HEVC). One year
after finalization, VVC support in devices and chipsets is still under development,
which is aligned with the typical development cycles of new video coding standards.
This paper presents open-source software packages that allow building a complete VVC
end-to-end toolchain already one year after its finalization. This includes the Fraunhofer
HHI VVenC library for fast and efficient VVC encoding as well as HHI's VVdeC library
for live decoding. An experimental integration of VVC in the GPAC software tools and
FFmpeg media framework allows packaging VVC bitstreams, e.g. encoded with VVenC, in
MP4 file format and using DASH for content creation and streaming. The integration
of VVdeC allows playback on the receiver. Given these packages, step-by-step tutorials
are provided for two possible application scenarios: VVC file encoding plus playback
and adaptive streaming with DASH.

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Yehao Li
Yingwei Pan
Jingwen Chen
Ting Yao
Tao Mei

With the rise and development of deep learning over the past decade, there has been
a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art
of cross-modal analytics between vision and language in the multimedia field. Nevertheless,
there has not been an open-source codebase in support of training and deploying numerous
neural network models for cross-modal analytics in a unified and modular fashion.
In this work, we propose X-modaler, a versatile and high-performance codebase that
encapsulates the state-of-the-art cross-modal analytics into several general-purpose
stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode
strategy). Each stage is empowered with the functionality that covers a series of
modules widely adopted in state-of-the-art methods and allows seamless switching between them.
This naturally enables a flexible implementation of state-of-the-art algorithms
for image captioning, video captioning, and vision-language pre-training, aiming to
facilitate the rapid development of the research community. Meanwhile, since the effective
modular designs in several stages (e.g., cross-modal interaction) are shared across
different vision-language tasks, X-modaler can be simply extended to power startup
prototypes for other tasks in cross-modal analytics, including visual question answering,
visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed
codebase, and its source code, sample projects and pre-trained models are available
online: https://github.com/YehLi/xmodaler.

Interpreting Super-Resolution CNNs for Sub-Pixel Motion Compensation in Video Coding

Luka Murn
Alan F. Smeaton
Marta Mrak

Machine learning approaches for more efficient video compression have been developed
thanks to breakthroughs in deep learning. However, they typically bring coding improvements
at the cost of significant increases in computational complexity, making them largely
unsuitable for practical applications. In this paper, we present open-source software
for convolutional neural network-based solutions which improve the interpolation of
reference samples needed for fractional precision motion compensation. Contrary to
previous efforts, the networks are fully linear, allowing them to be interpreted,
with a full interpolation filter set derived from trained models, making it simple
to integrate in conventional video coding schemes. When implemented in the context
of the state-of-the-art Versatile Video Coding (VVC) test model, the complexity of
the learned interpolation schemes is significantly reduced compared to the interpolation
with full neural networks, while achieving notable coding efficiency improvements
on lower resolution video sequences. The open-source software package is available
at https://github.com/bbc/cnn-fractional-motion-compensation under the 3-clause BSD
license.
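
The reason a fully linear network can be reduced to an interpolation filter set can be shown with a small worked example: stacking convolution layers with no activation and no bias is equivalent to a single convolution whose kernel is the convolution of the layer kernels. The sketch below is one-dimensional and uses made-up kernel values (`layer1`, `layer2`); it illustrates the general principle only and is not code from the released package.

```python
def conv1d_full(signal, kernel):
    """Full 1-D discrete convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Two hypothetical linear layers (no activation, no bias).
layer1 = [0.25, 0.5, 0.25]
layer2 = [-0.5, 2.0, -0.5]

# Collapse both layers into one equivalent filter.
collapsed = conv1d_full(layer1, layer2)

samples = [1.0, 4.0, 2.0, 8.0, 5.0]
two_layers = conv1d_full(conv1d_full(samples, layer1), layer2)
one_filter = conv1d_full(samples, collapsed)

# Both paths give the same output, up to floating-point error.
assert all(abs(a - b) < 1e-9 for a, b in zip(two_layers, one_filter))
print(collapsed)
```

Because the collapsed filter is an ordinary set of coefficients, it can be inspected directly and applied at the cost of a single filtering pass, which is the property the abstract relies on.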

SESSION: Session 27: Multimedia Search and Recommendation-I

Towards Accurate Localization by Instance Search

Yi-Geng Hong
Hui-Chu Xiao
Wan-Lei Zhao

Visual object localization is the key step in a series of object detection tasks.
In the literature, high localization accuracy is achieved with the mainstream strongly
supervised frameworks. However, such methods require object-level annotations and
are unable to detect objects of unknown categories. Weakly supervised methods face
similar difficulties. In this paper, a self-paced learning framework is proposed to
achieve accurate object localization on the rank list returned by instance search.
The proposed framework mines the target instance gradually from the queries and their
corresponding top-ranked search results. Since a common instance is shared between
the query and the images in the rank list, the target visual instance can be accurately
localized even without knowing what the object category is. In addition to performing
localization on instance search, the issue of few-shot object detection is also addressed
under the same framework. Superior performance over state-of-the-art methods is observed
on both tasks.

Database-adaptive Re-ranking for Enhancing Cross-modal Image Retrieval

Rintaro Yanagi
Ren Togo
Takahiro Ogawa
Miki Haseyama

We propose an approach that enhances arbitrary existing cross-modal image retrieval
performance. Most of the cross-modal image retrieval methods mainly focus on direct
computation of similarities between a text query and candidate images in an accurate
way. However, their retrieval performance is affected by the ambiguity of text queries
and the bias of target databases (DBs). Dealing with ambiguous text queries and DBs
with bias will lead to accurate cross-modal image retrieval in real-world applications.
A DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary
cross-modal image retrieval methods for enhancing their performance, is proposed in
this paper. The proposed method includes two approaches: "DB-adaptive re-ranking"
and "modality-driven clue information extraction". Our method estimates clue information
that can effectively clarify the desired image from the whole set of a target DB and
then receives the user's feedback for the estimated information. Furthermore, our method
extracts more detailed information of a query text and a target DB by focusing on
modality-driven spaces, and it enables more accurate re-ranking. Our method allows
users to reach their desired single image by just answering questions. Experimental
results using MSCOCO, Visual Genome and newly introduced datasets including images
with a particular object show that the proposed method can enhance the performance
of state-of-the-art cross-modal image retrieval methods.

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Ning Han
Jingjing Chen
Guangyi Xiao
Hao Zhang
Yawen Zeng
Hao Chen

Despite the recent progress of cross-modal text-to-video retrieval techniques, their
performance is still unsatisfactory. Most existing works follow a trend of learning
a joint embedding space to measure the distance between global-level or local-level
textual and video representation. The fine-grained interactions between video segments
and phrases are usually neglected in cross-modal learning, which results in suboptimal
retrieval performance. To tackle the problem, we propose a novel Fine-grained Cross-modal
Alignment Network (FCA-Net), which considers the interactions between visual semantic
units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal
alignment. Specifically, the interactions between visual semantic units and phrases
are formulated as a link prediction problem optimized by a graph auto-encoder to obtain
the explicit relations between them and enhance the aligned feature representation
for fine-grained cross-modal alignment. Experimental results on MSR-VTT, YouCook2,
and VATEX datasets demonstrate the superiority of our model as compared to state-of-the-art
methods.

Meta Self-Paced Learning for Cross-Modal Matching

Jiwei Wei
Xing Xu
Zheng Wang
Guoqing Wang

Cross-modal matching has attracted growing attention due to the rapid emergence of
the multimedia data on the web and social applications. Recently, many re-weighting
methods have been proposed for accelerating model training by designing a mapping
function from similarity scores to weights. However, these re-weighting methods are
difficult to apply universally in practice since manually pre-set weighting functions
inevitably involve hyper-parameters. In this paper, we propose a Meta Self-Paced Network
(Meta-SPN) that automatically learns a weighting scheme from data for cross-modal
matching. Specifically, a meta self-paced network composed of a fully connected neural
network is designed to fit the weight function, which takes the similarity score of
the sample pairs as input and outputs the corresponding weight value. Our meta self-paced
network considers not only the self-similarity scores, but also their potential interactions
(e.g., relative-similarity) when learning the weights. Motivated by the success of
meta-learning, we use the validation set to update the meta self-paced network during
the training of the matching network. Experiments on two image-text matching benchmarks
and two video-text matching benchmarks demonstrate the generalization and effectiveness
of our method.
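
The idea of replacing a hand-designed weighting function with a small fully connected network can be sketched as follows. This is a simplified illustration, not the Meta-SPN implementation: the function names, the one-hidden-layer shape, and the parameter values are assumptions; the sketch takes only a single similarity score as input (the relative-similarity input mentioned above is omitted); and the meta-update on the validation set is not shown.

```python
import math


def weight_net(similarity, params):
    """One-hidden-layer MLP mapping a similarity score to a weight in (0, 1)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, w * similarity + b) for w, b in zip(w1, b1)]  # ReLU
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def weighted_loss(pair_losses, similarities, params):
    """Average of per-pair losses, each scaled by its learned weight."""
    weights = [weight_net(s, params) for s in similarities]
    return sum(w * l for w, l in zip(weights, pair_losses)) / len(pair_losses)


# Hypothetical parameters; in a meta-learning setup these would be updated
# using validation data while the matching network is trained.
params = ([1.0, -1.0], [0.0, 0.5], [2.0, -1.0], -0.5)

print(weight_net(0.2, params), weight_net(0.9, params))
print(weighted_loss([1.0, 0.4], [0.2, 0.9], params))
```

The matching network would minimise `weighted_loss`, so the learned weights decide how strongly each sample pair contributes to training.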

CausalRec: Causal Inference for Visual Debiasing in Visually-Aware Recommendation

Ruihong Qiu
Sen Wang
Zhi Chen
Hongzhi Yin
Zi Huang

Visually-aware recommendation on E-commerce platforms aims to leverage visual information
of items to predict a user's preference for these items in addition to the historical
user-item interaction records. It is commonly observed that a user's attention to visual
features does not always reflect the real preference. Although a user may click and
view an item because it visually satisfies their expectations, a real purchase
does not always occur because other essential features (e.g.,
brand, material, price) are unsatisfactory. We refer to the reason for such a visually related interaction
deviating from the real preference as a visual bias. Existing visually-aware models
make use of the visual features as a separate collaborative signal similarly to other
features to directly predict the user's preference without considering a potential
bias, which gives rise to a visually biased recommendation. In this paper, we derive
a causal graph to identify and analyze the visual bias of these existing methods.
In this causal graph, the visual feature of an item acts as a mediator, which could
introduce a spurious relationship between the user and the item. To eliminate this
spurious relationship that misleads the prediction of the user's real preference,
an intervention and a counterfactual inference are developed over the mediator. Particularly,
the Total Indirect Effect is applied for a debiased prediction during the testing
phase of the model. This causal inference framework is model agnostic such that it
can be integrated into the existing methods. Furthermore, we propose a debiased visually-aware
recommender system, denoted as CausalRec, to effectively retain the supportive significance
of the visual information and remove the visual bias. Extensive experiments are conducted
on eight benchmark datasets, which show the state-of-the-art performance of CausalRec
and the efficacy of debiasing.

Semi-supervised Domain Adaptive Retrieval via Discriminative Hashing Learning

Haifeng Xia
Taotao Jing
Chen Chen
Zhengming Ding

Domain adaptive image retrieval (DAR) aims to train the model with well-labeled source
domain and target images in order to retrieve source instances given query target
samples from the identical category space. However, in practice, manually annotating
all retrieved images is hindered by the huge labeling cost. Motivated by
this realistic demand, we first define the semi-supervised domain adaptive retrieval
(SDAR) problem, assuming the database includes a small proportion of annotated source
images and abundant unlabeled ones. To address the challenging SDAR problem, this paper proposes
a novel method named Discriminative Hashing learning (DHLing), which mainly includes
two modules, i.e., domain-specific optimization and domain-invariant memory bank.
Specifically, the first component explores the structural knowledge of samples to
predict the unlabeled images with pseudo labels to achieve hash coding consistency.
The second one attempts to construct the domain-invariant memory bank to guide
the feature generation and achieve cross-domain alignment. Experimental results on
several popular cross-domain retrieval benchmarks illustrate the effectiveness of
our proposed DHLing on both conventional DAR and new SDAR scenarios by comparing with
the state-of-the-art retrieval methods.

SESSION: Session 28: Multimedia Search and Recommendation-II

Hierarchical View Predictor: Unsupervised 3D Global Feature Learning through Hierarchical
Prediction among Unordered Views

Zhizhong Han
Xiyang Wang
Yu-Shen Liu
Matthias Zwicker

Unsupervised learning of global features for 3D shape analysis is an important research
challenge because it avoids manual effort for supervised information collection. In
this paper, we propose a view-based deep learning model called Hierarchical View Predictor
(HVP) to learn 3D shape features from unordered views in an unsupervised manner. To
mine highly discriminative information from unordered views, HVP performs a novel
hierarchical view prediction over a view pair, and aggregates the knowledge learned
from the predictions in all view pairs into a global feature. In a view pair, we pose
hierarchical view prediction as the task of hierarchically predicting a set of image
patches in a current view from its complementary set of patches, and in addition,
completing the current view and its opposite from any one of the two sets of patches.
Hierarchical prediction, in patches to patches, patches to view and view to view,
helps HVP to effectively learn the structure of 3D shapes from the correlation
between patches in the same view and the correlation between a pair of complementary
views. In addition, the employed implicit aggregation over all view pairs enables
HVP to learn global features from unordered views. Our results show that HVP can outperform
state-of-the-art methods under large-scale 3D shape benchmarks in shape classification
and retrieval.

Mining Latent Structures for Multimedia Recommendation

Jinghao Zhang
Yanqiao Zhu
Qiang Liu
Shu Wu
Shuhui Wang
Liang Wang

Multimedia content is predominant in the modern Web era. Investigating how users
interact with multimodal items is a continuing concern within the rapid development
of recommender systems. The majority of previous work focuses on modeling user-item
interactions with multimodal features included as side information. However, this
scheme is not well-designed for multimedia recommendation. Specifically, only collaborative
item-item relationships are implicitly modeled through high-order item-user-item relations.
Considering that items are associated with rich contents in multiple modalities, we
argue that the latent semantic item-item structures underlying these multimodal contents
could be beneficial for learning better item representations and further boosting
recommendation. To this end, we propose a LATent sTructure mining method for multImodal
reCommEndation, which we term LATTICE for brevity. To be specific, in the proposed
LATTICE model, we devise a novel modality-aware structure learning layer, which learns
item-item structures for each modality and aggregates multiple modalities to obtain
latent item graphs. Based on the learned latent graphs, we perform graph convolutions
to explicitly inject high-order item affinities into item representations. These enriched
item representations can then be plugged into existing collaborative filtering methods
to make more accurate recommendations. Extensive experiments on three real-world datasets
demonstrate the superiority of our method over state-of-the-art multimedia recommendation
methods and validate the efficacy of mining latent item-item relationships from multimodal
features.
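
The pipeline described above (a per-modality item-item graph, aggregation across modalities, then graph convolution over item representations) can be sketched in a few functions. This is a minimal illustration under assumptions, not the LATTICE implementation: the use of cosine similarity with k-nearest-neighbour sparsification, the fixed modality weights, and all function names and toy feature values are hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(features, k):
    """Row-normalised kNN item-item graph built from one modality's features."""
    n = len(features)
    graph = []
    for i in range(n):
        sims = [cosine(features[i], features[j]) if j != i else -1.0
                for j in range(n)]
        keep = sorted(range(n), key=lambda j: sims[j], reverse=True)[:k]
        row = [max(sims[j], 0.0) if j in keep else 0.0 for j in range(n)]
        total = sum(row)
        graph.append([x / total if total else 0.0 for x in row])
    return graph


def fuse(graphs, weights):
    """Weighted sum of per-modality graphs into one latent item graph."""
    n = len(graphs[0])
    return [[sum(w * g[i][j] for g, w in zip(graphs, weights))
             for j in range(n)] for i in range(n)]


def propagate(graph, embeddings):
    """One graph-convolution step: each item averages its neighbours."""
    dim = len(embeddings[0])
    return [[sum(graph[i][j] * embeddings[j][d] for j in range(len(embeddings)))
             for d in range(dim)] for i in range(len(graph))]


visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy visual features
textual = [[1.0, 1.0], [1.0, 0.9], [0.0, 1.0]]  # toy textual features
latent = fuse([knn_graph(visual, 1), knn_graph(textual, 1)], [0.5, 0.5])
item_emb = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(propagate(latent, item_emb))
```

The propagated embeddings could then be combined with the ID embeddings of any collaborative filtering model, which is the sense in which the abstract describes the enriched representations as pluggable.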

Why Do We Click: Visual Impression-aware News Recommendation

Jiahao Xun
Shengyu Zhang
Zhou Zhao
Jieming Zhu
Qi Zhang
Jingjie Li
Xiuqiang He
Xiaofei He
Tat-Seng Chua
Fei Wu

There is a soaring interest in the news recommendation research scenario due to the
information overload. To accurately capture users' interests, we propose to model
multi-modal features, in addition to the news titles that are widely used in existing
works, for news recommendation. Besides, existing research pays little attention to
the click decision-making process in designing multi-modal modeling modules. In this
work, inspired by the fact that users make their click decisions mostly based on the
visual impression they perceive when browsing news, we propose to capture such visual
impression information with visual-semantic modeling for news recommendation. In this
paper, we refer to visual impression as the region of the news displayed on the user
interface of a news application, which delivers both content and layout information
to users. Specifically, we devise the local impression modeling module to simultaneously
attend to decomposed details in the impression when understanding the semantic meaning
of the news title, which could explicitly approximate the process of users reading news.
In addition, we inspect the impression from a global view and take structural information,
such as the arrangement of different fields and spatial position of different words
on the impression, into the modeling of multiple modalities. To accommodate the research
of visual impression-aware news recommendation, we extend the text-dominated news
recommendation dataset MIND by adding snapshot impression images and will release
it to nourish the research field. Extensive comparisons with the state-of-the-art
news recommenders along with the in-depth analyses demonstrate the effectiveness of
the proposed method and the promising capability of modeling visual impressions for
the content-based recommenders.

Identity-Preserving Face Anonymization via Adaptively Facial Attributes Obfuscation

Jingzhi Li
Lutong Han
Ruoyu Chen
Hua Zhang
Bing Han
Lili Wang
Xiaochun Cao

With the popularity of computer vision technology in monitoring systems, there
is an increasing societal concern about intrusions on people's privacy, as the captured images/videos
may contain identity-related information, e.g., people's faces. Existing methods for protecting
such privacy focus on removing the identity-related information from faces. However,
this would weaken the utility of current monitoring systems. In this paper, we develop
a face anonymization framework that could obfuscate visual appearance while preserving
the identity discriminability. The framework is composed of two parts: an identity-aware
region discovery module and an identity-aware face confusion module. The former adaptively
locates the identity-independent attributes on human faces, and the latter generates
the privacy-preserving faces using original faces and discovered facial attributes.
To optimize the face generator, we employ a multi-task based loss function, which
consists of discriminator loss, identity-preserving loss, and reconstruction loss
functions. Our model can achieve a balance between recognition utility and appearance
anonymization by modifying different numbers of facial attributes according to practical
demands, and provide a variety of results. Extensive experiments conducted on two
public benchmarks Celeb-A and VGG-Face2 demonstrate the effectiveness of our model
under distinct face recognition scenarios.

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

Zhijian Hou
Chong-Wah Ngo
W. K. Chan

This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task
is essential because advanced video retrieval applications should enable users to
retrieve a precise moment from a large video corpus. We propose a novel CONtextual
QUery-awarE Ranking (CONQUER) model for effective moment localization and ranking.
CONQUER explores query context for multi-modal fusion and representation learning
in two different steps. The first step derives fusion weights for the adaptive combination
of multi-modal video content. The second step performs bi-directional attention to
tightly couple video and query as a single joint representation for moment localization.
As query context is fully engaged in video representation learning, from feature fusion
to transformation, the resulting feature is user-centered and has a larger capacity
for capturing multi-modal signals specific to the query. We conduct studies on two datasets,
TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos,
to investigate the potential advantages of fusing video and query online as a joint
representation for moment retrieval.

Learning Unified Embeddings for Recommendation via Meta-path Semantics

Qianxiu Hao
Qianqian Xu
Zhiyong Yang
Qingming Huang

Heterogeneous information networks (HINs) have become a popular tool to capture complicated
user-item relationships in recommendation problems in recent years. As a typical instantiation
of HINs, meta-path is introduced in search of higher-level representations of user-item
interactions. Though remarkable success has been achieved along this direction, existing
meta-path-based recommendation methods face at least one of the following issues:
1) existing methods merely adopt simple meta-path fusion rules, which might be insufficient
to exclude inconsistent information of different meta-paths that may hurt model performance;
2) the representative power is limited by shallow/stage-wise formulations. To solve
these issues, we propose an end-to-end and unified embedding-based recommendation
framework with graph-based learning. To address 1), we propose a flexible fusion module
to integrate meta-path-based similarities into relative similarities between users
and items. To address 2), we take advantage of the powerful representative ability
of deep neural networks to learn more complicated and flexible latent embeddings.
Finally, empirical studies on real-world datasets demonstrate the effectiveness of
our proposed method.

SESSION: Session 29: Music, Speech and Audio Processing in Multimedia

ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource
Real-World Data

Kin Wai Cheuk
Dorien Herremans
Li Su

Most of the current supervised automatic music transcription (AMT) models lack the
ability to generalize. This means that they have trouble transcribing real-world music
recordings from diverse musical genres that are not presented in the labelled training
data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves
this issue by leveraging the huge amount of available unlabelled music recordings.
The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When
combined with existing U-net models for AMT, ReconVAT achieves competitive results
on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot
setting for the string part version of MusicNet, ReconVAT achieves F1-scores of 61.0%
and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which
translates into an improvement of 22.2% and 62.5% compared to the supervised baseline
model. Our proposed framework also demonstrates the potential of continual learning
on new data, which could be useful in real-world applications whereby new data is
constantly available.

Is Someone Speaking?: Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Ruijie Tao
Zexu Pan
Rohan Kumar Das
Xinyuan Qian
Mike Zheng Shou
Haizhou Li

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of
one or more speakers. Successful ASD depends on accurate interpretation of short-term
and long-term audio and visual information, as well as audio-visual interaction. Unlike
prior work, where systems make decisions instantaneously using short-term features,
we propose a novel framework, named TalkNet, that makes decisions by taking both short-term
and long-term features into consideration. TalkNet consists of audio and visual temporal
encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality
interaction, and a self-attention mechanism to capture long-term speaking evidence.
The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the
state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset,
respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.
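
As an illustrative aside (not taken from the TalkNet code base): audio-visual cross-attention
of the kind named above is commonly realized as scaled dot-product attention in which one
modality supplies the queries and the other supplies the keys and values. The NumPy sketch
below assumes that form. The function names, the frame count, the feature dimension, and
the omission of learned projection matrices are all simplifications chosen here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over all key frames.

    Assumption: no learned projections, single head. Shapes are (frames, dim).
    """
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)   # (query_frames, key_frames)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights

# Hypothetical per-frame embeddings: 25 frames, 16 dimensions per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((25, 16))
visual = rng.standard_normal((25, 16))

# Audio frames query the visual stream, and vice versa.
audio_attended, a2v_weights = cross_attention(audio, visual, visual)
visual_attended, v2a_weights = cross_attention(visual, audio, audio)
```

Each output row is a weighted mixture of the other modality's frames, which is one way for
evidence from one stream to inform the other before a later temporal model is applied.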

Actions Speak Louder than Listening: Evaluating Music Style Transfer based on Editing Experience

Wei-Tsung Lu
Meng-Hsuan Wu
Yuh-Ming Chiu
Li Su

The subjective evaluation of music generation techniques has been mostly done with
questionnaire-based listening tests while ignoring the perspectives from music composition,
arrangement, and soundtrack editing. In this paper, we propose an editing test to
evaluate users' editing experience of music generation models in a systematic way.
To do this, we design a new music style transfer model combining the non-chronological
inference architecture, autoregressive models and the Transformer, which serves as
an improvement from the baseline model on the same style transfer task. Then, we compare
the performance of the two models with a conventional listening test and the proposed
editing test, in which the quality of generated samples is assessed by the amount
of effort (e.g., the number of required keyboard and mouse actions) spent by users
to polish a music clip. Results on two target styles indicate that the improvement
over the baseline model can be reflected by the editing test quantitatively. Also,
the editing test provides profound insights which are not accessible from usual listening
tests. The major contribution of this paper is the systematic presentation of the
editing test and the corresponding insights, while the proposed music style transfer
model based on state-of-the-art neural networks represents another contribution.

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

Rongjie Huang
Feiyang Chen
Yi Ren
Jinglin Liu
Chenye Cui
Zhou Zhao

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders
due to the shortage of singing voice data, limited singer generalization, and large computational
cost. Existing open corpora cannot meet the requirements of high-fidelity singing
voice synthesis because of their limited scale and quality. Previous vocoders have
difficulty in multi-singer modeling, and a distinct degradation emerges when generating
singing voices for unseen singers. To accelerate singing voice research in
the community, we release a large-scale, multi-singer Chinese singing voice dataset
OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer,
a fast multi-singer vocoder with generative adversarial networks. Specifically, 1)
Multi-Singer uses a multi-band generator to speed up both training and inference;
2) to capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram),
Multi-Singer adopts a singer conditional discriminator and conditional adversarial
training objective; and 3) to supervise the reconstruction of singer identity in the spectrum
envelopes in the frequency domain, we propose an auxiliary singer perceptual loss. The
joint training approach works effectively in GANs for multi-singer voice modeling.
Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer
improves unseen-singer singing voice modeling in both speed and quality over previous
methods. A further experiment shows that, combined with FastSpeech 2 as the acoustic
model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis
pipeline.

MusicBERT: A Self-supervised Learning of Music Representation

Hongyuan Zhu
Ye Niu
Di Fu
Hao Wang

Music recommendation has been one of the most used information retrieval services
on the internet. Finding suitable music for users' demands from tens of millions of tracks
relies on the understanding of music content. Traditional studies usually focus on
music representation based on massive user behavioral data and music meta-data, which
ignore the audio characteristics of music. However, it is found that the melodic characteristics
of music themselves can be further used to understand music. Moreover, how to utilize
large-scale audio data to learn music representation is not well explored. To this
end, we propose a self-supervised learning model for music representation. We first
utilize a beat-level music pre-training model to learn the structure of music. Then,
we use a multi-task learning framework to model music self-representation and co-relations
between music tracks concurrently. In addition, we propose several downstream tasks to evaluate
music representation, including music genre classification, music highlight, and music
similarity retrieval. Extensive experiments on multiple music datasets demonstrate
our model's superiority over baselines on learning music representation.

UniCon: Unified Context Network for Robust Active Speaker Detection

Yuanhang Zhang
Susan Liang
Shuang Yang
Xiao Liu
Zhongqin Wu
Shiguang Shan
Xilin Chen

We propose a new efficient framework, the Unified Context Network (UniCon), for robust
active speaker detection (ASD). Traditional methods for ASD usually operate on each
candidate's pre-cropped face track separately and do not sufficiently consider the
relationships among the candidates. This potentially limits performance, especially
in challenging scenarios with low-resolution faces, multiple candidates, etc. Our
solution is a novel, unified framework that focuses on jointly modeling multiple types
of contextual information: spatial context to indicate the position and scale of each
candidate's face, relational context to capture the visual relationships among the
candidates and contrast audio-visual affinities with each other, and temporal context
to aggregate long-term information and smooth out local uncertainties. Based on such
information, our model optimizes all candidates in a unified process for robust and
reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks
under different settings. In particular, our method outperforms the state-of-the-art
by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging
subsets: one with three candidate speakers, and the other with faces smaller than
64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation
set, surpassing 90% for the first time on this challenging dataset at the time of
submission. Project website: https://unicon-asd.github.io/.

SESSION: Session 30: Multimedia Transport and Delivery

AITransfer: Progressive AI-powered Transmission for Real-Time Point Cloud Video Streaming

Yakun Huang
Yuanwei Zhu
Xiuquan Qiao
Zhijie Tan
Boyuan Bai

Point cloud video provides a more immersive holographic virtual experience than conventional
video services such as 360-degree video and virtual reality (VR) video. However,
existing network bandwidth and transmission technology cannot carry real-time point
cloud video streaming due to massive data volume, high processing overhead, and extreme
bandwidth consumption. Unlike previous approaches that extend VR video streaming,
we propose AITransfer, an AI-powered bandwidth-aware and adaptive transmission technique
driven by extracting and transferring key point cloud features to reduce the bandwidth
consumption and alleviate the computational pressure. AITransfer makes two main
contributions: (1) incorporating the dynamic network bandwidth into the
design of an end-to-end architecture with two fundamental components, feature extraction
and reconstruction, and (2) employing an online adapter to sense the network bandwidth
and match the optimal inference model. We conduct extensive experiments on a typical
dataset and develop a case study to demonstrate its efficiency and effectiveness.
The results show that AITransfer can provide a compression ratio of more than 30.72 times
under the existing network environments.

Game Theory-driven Rate Control for 360-Degree Video Coding

Tiesong Zhao
Jielian Lin
Yanjie Song
Xu Wang
Yuzhen Niu

The 360-degree video (omnidirectional video) has become popular recently due to its
capability of providing immersive experience, which is generally achieved via spherical
moving pictures with freedom of viewpoint changing. Nevertheless, the support of full-view
visual contents has inevitably reshaped its perceptual quality metric and dramatically
increased its bitrate output after video coding. Therefore in 360-degree video coding,
the Rate Control (RC) problem, which aims to maximize the resulting perceptual quality
under bitrate constraint, has become a challenging task yet to be addressed. In this
paper, we observe a latitude-based bitrate discrepancy in equirectangular-projected
360-degree video coding and further utilize this feature in bitrate allocation under
panoramic vision. We introduce game theory to find optimal inter/intra-frame bit allocations
that maximize the overall RC performance in terms of utility function. Finally, an
overall framework is proposed that is capable of providing both an improved bitrate
accuracy and an enhanced perceptual quality. Experimental results demonstrate the
efficiency of the proposed method, with promising RC performance for 4K and 8K 360-degree
videos.

TBRA: Tiling and Bitrate Adaptation for Mobile 360-Degree Video Streaming

Lei Zhang
Yanyan Suo
Ximing Wu
Feng Wang
Yuchi Chen
Laizhong Cui
Jiangchuan Liu
Zhong Ming

The tile-based approach is widely adopted in adaptive 360-degree video streaming systems.
Existing QoE-driven streaming approaches usually obtain the tile selection and adjust
the bitrate based on the viewport prediction with a fixed tiling, which fails to consider
the unstable prediction performance. However, varying the tiling of the video can
produce different numbers of tiles with different sizes, and thus can have distinct
impacts on error tolerance for viewport prediction and on decoding complexity for
resource-constrained mobile clients. In this work, we introduce adaptive tiling into
the conventional bitrate adaptation for mobile 360-degree video streaming. We first
analyze the impacts of tilings on tile selection and decoding time, which verify the
benefit of tiling adaptation in various practical aspects. We then formulate the QoE
optimization problem for adaptive tiling and bitrate streaming and discuss the design
details of our adaptation algorithm, which can adapt to the performance of viewport
prediction and the decoding capabilities of mobile clients in addition to the conventional
influencing factors. Finally, the superiority of our proposed approach compared with
the state-of-the-art methods is evaluated through extensive trace-driven simulations.

QoE Ready to Respond: A QoE-aware MEC Selection Scheme for DASH-based Adaptive Video
Streaming to Mobile Users

Wanxin Shi
Qing Li
Ruishan Zhang
Gengbiao Shen
Yong Jiang
Zhenhui Yuan
Gabriel-Miro Muntean

The Multi-access Edge Computing (MEC) paradigm offers cloud-computing support to rich
media applications, including Dynamic Adaptive Streaming over HTTP (DASH)-based ones
at the edge of the network, close to mobile users. MEC servers, typically deployed
at base stations (BS), help reduce latency and improve quality of experience (QoE)
of video streaming. Unfortunately, communications involving mobile users require
handovers between BSs, and these influence both transmission efficiency, because of
the relative position of the MEC servers, and transit cost. At the same time, the serving
MEC for a mobile user need not necessarily be changed when a handover occurs. This
paper introduces QoE Ready to Respond (QoE-R2R), a QoE-aware MEC selection scheme
for DASH-based mobile adaptive video streaming for optimizing video transmission in
a MEC-supported network environment. Simulation-based testing shows that the proposed
QoE-R2R scheme outperforms some traditional alternative solutions. Compared to hit
rate and delay-based schemes, QoE-R2R reduces transmission time by 27.6% and improves
QoE by 6.2%.

Hierarchical Fusion for Practical Ghost-free High Dynamic Range Imaging

Pengfei Xiong
Yu Chen

Ghosting artifacts caused by misalignments and missing content due to over-/under-saturated
regions are generally considered the two key challenges in high dynamic
range (HDR) imaging for dynamic scenes. However, previous CNN-based methods directly
reconstruct the HDR image from the input low dynamic range (LDR) images, with implicit
ghost removal and multi-exposure image fusion in an end-to-end network structure.
In this paper, we decompose HDR imaging into ghost-free image fusion and ghost-based
image restoration, and propose a novel practical Hierarchical Fusion Network (HFNet),
which contains three sub-networks: Mask Fusion Network, Mask Compensation Network,
and Refine Network. Specifically, LDR images are linearly fused in Mask Fusion Network
ignoring the misaligned regions. Then the ghost regions of the fused image are restored
with mask compensation. Finally, all these results are refined in the third network.
This divide-and-conquer strategy makes the proposed method significantly smaller
than previous methods. Experiments on different datasets show the superior performance
of HFNet, with 9x fewer FLOPs, 4x fewer parameters and 3x faster inference speed than
the existing methods while providing comparable accuracy. It also achieves state-of-the-art
quantitative and qualitative results when applied with similar FLOPs.

Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices

Xindong Zhang
Hui Zeng
Lei Zhang

Efficient and light-weight super resolution (SR) is highly demanded in practical applications.
However, most existing studies focus on reducing the number of model parameters
and FLOPs, which may not necessarily lead to faster running speed on mobile devices. In this
work, we propose a re-parameterizable building block, namely Edge-oriented Convolution
Block (ECB), for efficient SR design. In the training stage, the ECB extracts features
in multiple paths, including a normal 3 x 3 convolution, a channel expanding-and-squeezing
convolution, and 1st-order and 2nd-order spatial derivatives from intermediate features.
In the inference stage, the multiple operations can be merged into one single 3 x 3
convolution. ECB can be regarded as a drop-in replacement to improve the performance
of normal 3 x 3 convolution without introducing any additional cost in the inference
stage. We then propose an extremely efficient SR network for mobile devices based
on ECB, namely ECBSR. Extensive experiments across five benchmark datasets demonstrate
the effectiveness and efficiency of ECB and ECBSR. Our ECBSR achieves comparable PSNR/SSIM
performance to state-of-the-art light-weight SR models, while it can super resolve
images from 270p/540p to 1080p in real-time on commodity mobile devices, e.g., Snapdragon
865 SOC and Dimensity 1000+ SOC. The source code can be found at https://github.com/xindongzhang/ECBSR.
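
The merging step described above rests on the linearity of convolution: parallel branches
whose outputs are summed can be collapsed into one kernel by summing their (scaled) kernels.
The NumPy sketch below illustrates only that principle and is not the ECBSR implementation.
It assumes a single channel, no bias, no 1 x 1 expanding-and-squeezing branch, and fixed Sobel
and Laplacian filters standing in for the 1st-order and 2nd-order derivative branches. All
function names and scale values are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation without padding (the usual deep-learning 'convolution')."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Fixed derivative filters (assumed stand-ins for the edge-oriented branches).
SOBEL_X = np.array([[1.0, 0.0, -1.0],
                    [2.0, 0.0, -2.0],
                    [1.0, 0.0, -1.0]])
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def multi_branch(image, k_normal, scale_sobel, scale_lap):
    # Training-time view: three parallel paths whose outputs are summed.
    return (conv2d_valid(image, k_normal)
            + scale_sobel * conv2d_valid(image, SOBEL_X)
            + scale_lap * conv2d_valid(image, LAPLACIAN))

def merge_kernels(k_normal, scale_sobel, scale_lap):
    # Inference-time view: one 3 x 3 kernel equivalent to the three paths.
    return k_normal + scale_sobel * SOBEL_X + scale_lap * LAPLACIAN

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
k_normal = rng.standard_normal((3, 3))

train_out = multi_branch(image, k_normal, 0.5, -0.25)
infer_out = conv2d_valid(image, merge_kernels(k_normal, 0.5, -0.25))
```

Under these assumptions the two outputs agree up to floating-point error, so the extra
branches add no cost once the kernels have been merged.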

SESSION: Poster Session 5

Semantic Scalable Image Compression with Cross-Layer Priors

Hanyue Tu
Li Li
Wengang Zhou
Houqiang Li

In an intelligent society, image compression needs to serve both human vision and
machine vision. Traditional image compression schemes only consider visual quality
for humans. In addition, the bitstream needs to be fully decoded to images before
performing semantic analysis (e.g., by deep neural networks). These two factors make
traditional image compression schemes semantically inefficient. To better serve the
needs of both human vision and machine vision, it is more reasonable to compress and
transmit image signals and features simultaneously. In this paper, we propose a novel
end-to-end semantic scalable image compression method, which progressively compresses
coarse-grained semantic features, fine-grained semantic features, and image signals.
To utilize the cross-layer correlation between features and image signals, we propose
a cross-layer context model to reduce the information redundancy, which takes higher-layer
features as cross-layer priors to predict the probability distribution parameters
for the entropy model of lower-layer features or images. Furthermore, we adopt a Region
of Interest (ROI) compression scheme. The objects with rich semantic information and
the background are compressed separately, to further improve the compression efficiency.
Experimental results on the CUB-200-2011 and FGVC-Aircraft datasets demonstrate the
effectiveness of our proposed scheme compared to separate compression of image signals
and features.

Cascade Cross-modal Attention Network for Video Actor and Action Segmentation from
a Sentence

Weidong Chen
Guorong Li
Xinfeng Zhang
Hongyang Yu
Shuhui Wang
Qingming Huang

In this paper, we address the problem of selectively segmenting the actor and its
action in a video clip given a sentence description. The main challenge is to
match the local semantic features of the video with the heterogeneous textual features.
A widely used language processing method in previous works is to leverage bi-LSTM
and self-attention, which fixes the attention of the sentence and neglects the individual
characteristics of the video, causing the attention of the sentence to mismatch the most discriminative
features of the video. The proposed algorithm allows the sentence to
learn the most discriminative features of the video, remarkably improving the accuracy
of matching and segmentation. Specifically, we propose a cascade cross-modal attention
that leverages visual features from two perspectives to attend to language from coarse to fine
and generate discriminative vision-aware language features. Moreover, equipping
our framework with a contrastive learning method and a designed hard negative mining
strategy helps our proposed network identify the positive sample among large numbers
of negatives, further improving the performance. To demonstrate the effectiveness
of our approach, we conduct experiments on two datasets: A2D Sentences and J-HMDB
Sentences. Experimental results show that our method significantly improves the performance
over recent state-of-the-art methods.

Extracting Useful Knowledge from Noisy Web Images via Data Purification for Fine-Grained
Recognition

Chuanyi Zhang
Yazhou Yao
Xing Xu
Jie Shao
Jingkuan Song
Zechao Li
Zhenmin Tang

Fine-grained visual recognition tasks typically require training data with reliable
acquisition and annotation processes. Acquiring such datasets with precise fine-grained
annotations is very expensive and time-consuming. Conversely, a vast amount of web
data is relatively easy to obtain with nearly no human effort. Nevertheless, the presence
of label noise in web images becomes a huge obstacle for training robust fine-grained
recognition models. In this work, we investigate the noisy label problem and propose
a method that can specifically distinguish in- and out-of-distribution noisy samples.
It can purify the web training data by discarding out-of-distribution noisy images
and relabeling in-distribution ones. After purification, we can train the model on
a less noisy web training set to achieve better robustness and performance. Extensive
experiments on three real-world web datasets for fine-grained visual recognition demonstrate
the superiority of our approach.

Complementary Factorization towards Outfit Compatibility Modeling

Tianyu Su
Xuemeng Song
Na Zheng
Weili Guan
Yan Li
Liqiang Nie

Recently, outfit compatibility modeling, which aims to evaluate the compatibility
of a given outfit that comprises a set of fashion items, has gained growing research
attention. Although existing studies have achieved prominent progress, most of them
overlook the essential learning of a global outfit representation and the uncovering
of the hidden complementary factors behind outfit compatibility. To this end, we propose an
Outfit Compatibility Modeling scheme via Complementary Factorization, termed OCM-CF.
In particular, OCM-CF consists of two key components: context-aware outfit representation
modeling and hidden complementary factors modeling. The former works on adaptively
learning the global outfit representation with graph convolutional networks and the
multi-head attention mechanism, where the item context is fully explored. The latter
aims to uncover the latent complementary factors with multiple parallel networks,
each of which corresponds to a factor-oriented context-aware outfit representation
modeling component. In this part, a new orthogonality-based complementarity regularization is
proposed to encourage the learned factors to complement each other and better characterize
the outfit compatibility. Finally, the outfit compatibility is obtained by summing
all the hidden complementary factor-oriented outfit compatibility scores, each of
which is derived from the corresponding outfit representation. Extensive experiments
on two real-world datasets demonstrate the superiority of our OCM-CF over the state-of-the-art
methods.
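
The abstract does not give the exact form of the orthogonality-based regularization, so the
following is only one plausible reading, written as a NumPy sketch. It assumes each hidden
factor yields one outfit representation vector, penalizes the squared off-diagonal entries
of the Gram matrix of the L2-normalized vectors, and sums the per-factor scores to obtain
the final compatibility. The function names and the penalty form are assumptions made here.

```python
import numpy as np

def orthogonality_penalty(factor_reprs):
    """Penalty that is zero when factor representations are mutually orthogonal.

    factor_reprs: array of shape (num_factors, dim), one row per hidden factor.
    Assumed form: sum of squared off-diagonal cosine similarities.
    """
    norms = np.linalg.norm(factor_reprs, axis=1, keepdims=True)
    normed = factor_reprs / norms
    gram = normed @ normed.T                  # pairwise cosine similarities
    off_diag = gram - np.eye(len(factor_reprs))
    return float(np.sum(off_diag ** 2))

def outfit_compatibility(factor_scores):
    # Final score as the sum of factor-oriented scores, as the abstract states.
    return float(np.sum(factor_scores))

orthogonal = np.eye(3)                        # three mutually orthogonal factors
redundant = np.array([[1.0, 0.0], [1.0, 0.0]])  # two identical factors

penalty_orthogonal = orthogonality_penalty(orthogonal)
penalty_redundant = orthogonality_penalty(redundant)
total_score = outfit_compatibility([0.25, 0.5, 0.125])
```

A penalty of this kind grows as factors become redundant, which matches the stated goal of
encouraging the learned factors to complement each other.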

Open Set Face Anti-Spoofing in Unseen Attacks

Xin Dong
Hao Liu
Weiwei Cai
Pengyuan Lv
Zekuan Yu

In this paper, we propose an end-to-end open set face anti-spoofing (OSFA) approach
for unseen attack recognition. Previous domain generalization approaches aim to align
multiple domains beyond one common subspace, leading to performance degradation due
to the discrepancy of different domains. To address this issue, our approach formulates
face anti-spoofing (FAS) in an open set recognition framework, which learns compact
representation for each known class in parallel to recognizing unseen attack examples.
To this end, we incorporate statistical extreme value theory into our
objective under a multi-task framework. Moreover, we develop an identity-aware contrastive
learning method, which prevents confusion between unseen attack examples and hard
examples. Experimental results on four datasets demonstrate the robustness of our
proposed OSFA, especially under diverse categories of unseen attacks.

Interventional Video Relation Detection

Yicong Li
Xun Yang
Xindi Shang
Tat-Seng Chua

Video Visual Relation Detection (VidVRD) aims to semantically describe the dynamic
interactions across visual concepts localized in a video in the form of <subject, predicate,
object>. It can help to mitigate the semantic gap between vision and language in video
understanding, thus receiving increasing attention in multimedia communities. Existing
efforts primarily leverage the multimodal/spatio-temporal feature fusion to augment
the representation of object trajectories as well as their interactions and formulate
the prediction of predicates as a multi-class classification task. Despite their effectiveness,
existing models ignore the severe long-tailed bias in VidVRD datasets. As a result,
the models' prediction will be easily biased towards the popular head predicates (e.g.,
next-to and in-front-of), thus leading to poor generalizability.

To fill the research gap, this paper proposes an Interventional Video Relation Detection
(IVRD) approach that aims to improve not only the accuracy but also the robustness
of the model prediction. Specifically, to better model the high-level visual predicate,
our IVRD consists of two key components: 1) we first learn a set of predicate prototypes,
where each prototype vector describes a set of relation references with the same predicate;
and 2) we apply a causality-inspired intervention on the model input <subject, object>,
which forces the model to fairly take each possible predicate prototype into
consideration. We expect the model to focus more on the visual content of the dynamic
interaction between subject and object, rather than the spurious correlations between
the model input and predicate labels. Extensive experiments on two popular benchmark
datasets show the effectiveness of IVRD and also its advantages in reducing the bad
long-tailed bias.

CanvasEmb: Learning Layout Representation with Large-scale Pre-training for Graphic
Design

Yuxi Xie
Danqing Huang
Jinpeng Wang
Chin-Yew Lin

Layout representation, which models visual elements and their inter-relations in a
canvas, plays a crucial role in graphic design intelligence. With a large variety
of layout designs and the unique characteristic of layouts that visual elements are
defined as a list of categorical (e.g., type) and numerical (e.g., position and size)
properties, it is challenging to learn general and compact representations with limited
data. Inspired by the recent success of self-supervised pre-training techniques in
various natural language processing tasks, in this paper, we propose CanvasEmb (Canvas
Embedding), which pre-trains deep representations from unlabeled graphic designs by
jointly conditioning on all the context elements in a canvas, with a multi-dimensional
feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model
can be fine-tuned with just one additional output layer and a small amount of training
data to create models for a wide range of downstream tasks. We verify our approach
with presentation slides data. We construct a large-scale dataset with more than one
million slides and propose two layout understanding tasks with human-labeled sets,
namely element role labeling and image captioning. Evaluation results on these two
tasks show that our model with fine-tuning achieves state-of-the-art performance.
Furthermore, we conduct a deep analysis aiming to understand the modeling mechanism
of CanvasEmb and demonstrate its great potential with two extended applications: layout
auto completion and layout retrieval.

Augmenting TV Shows via Uncalibrated Camera Small Motion Tracking in Dynamic Scene

Yizhen Lao
Jie Yang
Xinying Wang
Jianxin Lin
Yu Cao
Shien Song

To augment the TV show in post-production, we propose a novel solution to uncalibrated
camera small motion tracking in a dynamic scene that simultaneously reconstructs the
sparse 3D scene and computes the camera pose and focal length of each frame. The critical
elements of our approach are a robust image feature tracking strategy in dynamic scenes,
followed by automatic local-window frame slicing, and local and global bundle adjustment
optimization initialized by a homography-based uncalibrated relative rotation solver.
The proposed method allows us to add virtual objects (elements) into the reconstructed
3D scene and then composite them back into the original shot with perfectly matched
perspective and a seamless appearance.

Evaluation on a large variety of real TV show sequences demonstrates the merits
of our method against state-of-the-art works and commercial software products.

SimulSLT: End-to-End Simultaneous Sign Language Translation

Aoxiong Yin
Zhou Zhao
Jinglin Liu
Weike Jin
Meng Zhang
Xingshan Zeng
Xiaofei He

Sign language translation, a technology with profound social significance,
has attracted growing interest from researchers in recent years. However, existing
sign language translation methods need to read the entire video before starting the
translation, which leads to a high inference latency and also limits their application
in real-life scenarios. To solve this problem, we propose SimulSLT, the first end-to-end
simultaneous sign language translation model, which can translate sign language videos
into target text concurrently. SimulSLT is composed of a text decoder, a boundary
predictor, and a masked encoder. We 1) use the wait-k strategy for simultaneous translation;
2) design a novel boundary predictor based on the integrate-and-fire module to output
the gloss boundary, which is used to model the correspondence between the sign language
video and the gloss; and 3) propose an innovative re-encode method to help the model obtain
more abundant contextual information, which allows the existing video features to
interact fully. Experimental results on the RWTH-PHOENIX-Weather 2014T
dataset show that SimulSLT achieves BLEU scores that exceed the latest end-to-end
non-simultaneous sign language translation model while maintaining low latency, which
proves the effectiveness of our method.
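
The wait-k strategy mentioned above can be illustrated with a minimal sketch. It only shows the generic read/write schedule of wait-k decoding and is not the SimulSLT implementation; the function name `wait_k_schedule` is hypothetical, and treating each predicted gloss segment as one source unit is an assumption made for illustration.

```python
def wait_k_schedule(num_source: int, num_target: int, k: int) -> list[int]:
    """Return, for each target token index t (0-based), how many source
    units have been read before token t is emitted under wait-k.

    The decoder first waits for k source units, then alternates between
    emitting one target token and reading one more source unit, until the
    source is exhausted.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t, num_source) for t in range(num_target)]


# With 5 source units and k = 2, the first token is emitted after 2 units
# have been read, and the whole source is visible from the 4th token on.
schedule = wait_k_schedule(num_source=5, num_target=6, k=2)
print(schedule)  # [2, 3, 4, 5, 5, 5]
```

A smaller k lowers latency because translation starts earlier, at the cost of less context per emitted token.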

Mask and Predict: Multi-step Reasoning for Scene Graph Generation

Hongshuo Tian
Ning Xu
An-An Liu
Chenggang Yan
Zhendong Mao
Quan Zhang
Yongdong Zhang

Scene Graph Generation (SGG) aims to parse an image into a set of semantics, comprising
objects and their relations. Current SGG methods stop at presenting the
intuitive detections in the image, such as the triplet "logo on board". By contrast,
humans can further refine these intuitive detections into rational descriptions like
"flower painted on surfboard". However, most existing methods formulate
SGG as a straightforward one-time prediction task, in which
a single-pass pipeline predicts all the semantics at once. To
address this limitation, we propose a novel multi-step reasoning approach for SGG. Concretely,
we break SGG into two explicit learning stages: an intuitive training stage
(ITS) and a rational training stage (RTS). In the first stage, we follow the traditional
SGG processing to detect objects and relationships, yielding an intuitive scene graph.
In the second stage, we perform multi-step reasoning to refine the intuitive scene
graph. Each reasoning step consists of two operations: mask and
predict. According to the primary predictions and their confidences, we repeatedly select
and mask the low-confidence predictions, whose features are optimized and predicted
again. After several iterations, the intuitive semantics are gradually
revised with high confidence, yielding a rational scene graph. Extensive experiments
on Visual Genome prove the superiority of the proposed method. Additional ablation
studies and visualization cases further validate its effectiveness.
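
The mask-and-predict loop described above can be sketched as follows. This is an illustrative outline, not the authors' model: the names `mask_and_predict` and `toy_repredict`, the confidence threshold, and the step limit are all assumptions, and the learned re-prediction network is replaced by a lookup table.

```python
from typing import Callable, Optional

Prediction = tuple[str, float]  # (predicate label, confidence)


def mask_and_predict(
    predictions: list[Prediction],
    repredict: Callable[[int, list[Optional[Prediction]]], Prediction],
    threshold: float = 0.5,
    max_steps: int = 3,
) -> list[Prediction]:
    """Iteratively mask low-confidence predictions and predict them again."""
    current = list(predictions)
    for _ in range(max_steps):
        # Mask: select every prediction whose confidence is below the threshold.
        masked = {i for i, (_label, conf) in enumerate(current) if conf < threshold}
        if not masked:
            break
        # Masked entries are hidden (None) from the context used for re-prediction.
        context = [None if i in masked else p for i, p in enumerate(current)]
        # Predict: re-estimate each masked entry from the unmasked context.
        for i in sorted(masked):
            current[i] = repredict(i, context)
    return current


def toy_repredict(index: int, context: list[Optional[Prediction]]) -> Prediction:
    # Hypothetical stand-in for the learned predictor.
    refined = {1: ("painted on", 0.8)}
    return refined.get(index, ("unknown", 0.0))


intuitive = [("on", 0.9), ("on", 0.3)]
rational = mask_and_predict(intuitive, toy_repredict)
print(rational)  # [('on', 0.9), ('painted on', 0.8)]
```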

Heterogeneous Face Recognition with Attention-guided Feature Disentangling

Shanmin Yang
Xiao Yang
Yi Lin
Peng Cheng
Yi Zhang
Jianwei Zhang

This paper proposes an attention-guided feature disentangling framework (AgFD) to
eliminate the large cross-modality discrepancy for Heterogeneous Face Recognition
(HFR). Existing HFR methods either focus only on extracting identity features or impose
linear/no independence constraints on the decomposed components. Instead, our AgFD
disentangles the facial representation and forces intrinsic independence between identity
features and identity-irrelevant variations. To this end, an Attention-based Residual
Decomposition Module (AbRDM) and an Adversarial Decorrelation Module (ADM) are presented.
AbRDM provides hierarchical complementary feature disentanglement, while ADM is introduced
for decorrelation learning. Extensive experiments on the challenging CASIA NIR-VIS
2.0 Database, Oulu-CASIA NIR&VIS Database, BUAA-VisNir Database, and IIIT-D Viewed
Sketch Database demonstrate the generalization ability and competitive performance
of the proposed method.

Exploring the Quality of GAN Generated Images for Person Re-Identification

Yiqi Jiang
Weihua Chen
Xiuyu Sun
Xiaoyu Shi
Fan Wang
Hao Li

Recently, GAN-based methods have demonstrated strong effectiveness in generating augmentation
data for person re-identification (ReID), on account of their ability to bridge the
gap between domains and enrich the data variety in feature space. However, most of
the ReID works pick all the GAN generated data as additional training samples or evaluate
the quality of GAN generation at the entire data set level, ignoring the image-level
essential features of data in the ReID task. In this paper, we analyze the in-depth characteristics
of ReID samples and address the question "What makes a GAN-generated image good for
ReID?". Specifically, we propose to examine each data sample with id-consistency and
diversity constraints by mapping images onto different spaces. With a metric-based
sampling method, we demonstrate that not all GAN-generated data are beneficial for
augmentation. Models trained with data filtered by our quality evaluation outperform
those trained with the full augmentation set by a large margin. Extensive experiments
show the effectiveness of our method on both the supervised ReID task and the unsupervised
domain adaptation ReID task.

Multi-view Clustering via Deep Matrix Factorization and Partition Alignment

Chen Zhang
Siwei Wang
Jiyuan Liu
Sihang Zhou
Pei Zhang
Xinwang Liu
En Zhu
Changwang Zhang

Multi-view clustering (MVC) has been extensively studied in recent years as a way to combine
information from multiple sources. One typical type of MVC method is based on matrix factorization
to effectively perform dimension reduction and clustering. However, the existing approaches
can be further improved in the following respects: i) The current one-layer matrix
factorization framework cannot fully exploit the useful data representations. ii)
Most algorithms focus only on the shared information while ignoring the view-specific
structure, leading to suboptimal solutions. iii) The partition-level information has
not been utilized in existing work. To solve the above issues, we propose a novel
multi-view clustering algorithm via deep matrix decomposition and partition alignment.
To be specific, the partition representations of each view are obtained through deep
matrix decomposition, and then are jointly utilized with the optimal partition representation
for fusing multi-view information. Finally, an alternating optimization algorithm
is developed to solve the optimization problem with proven convergence. Comprehensive
experimental results on six benchmark multi-view datasets clearly demonstrate
the effectiveness of the proposed algorithm against state-of-the-art (SOTA) methods. The code
is available at https://github.com/ZCtalk/MVC-DMF-PA.

Video Similarity and Alignment Learning on Partial Video Copy Detection

Zhen Han
Xiangteng He
Mingqian Tang
Yiliang Lv

Existing video copy detection methods generally measure video similarity based on
spatial similarities between key frames, neglecting the latent similarity in temporal
dimension, so that the video similarity is biased towards spatial information. Other
methods model a unified video similarity in an end-to-end way but lose detailed
partial alignment information, which makes them unable to localize copied segments.
To address the above issues, we propose the Video Similarity and Alignment Learning
(VSAL) approach, which jointly models spatial similarity, temporal similarity and
partial alignment. To mitigate the spatial similarity bias, we model the temporal
similarity as the mask map predicted from frame-level spatial similarity, where each
element indicates the probability of a frame pair lying right on the partial alignments.
To further localize partial copies, the step map is learned from the spatial similarity
where the elements indicate extending directions of the current partial alignments
on the spatial-temporal similarity map. Obtained from the mask map, the start points
extend out into partial optimal alignments following instructions of the step map.
With the similarity and alignment learning strategy, VSAL achieves the state-of-the-art
F1-score on the VCDB core dataset. Furthermore, we construct a new benchmark of partial
video copy detection and localization by adding new segment-level annotations for
the FIVR-200k dataset, where VSAL also achieves the best performance, verifying its effectiveness
in more challenging situations. Our project is publicly available at https://pvcd-vsal.github.io/vsal/.

No-Reference Video Quality Assessment with Heterogeneous Knowledge Ensemble

Jinjian Wu
Yongxu Liu
Leida Li
Weisheng Dong
Guangming Shi

Blind assessment of video quality is still challenging even in this deep learning
era. The limited number of samples in existing databases is insufficient to learn
a good feature extractor for video quality assessment (VQA), while manually labeling
a larger database with subjective perception is very labor-intensive and time-consuming.
To relieve such difficulty, we first collect 3589 high-quality video clips as the
reference and build a large VQA dataset. The dataset contains more than 300K samples
degraded by various distortion types due to compression and transmission error, and
provides weak labels for each distorted sample with several full-reference VQA algorithms.
To learn effective representations from the weakly labeled data, we alleviate the bias
of a single weak label (i.e., single knowledge) by learning from multiple heterogeneous
sources of knowledge. To this end, we propose a novel no-reference VQA (NR-VQA) method with HEterogeneous
Knowledge Ensemble (HEKE). Compared with learning from single knowledge, HEKE can theoretically
reach a lower infimum and learn richer representations due to the heterogeneity. Extensive
experimental results show that the proposed HEKE outperforms existing NR-VQA methods,
and achieves the state-of-the-art performance. The source code will be available at
https://github.com/Sissuire/BVQA-HEKE.

Seeing is Believing?: Effects of Visualization on Smart Device Privacy Perceptions

Carlos Bermejo Fernandez
Petteri Nurmi
Pan Hui

Research on smart device privacy has consistently highlighted that privacy is an important
concern for users, yet users fail to act on their concerns. While this discrepancy
between user perceptions and actions has been consistently reported, currently there
is a limited understanding of why this is the case or how the situation can be ameliorated.
This paper systematically studies how visualizations in privacy assistants can improve
the situation, reporting on two studies that explore the users' privacy perceptions
in smart device ecosystems. The first study shows that displaying device location
and data type reduces the users' privacy perceptions. Participants also weigh the
use of media such as online news as a source to inform users about the possible inferences.
The second study analyzes participants' preferences to visualize smart device information
and privacy policies using augmented reality. Through these two studies, we derive
insights and guidelines on how to design effective privacy assistants and to improve
users' knowledge of risks associated with data disclosure in smart home scenarios.

MHFC: Multi-Head Feature Collaboration for Few-Shot Learning

Shuai Shao
Lei Xing
Yan Wang
Rui Xu
Chunyan Zhao
Yanjiang Wang
Baodi Liu

Few-shot learning (FSL) aims to address the data-scarce problem. A standard FSL framework
is composed of two components: (1) Pre-train. Employ the base data to generate a CNN-based
feature extraction model (FEM). (2) Meta-test. Apply the trained FEM to acquire the
novel data's features and recognize them. FSL relies heavily on the design of the
FEM. However, various FEMs have distinct emphases. For example, several may focus
more attention on the contour information, whereas others may lay particular emphasis
on the texture information. The single-head feature is only a one-sided representation
of the sample. Besides the negative influence of cross-domain transfer (e.g., the trained FEM
cannot adapt to the novel class flawlessly), the distribution of novel data may
deviate to some degree from the ground truth distribution, which is
dubbed the distribution-shift problem (DSP). To address the DSP, we propose the Multi-Head
Feature Collaboration (MHFC) algorithm, which attempts to project the multi-head features
(e.g., multiple features extracted from a variety of FEMs) to a unified space and
fuse them to capture more discriminative information. Specifically, we first introduce
a subspace learning method to transform the multi-head features to aligned low-dimensional
representations. It corrects the DSP via learning the feature with more powerful discrimination
and overcomes the problem of inconsistent measurement scales from different head features.
Then, we design an attention block to update combination weights for each head feature
automatically. It comprehensively considers the contribution of various perspectives
and further improves the discrimination of features. We evaluate the proposed method
on five benchmark datasets (including cross-domain experiments) and achieve significant
improvements of 2.1%-7.8% compared with state-of-the-art methods.
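
As a rough illustration of combining head features with attention-style weights, the sketch below fuses already-aligned feature vectors with softmax weights. It is not the MHFC attention block: the name `fuse_heads` is hypothetical, the per-head scores are assumed to be given rather than learned, and the subspace alignment step is assumed to have been applied beforehand.

```python
import math


def fuse_heads(
    head_features: list[list[float]], head_scores: list[float]
) -> tuple[list[float], list[float]]:
    """Fuse aligned per-head feature vectors with softmax combination weights."""
    # Numerically stable softmax over the per-head scores.
    peak = max(head_scores)
    exps = [math.exp(s - peak) for s in head_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the head features, dimension by dimension.
    dim = len(head_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, head_features))
        for d in range(dim)
    ]
    return fused, weights


# Two heads with equal scores contribute equally, so the result is their mean.
fused, weights = fuse_heads([[1.0, 3.0], [3.0, 5.0]], [0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(fused)    # [2.0, 4.0]
```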

Vision-guided Music Source Separation via a Fine-grained Cycle-Separation Network

Ma Shuo
Yanli Ji
Xing Xu
Xiaofeng Zhu

Music source separation from a sound mixture remains a big challenge because there
often exist heavy overlaps and interactions among similar music signals. In order
to correctly separate mixed sources, we propose a novel Fine-grained Cycle-Separation
Network (FCSN) for vision-guided music source separation. With the guidance of visual
features, the proposed FCSN approach preliminarily separates music sources by minimizing
the residual spectrogram which is calculated by removing preliminarily separated music
spectrograms from the original music mixture. The separation is repeated several times
until the residual spectrogram becomes empty or contains only noise. Extensive experiments
are performed on three large-scale datasets: MUSIC (MUSIC-21), AudioSet, and
VGGSound. Our approach outperforms state-of-the-art approaches on all datasets,
and both separation accuracies and visualization results demonstrate its effectiveness
for solving the problem of overlap and interaction in music source separation.
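
The cycle of separating, computing the residual, and separating again can be sketched as below. This is a toy outline under stated assumptions rather than the FCSN network: spectrograms are flattened to lists of magnitudes, `cycle_separate` and `toy_separator` are hypothetical names, and the vision-guided separator is replaced by a stand-in that recovers half of the remaining energy per cycle.

```python
def cycle_separate(mixture, separate_step, num_sources, noise_floor=1e-3, max_cycles=10):
    """Repeat separation on the residual until it is (nearly) empty."""
    estimates = [[0.0] * len(mixture) for _ in range(num_sources)]
    residual = list(mixture)
    for _ in range(max_cycles):
        # Stop once the residual energy falls to the assumed noise floor.
        if sum(v * v for v in residual) <= noise_floor:
            break
        # Separate whatever is still left in the residual and accumulate it.
        updates = separate_step(residual)
        for est, upd in zip(estimates, updates):
            for j, u in enumerate(upd):
                est[j] += u
        # Residual = mixture minus everything separated so far.
        residual = [m - sum(est[j] for est in estimates) for j, m in enumerate(mixture)]
    return estimates, residual


def toy_separator(residual):
    # Hypothetical stand-in for the learned separator: source 0 owns the first
    # half of the bins, source 1 the second half, and each call recovers only
    # half of what remains in the bins it owns.
    half = len(residual) // 2
    first = [0.5 * v if j < half else 0.0 for j, v in enumerate(residual)]
    second = [0.5 * v if j >= half else 0.0 for j, v in enumerate(residual)]
    return [first, second]


mixture = [1.0, 2.0, 3.0, 4.0]
estimates, residual = cycle_separate(mixture, toy_separator, num_sources=2)
print(estimates)
print(residual)
```

By construction the separated estimates plus the residual always add up to the original mixture, which is what makes the residual a usable stopping signal.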

GLM-Net: Global and Local Motion Estimation via Task-Oriented Encoder-Decoder Structure

Yuchen Yang
Ye Xiang
Shuaicheng Liu
Lifang Wu
Boxuan Zhao
Bing Zeng

In this work, we study the problem of separating the global camera motion and the
local dynamic motion from an optical flow. Previous methods either estimate global
motions by a parametric model, such as a homography, or estimate both of them by an
optical flow field. However, none of these methods can directly estimate global and
local motions in an end-to-end manner. In addition, separating the two motions
accurately from a hybrid flow field is challenging, because one motion can easily
confuse the estimate of the other when they are compounded. To this end,
we propose an end-to-end global and local motion estimation network, GLM-Net. We design
two encoder-decoder structures for the motion separation in the optical flow based
on different task orientations. One structure adopts a mask autoencoder to extract
the global motion, while the other one uses attention U-net for the local motion refinement.
We further design two effective training methods to overcome the lack of
supervision. We apply our method to the action recognition datasets NCAA and UCF-101
to verify the accuracy of the local motion, and the homography estimation dataset
DHE for the accuracy of the global motion. Experimental results show that our method
can achieve competitive performance in both tasks at the same time, validating the
effectiveness of the motion separation.
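
To make the global/local decomposition concrete, the sketch below computes the flow induced at a pixel by a homography and subtracts it from an observed flow vector. GLM-Net learns this separation with networks; the additive model (observed flow = global flow + local flow) and the function names here are assumptions used only for illustration.

```python
def homography_flow(H, x, y):
    """Flow (dx, dy) induced at pixel (x, y) by a 3x3 homography H."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    xp = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    yp = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return (xp - x, yp - y)


def local_motion(flow, H, x, y):
    """Residual motion left after removing the homography-induced flow."""
    gu, gv = homography_flow(H, x, y)
    return (flow[0] - gu, flow[1] - gv)


# A camera translation of (2, -1) pixels expressed as a homography.
H = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0],
     [0.0, 0.0, 1.0]]

print(homography_flow(H, 10.0, 5.0))            # (2.0, -1.0)
print(local_motion((5.0, -1.0), H, 10.0, 5.0))  # (3.0, 0.0)
```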

Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada

Automatically describing video, or video captioning, has been widely studied in the
multimedia field. This paper proposes a new task of sensor-augmented egocentric-video
captioning, a newly constructed dataset for it called MMAC Captions, and a method
for the newly proposed task that effectively utilizes multi-modal data of video and
motion sensors, or inertial measurement units (IMUs). While conventional video captioning
tasks have difficulty in dealing with detailed descriptions of human activities due
to the limited view of a fixed camera, egocentric vision has greater potential to
be used for generating the finer-grained descriptions of human activities on the basis
of a much closer view. In addition, we utilize wearable-sensor data as auxiliary information
to mitigate the inherent problems in egocentric vision: motion blur, self-occlusion,
and out-of-camera-range activities. We propose a method for effectively utilizing
the sensor data in combination with the video data on the basis of an attention mechanism
that dynamically determines the modality that requires more attention, taking the
contextual information into account. We compared the proposed sensor-fusion method
with strong baselines on the MMAC Captions dataset and found that using sensor data
as supplementary information to the egocentric-video data was beneficial, and that
our proposed method outperformed the strong baselines, demonstrating the effectiveness
of the proposed method.

Cross Modal Compression: Towards Human-comprehensible Semantic Compression

Jiguo Li
Chuanmin Jia
Xinfeng Zhang
Siwei Ma
Wen Gao

Traditional image/video compression aims to reduce the transmission/storage cost with
signal fidelity as high as possible. However, with the increasing demand for machine
analysis and semantic monitoring in recent years, semantic fidelity rather than signal
fidelity is becoming another emerging concern in image/video compression. With the
recent advances in cross modal translation and generation, in this paper, we propose
cross modal compression (CMC), a semantic compression framework for visual data,
to transform highly redundant visual data (such as image, video, etc.) into a compact,
human-comprehensible domain (such as text, sketch, semantic map, attributions, etc.),
while preserving the semantics. Specifically, we first formulate the CMC problem as
a rate-distortion optimization problem. Secondly, we investigate the relationship
with the traditional image/video compression and the recent feature compression frameworks,
showing the difference between our CMC and these prior frameworks. Then we propose
a novel paradigm for CMC to demonstrate its effectiveness. The qualitative and quantitative
results show that our proposed CMC can achieve encouraging reconstructed results with
an ultrahigh compression ratio, showing better compression performance than the widely
used JPEG baseline.

RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

Yunqing Hu
Xuan Jin
Yin Zhang
Haiwen Hong
Jingfeng Zhang
Yuan He
Hui Xue

In fine-grained image recognition (FGIR), the localization and amplification of region
attention is an important factor, which has been explored extensively in convolutional
neural network (CNN) based approaches. The recently developed vision transformer
(ViT) has achieved promising results in computer vision tasks. Compared with CNNs,
its image sequentialization is a brand-new approach. However, ViT is limited in its receptive
field size and thus lacks the local attention that CNNs have due to the fixed size of its patches,
and is unable to generate multi-scale features to learn discriminative region attention.
To facilitate the learning of discriminative region attention without box/part annotations,
we use the strength of the attention weights to measure the importance of the patch
tokens corresponding to the raw images. We propose the recurrent attention multi-scale
transformer (RAMS-Trans), which uses the transformer's self-attention to recursively
learn discriminative region attention in a multi-scale manner. Specifically, at the
core of our approach lies the dynamic patch proposal module (DPPM) responsible for
guiding region amplification to complete the integration of multi-scale image patches.
The DPPM starts with the full-size image patches and iteratively scales up the region
attention to generate new patches from global to local by the intensity of the attention
weights generated at each scale as an indicator. Our approach requires only the attention
weights that come with ViT itself and can be easily trained end-to-end. Extensive
experiments demonstrate that RAMS-Trans performs better than existing works, in addition
to efficient CNN models, achieving state-of-the-art results on three benchmark datasets.
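
One way to picture attention-guided region selection is sketched below: given a grid of per-patch attention scores, keep the patches whose score exceeds a threshold and return their bounding box, which could then be cropped and rescaled for the next scale. The mean-based threshold, the `scale` parameter, and the name `attention_region` are assumptions for illustration and are not taken from the DPPM.

```python
def attention_region(attn, scale=1.0):
    """Bounding box (top, left, bottom, right) of high-attention patches.

    `attn` is a 2D grid of per-patch attention scores. A patch counts as
    salient when its score exceeds `scale` times the mean score.
    """
    flat = [a for row in attn for a in row]
    threshold = scale * sum(flat) / len(flat)
    rows = [r for r, row in enumerate(attn) if any(a > threshold for a in row)]
    cols = [c for c in range(len(attn[0])) if any(row[c] > threshold for row in attn)]
    if not rows:
        # No salient patch: fall back to the full grid.
        return (0, 0, len(attn) - 1, len(attn[0]) - 1)
    return (min(rows), min(cols), max(rows), max(cols))


attn = [
    [0.01, 0.02, 0.01, 0.01],
    [0.02, 0.20, 0.25, 0.01],
    [0.01, 0.22, 0.18, 0.02],
    [0.01, 0.01, 0.01, 0.01],
]
print(attention_region(attn))  # (1, 1, 2, 2)
```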

Memory-Augmented Deep Unfolding Network for Compressive Sensing

Jiechong Song
Bin Chen
Jian Zhang

Mapping a truncated optimization method into a deep neural network, deep unfolding
network (DUN) has attracted growing attention in compressive sensing (CS) due to its
good interpretability and high performance. Each stage in DUNs corresponds to one
iteration in optimization. By understanding DUNs from the perspective of the human
brain's memory processing, we find that there exist two issues in existing DUNs. One is that
the information passed between every two adjacent stages, which can be regarded as short-term
memory, is usually severely lost. The other is that there is no explicit mechanism to ensure that
the previous stages affect the current stage, which means memory is easily forgotten.
To solve these issues, in this paper, a novel DUN with persistent memory for CS is
proposed, dubbed Memory-Augmented Deep Unfolding Network (MADUN). We design a memory-augmented
proximal mapping module (MAPMM) by combining two types of memory augmentation mechanisms,
namely High-throughput Short-term Memory (HSM) and Cross-stage Long-term Memory (CLM).
HSM is exploited to allow DUNs to transmit multi-channel short-term memory, which
greatly reduces information loss between adjacent stages. CLM is utilized to develop
the dependency of deep information across cascading stages, which greatly enhances
network representation capability. Extensive CS experiments on natural and MR images
show that with the strong ability to maintain and balance information our MADUN outperforms
existing state-of-the-art methods by a large margin. The source code is available
at https://github.com/jianzhangcs/MADUN/.

Underwater Species Detection using Channel Sharpening Attention

Lihao Jiang
Yi Wang
Qi Jia
Shengwei Xu
Yu Liu
Xin Fan
Haojie Li
Risheng Liu
Xinwei Xue
Ruili Wang

With the continuous exploration of marine resources, underwater artificial intelligent
robots play an increasingly important role in the fish industry. However, the detection
of underwater objects is a very challenging problem due to the irregular movement
of underwater objects, the occlusion of sand and rocks, the diversity of water illumination,
and the poor visibility and low color contrast in the underwater environment. In this
article, we first propose a real-world underwater object detection dataset (UODD),
which covers more than 3K images of the most common aquatic products. Then we propose
the Channel Sharpening Attention Module (CSAM) as a plug-and-play module to further fuse
high-level image information, providing the network with the privilege of selecting
feature maps. Fusion of original images through CSAM can improve the accuracy of detecting
small and medium objects, thereby improving the overall detection accuracy. We also
use Water-Net as a preprocessing method to remove the haze and color cast in complex
underwater scenes, which shows a satisfactory detection result on small-sized objects.
In addition, we use the class weighted loss as the training loss, which can accurately
describe the relationship between classification and precision of bounding boxes of
targets, and the loss function converges faster during the training process. Experimental
results show that the proposed method reaches a maximum AP of 50.1%, outperforming
other traditional and state-of-the-art detectors. In addition, our model only needs
an average inference time of 25.4 ms per image, which is quite fast and might suit
the real-time scenario.

Self-Supervised Pre-training on the Target Domain for Cross-Domain Person Re-identification

Junyin Zhang
Yongxin Ge
Xinqian Gu
Boyu Hua
Tao Xiang

Most existing cluster-based cross-domain person re-identification (re-id) methods
only pre-train the re-id model on the source domain. Unfortunately, the pre-trained
model may not perform well on the target domain due to the large domain gap between
source and target domains, which is harmful to the following optimization. In this
paper, we propose a novel Self-supervised Pre-training method on the Target Domain
(SPTD), which pre-trains the model on both the source and target domains in a self-supervised
manner. Specifically, SPTD uses different kinds of data augmentation to simulate
different intra-class changes and constrains the consistency between the augmented
data distribution and the original data distribution. As a result, the pre-trained
model involves some specific discriminative knowledge on the target domain and is
beneficial to the following optimization. It is easy to combine the proposed SPTD
with other cluster-based cross-domain re-id methods just by replacing the original
pre-trained model with our pre-trained model. Comprehensive experiments on three widely
used datasets, i.e. Market1501, DukeMTMC-ReID and MSMT17, demonstrate the effectiveness
of SPTD. Especially, the final results surpass previous state-of-the-art methods by
a large margin.

Exploring Graph-Structured Semantics for Cross-Modal Retrieval

Lei Zhang
Leiting Chen
Chuan Zhou
Fan Yang
Xin Li

We study and address the cross-modal retrieval problem which lies at the heart of
visual-textual processing. Its major challenge lies in how to effectively learn a
shared multi-modal feature space where the discrepancies of semantically related pairs,
such as images and texts, are minimized regardless of their modalities. Most current
methods focus on reasoning about cross-modality semantic relations within individual
image-text pairs to learn the common representation. However, they overlook more global,
structural inter-pair knowledge within the dataset, i.e., the graph-structured semantics
within each training batch. In this paper, we introduce a graph-based, semantic-constrained
learning framework to comprehensively explore the intra- and inter-modality information
for cross-modal retrieval. Our idea is to maximally explore the structures of labeled
data in graph latent space, and use them as semantic constraints to enforce feature
embeddings from the semantically-matched (image-text) pairs to be more similar and
vice versa. It raises a novel graph-constrained common embedding learning paradigm
for cross-modal retrieval, which is largely under-explored up to now. Moreover, a
GAN-based dual learning approach is used to further improve the discriminability and
model the joint distribution across different modalities. Our fully-equipped approach,
called Graph-constrained Cross-modal Retrieval (GCR), is able to mine intrinsic structures
of training data for model learning and enable reliable cross-modal retrieval. We
empirically demonstrate that our GCR can achieve higher accuracy than existing state-of-the-art
approaches on the Wikipedia, NUS-WIDE-10K, PKU XMedia and Pascal Sentence datasets.
Code is available at https://github.com/neoscheung/GCR.

Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation

Lei Shen
Haolan Zhan
Xin Shen
Yonghao Song
Xiaofang Zhao

Open-domain dialogue generation in natural language processing (NLP) is by default
a pure-language task, which aims to satisfy human need for daily communication on
open-ended topics by producing related and informative responses. In this paper, we
point out that hidden images, named as visual impressions (VIs), can be explored from
the text-only data to enhance dialogue understanding and help generate better responses.
Besides, the semantic dependency between a dialogue post and its response is complicated,
e.g., few word alignments and some topic transitions. Therefore, the visual impressions
of them are not shared, and it is more reasonable to integrate the response visual
impressions (RVIs) into the decoder, rather than the post visual impressions (PVIs).
However, both the response and its RVIs are not given directly in the test process.
To handle the above issues, we propose a framework to explicitly construct VIs based
on pure-language dialogue datasets and utilize them for better dialogue understanding
and generation. Specifically, we obtain a group of images (PVIs) for each post based
on a pre-trained word-image mapping model. These PVIs are used in a co-attention encoder
to get a post representation with both visual and textual information. Since the RVIs
are not provided during testing, we design a cascade decoder that consists of two
sub-decoders. The first sub-decoder predicts the content words in response, and applies
the word-image mapping model to get corresponding RVIs. Then, the second sub-decoder
generates the response based on the post and RVIs. Experimental results on two open-domain
dialogue datasets show that our proposed approach achieves superior performance over
competitive baselines in terms of fluency, relatedness, and diversity.

Quality Assessment of End-to-End Learned Image Compression: The Benchmark and Objective Measure

Yang Li
Shiqi Wang
Xinfeng Zhang
Shanshe Wang
Siwei Ma
Yue Wang

Recently, learning-based lossy image compression has achieved notable breakthroughs
with its excellent modeling and representation learning capabilities. Compared
with traditional image codecs based on block partitioning and transform, these data-driven
approaches with artificial-neural-network (ANN) structures bring significantly different
distortion patterns. Efficient objective image quality assessment (IQA) measures play
a key role in the quantitative evaluation and optimization of image compression algorithms.
In this paper, we construct a large-scale image database for quality assessment of
compressed images. In the proposed database, 100 reference images are compressed to
different quality levels by 10 codecs, involving both traditional and learning-based
codecs. Based on this database, we present a benchmark for existing IQA methods and
reveal the challenges of IQA on learning-based compression distortions. Furthermore,
we develop an objective quality assessment framework in which a self-attention module
is adopted to leverage multi-level features from reference and compressed images.
Extensive experiments demonstrate the superiority of our method in terms of prediction
accuracy. The subjective and objective study of various compressed images also sheds
light on the optimization of image compression methods.

A Statistical Approach to Mining Semantic Similarity for Deep Unsupervised Hashing

Xiao Luo
Daqing Wu
Zeyu Ma
Chong Chen
Minghua Deng
Jianqiang Huang
Xian-Sheng Hua

The majority of deep unsupervised hashing methods usually first construct pairwise
semantic similarity information and then learn to map images into compact hash codes
while preserving the similarity structure, which implies that the quality of hash
codes highly depends on the constructed semantic similarity structure. However, since
the features of images for each kind of semantics usually scatter in high-dimensional
space with unknown distribution, previous methods could introduce a large number of
false positives and negatives for boundary points of distributions in the local semantic
structure based on pairwise cosine distances. To address this limitation, we propose
a general distribution-based metric to depict the pairwise distance between images.
Specifically, each image is characterized by its random augmentations that can be
viewed as samples from the corresponding latent semantic distribution. Then we estimate
the distances between images by calculating the sample distribution divergence of
their semantics. By applying this new metric to deep unsupervised hashing, we come
up with Distribution-based similArity sTructure rEconstruction (DATE). DATE can generate
more accurate semantic similarity information by using non-parametric ball divergence.
Moreover, DATE explores both semantic-preserving learning and contrastive learning
to obtain high-quality hash codes. Extensive experiments on several widely-used datasets
validate the superiority of our DATE.
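
The distribution-based distance described above can be illustrated with a minimal sketch of the sample ball divergence between two point sets. This is not the authors' implementation: the function names are hypothetical, plain feature vectors stand in for the features of random augmentations, and the statistic follows the usual two-sample form (the squared gap between the two samples' ball masses, averaged over balls centred at points of each sample).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fraction_in_ball(center, radius, points):
    """Fraction of `points` lying inside the closed ball B(center, radius)."""
    inside = sum(1 for p in points if euclidean(center, p) <= radius)
    return inside / len(points)


def one_sided_term(centers, others):
    """Average squared gap between the two samples' ball masses, over all
    balls centred at centers[i] with radius d(centers[i], centers[j])."""
    total = 0.0
    for ci in centers:
        for cj in centers:
            radius = euclidean(ci, cj)
            gap = (fraction_in_ball(ci, radius, centers)
                   - fraction_in_ball(ci, radius, others))
            total += gap * gap
    return total / (len(centers) ** 2)


def ball_divergence(sample_x, sample_y):
    """Sample ball divergence: sum of the two one-sided terms."""
    return one_sided_term(sample_x, sample_y) + one_sided_term(sample_y, sample_x)


# Hypothetical "augmentation features" of three images (two similar, one different).
image_a = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0)]
image_b = [(0.1, 0.1), (0.0, 0.1), (0.1, 0.0)]
image_c = [(5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
print(ball_divergence(image_a, image_b), ball_divergence(image_a, image_c))
```

The statistic is zero when the two samples coincide and grows as they separate, which is the property a similarity structure built on it would rely on.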

BAM: Bilateral Activation Mechanism for Image Fusion

Zi-Rong Jin
Liang-Jian Deng
Tian-Jing Zhang
Xiao-Xu Jin

With conventional activation functions such as ReLU, LeakyReLU, and PReLU, the negative
parts in feature maps are simply truncated or linearized, which may result in inflexible
structure and undesired information distortion. In this paper, we propose a simple
but effective Bilateral Activation Mechanism (BAM) which could be applied to the activation
function to offer an efficient feature extraction model. Based on BAM, the Bilateral
ReLU Residual Block (BRRB) that still sufficiently keeps the nonlinear characteristic
of ReLU is constructed to separate the feature maps into two parts, i.e., the positive
and negative components, then adaptively represent and extract the features by two
independent convolution layers. Besides, our mechanism will not increase any extra
parameters or computational burden in the network. Finally, we embed the BRRB into
a basic ResNet architecture, called BRResNet, which readily obtains state-of-the-art
performance in two image fusion tasks, i.e., pansharpening and hyperspectral image
super-resolution (HISR). Additionally, deeper analysis and an ablation study demonstrate
the effectiveness of BAM and the lightweight property of the network. The code is
available from the project page: https://liangjiandeng.github.io/Projects_Res/bam_mm2021.html
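
The positive/negative split described above can be sketched in a few lines. This is a toy 1D illustration under stated assumptions, not the BRRB itself: the kernels, the function names, and the final summation of the two branches are hypothetical, and the residual connection is omitted.

```python
def split_bilateral(features):
    """Split a feature vector into its positive and negative components."""
    positive = [max(v, 0.0) for v in features]  # the part ReLU keeps
    negative = [min(v, 0.0) for v in features]  # the part ReLU would discard
    return positive, negative


def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding (odd-length kernel),
    so the output has the same length as the input."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def bilateral_filter(features, kernel_pos, kernel_neg):
    """Process the positive and negative parts with two independent kernels
    and add the results (the summation is an assumption of this sketch)."""
    positive, negative = split_bilateral(features)
    out_pos = conv1d_same(positive, kernel_pos)
    out_neg = conv1d_same(negative, kernel_neg)
    return [p + n for p, n in zip(out_pos, out_neg)]


x = [1.5, -2.0, 0.0, -0.5]
print(split_bilateral(x))
print(bilateral_filter(x, kernel_pos=[0.25, 0.5, 0.25], kernel_neg=[0.0, 1.0, 0.0]))
```

Because the two parts sum back to the input, no information is lost at the split; each branch is then free to learn its own filter.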

Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors

Lei Wang
Piotr Koniusz

In this paper, we build on a concept of self-supervision by taking RGB frames as input
to learn to predict both action concepts and auxiliary descriptors e.g., object descriptors.
So-called hallucination streams are trained to predict auxiliary cues, simultaneously
fed into classification layers, and then hallucinated at the testing stage to aid
the network. We design and hallucinate two descriptors, one leveraging four popular object
detectors applied to training videos, and the other leveraging image- and video-level
saliency detectors. The first descriptor encodes the detector- and ImageNet-wise
class prediction scores, confidence scores, and spatial locations of bounding boxes
and frame indexes to capture the spatio-temporal distribution of features per video.
Another descriptor encodes spatio-angular gradient distributions of saliency maps
and intensity patterns. Inspired by the characteristic function of the probability
distribution, we capture four statistical moments on the above intermediate descriptors.
As the numbers of coefficients in the mean, covariance, coskewness and cokurtosis grow
linearly, quadratically, cubically and quartically w.r.t. the dimension of feature
vectors, we describe the covariance matrix by its leading n' eigenvectors (so-called
subspace) and we capture skewness/kurtosis rather than costly coskewness/cokurtosis.
We obtain state of the art on five popular datasets such as Charades and EPIC-Kitchens.
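
The growth rates mentioned above, and the cheaper per-dimension alternative, can be made concrete with a small sketch. The function names are hypothetical; the counts are for full (non-deduplicated) moment tensors, and the skewness and kurtosis use the standard population definitions.

```python
import math


def coefficient_counts(d):
    """Entries in the full moment tensors for d-dimensional feature vectors."""
    return {"mean": d, "covariance": d ** 2, "coskewness": d ** 3, "cokurtosis": d ** 4}


def per_dimension_moments(samples):
    """Mean, variance, skewness and kurtosis of each feature dimension.

    `samples` is a list of equal-length feature vectors. The result holds
    4 * d numbers, versus d**3 and d**4 for coskewness and cokurtosis.
    """
    n = len(samples)
    stats = []
    for k in range(len(samples[0])):
        column = [row[k] for row in samples]
        mean = sum(column) / n
        centered = [v - mean for v in column]
        variance = sum(c * c for c in centered) / n
        std = math.sqrt(variance)
        if std == 0.0:
            skewness, kurtosis = 0.0, 0.0  # constant feature: moments undefined, use 0
        else:
            skewness = sum(c ** 3 for c in centered) / n / std ** 3
            kurtosis = sum(c ** 4 for c in centered) / n / std ** 4
        stats.append((mean, variance, skewness, kurtosis))
    return stats


print(coefficient_counts(100))
print(per_dimension_moments([[-1.0, 2.0], [0.0, 2.0], [1.0, 5.0]]))
```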

Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition

Tailin Chen
Desen Zhou
Jian Wang
Shidong Wang
Yu Guan
Xuming He
Errui Ding

The task of skeleton-based action recognition remains a core challenge in human-centred
scene understanding due to the multiple granularities and large variation in human
motion. Existing approaches typically employ a single neural representation for different
motion patterns, which has difficulty in capturing fine-grained action classes given
limited training data. To address the aforementioned problems, we propose a novel
multi-granular spatio-temporal graph network for skeleton-based action classification
that jointly models the coarse- and fine-grained skeleton motion patterns. To this
end, we develop a dual-head graph network consisting of two interleaved branches,
which enables us to extract features at two spatio-temporal resolutions in an effective
and efficient manner. Moreover, our network utilises a cross-head communication strategy
to mutually enhance the representations of both heads. We conduct extensive experiments
on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton,
and achieve state-of-the-art performance on all the benchmarks, which validates
the effectiveness of our method.

ION: Instance-level Object Navigation

Weijie Li
Xinhang Song
Yubing Bai
Sixian Zhang
Shuqiang Jiang

Visual object navigation is a fundamental task in Embodied AI. Previous works focus
on category-wise navigation, in which navigating to any instance of the target
object category is considered a success. Those methods may be effective at finding
generic objects. However, in real life it is often more practical to navigate to a specific
instance, since particular requirements are usually satisfied by specific
instances rather than by all instances of a category. How to navigate to a specific
instance has rarely been studied and remains challenging for current
works. In this paper, we introduce a new task of Instance Object Navigation (ION),
where instance-level descriptions of targets are provided and instance-level navigation
is required. In particular, multiple types of attributes such as colors, materials
and object references are involved in the instance-level descriptions of the targets.
To give the agent the ability to navigate to instances, we propose
a cascade framework with an Instance-Relation Graph (IRG) based navigator and an instance
grounding module. To distinguish different instances of the same object category,
we construct an instance-level graph instead of a category-level one, where instances are
regarded as nodes, encoded with representations of colors, materials and locations
(bounding boxes). During navigation, the detected instances activate the corresponding
nodes in the IRG, which are updated with a graph convolutional neural network (GCNN). The
final instance prediction is obtained with the grounding module by selecting the candidates
(instances) with maximum probability (a joint probability of category, color and material,
obtained by corresponding regressors with softmax). For the task evaluation, we build
a benchmark for instance-level object navigation on the AI2-Thor simulator, where over
27,735 object instance descriptions and navigation groundtruth are automatically obtained
through the interaction with the simulator. The proposed model outperforms the baseline
in instance-level metrics, showing that our proposed graph model can guide instance
object navigation, as well as leaving promising room for further improvement. The
project is available at https://github.com/LWJ312/ION.
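
The grounding step described above (selecting the candidate with the maximum joint probability of category, color and material) can be sketched as follows. This is a hedged illustration, not the released code: the function names and the toy logits are hypothetical, and the joint probability is assumed to be the product of the three per-attribute softmax probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_probability(logits_by_label, wanted_label):
    """Softmax probability that one attribute takes the wanted label."""
    labels = list(logits_by_label)
    probs = softmax([logits_by_label[label] for label in labels])
    return probs[labels.index(wanted_label)]


def ground_instance(candidates, target):
    """Return (index, joint probability) of the best-matching candidate.

    Assumption of this sketch: the joint probability is the product of the
    category, color and material probabilities.
    """
    best_index, best_score = -1, -1.0
    for index, candidate in enumerate(candidates):
        score = 1.0
        for attribute in ("category", "color", "material"):
            score *= attribute_probability(candidate[attribute], target[attribute])
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Two detected mugs that differ only in predicted color (toy logits).
candidates = [
    {"category": {"mug": 2.0, "bowl": 0.0},
     "color": {"red": 0.0, "blue": 2.0},
     "material": {"ceramic": 1.0, "metal": 0.0}},
    {"category": {"mug": 2.0, "bowl": 0.0},
     "color": {"red": 2.0, "blue": 0.0},
     "material": {"ceramic": 1.0, "metal": 0.0}},
]
target = {"category": "mug", "color": "red", "material": "ceramic"}
print(ground_instance(candidates, target))
```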

Skeleton-Aware Neural Sign Language Translation

Shiwei Gan
Yafeng Yin
Zhiwei Jiang
Lei Xie
Sanglu Lu

As an essential means of communication for deaf-mutes, sign languages are expressed by
human actions. To distinguish human actions for sign language understanding, the skeleton,
which contains position information of the human pose, can provide an important cue, since
different actions usually correspond to different poses/skeletons. However, the skeleton
has not been fully studied for Sign Language Translation (SLT), especially for end-to-end
SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural
Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design
a self-contained branch for skeleton extraction. To efficiently guide the feature
extraction from video with skeletons, we concatenate the skeleton channel and RGB
channels of each frame for feature extraction. To distinguish the importance of clips,
we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling,
i.e., giving importance weight for each clip. The scaled features of each clip are
then sent to a decoder module to generate spoken language. In our SANet, a joint training
strategy is designed to optimize skeleton extraction and sign language translation
jointly. Experimental results on two large scale SLT datasets demonstrate the effectiveness
of our approach, which outperforms the state-of-the-art methods. Our code is available
at https://github.com/SignLanguageCode/SANet.

Fingerspelling Recognition in the Wild with Fixed-Query based Visual Attention

Srinivas Kruthiventi S S
George Jose
Nitya Tandon
Rajesh Biswal
Aashish Kumar

We propose an end-to-end solution for recognizing fingerspelling using multi-scale
attention with fixed-queries. Fingerspelling recognition in the wild gets challenging
because of the multiple sub-problems involved - detecting the signing hand, tracking
it across frames, and recognizing subtle variations in a hand gesture. While the current
state-of-the-art handles these with external face/hand detectors, optical flow features,
and iteratively refining the attention maps, our work proposes a deep learning model
that takes in the RGB videos and recognizes fingerspelling with a single forward pass.
Without any frame-level supervision, our proposed model learns to pay attention to
informative regions in each frame, such as fingers, hand, and face, to recognize signs.
Multi-scale features from these attended regions are then processed using a recurrent
neural network to recognize the alphabet sequentially. We train our model using a
curriculum learning strategy with simpler samples at the beginning, followed by challenging
samples at a later stage. We have evaluated our approach on Chicago Fingerspelling
Wild and WildPlus datasets and have achieved about 8% and 4% improvements, respectively,
compared to the current state-of-the-art methods. Further analysis of our method shows
that our attention mechanism is intuitive from a human perspective, and visualizing
it offers useful insights into the working of the model.

Deep Human Dynamics Prior

Qiongjie Cui
Huaijiang Sun
Yue Kong
Xiaoning Sun

Motion capture (MoCap) technology aims to provide an accurate record of human motion,
with specific potentials in activity analysis, human behavior understanding, as well
as multimedia industries of animation production and special effects movies. However,
because of joint occlusion and limitation of equipment precision, the raw motion data
are often damaged, which severely hinders its downstream applications. The latest
method relies on deep neural networks to reconstruct the underlying complete motion
from the degraded observation, achieving remarkable results. Unfortunately, due to
the non-enumerability of human motion, the trained model from large-scale training
data often fails to comprehensively cover incomputable action categories, which may
lead to a sharp decline in the performance of deep learning-based methods. To handle
these limitations, we propose an untrained deep generative model, in which Graph Convolutional
Networks (GCNs) are utilized to efficiently capture complicated topological relationships
of human joints. We show that the untrained GCN architecture with randomly-initialized
weights is sufficient to extract some low-level statistics for human motion reconstruction
without any training process. Notably, the performance of our approach is comparable
to that of those trained models, while its application is not restricted by the availability
of training data or a pre-trained network. Moreover, the proposed model even surpasses
the state-of-the-art methods when encountering samples not covered by the human
action database, in both the human motion recovery and gap-filling
tasks.

Exploiting Invariance of Mining Facial Landmarks

Jiangming Shi
Zixian Gao
Hao Liu
Zekuan Yu
Fengjun Li

In this paper, we propose an invariant learning method for facial landmark mining
in a self-supervised manner. The conventional methods mostly train with raw data of
paired facial appearances and landmarks, assuming that they are evenly distributed.
However, such assumptions tend to lead to failures in challenging cases, even
after costly training, since they usually do not hold in real-world scenarios. To
address this issue, our model becomes invariant to facial biases by learning
through the landmark-anchored distributions. Specifically, we generate faces from
these distributions, then group them based on the appearance sources and the probe
facial landmarks into intra-identities and intra-landmarks classes, respectively.
Thus, we construct intra-class invariance losses to disentangle the spatial structures
from appearances. In addition, we adopt a reconstruction loss to produce more realistic
faces with probe landmarks. Extensive experimental results on four standard facial
landmark datasets demonstrate that our method achieves compelling performance compared
with supervised and unsupervised methods.

Joint Implicit Image Function for Guided Depth Super-Resolution

Jiaxiang Tang
Xiaokang Chen
Gang Zeng

Guided depth super-resolution is a practical task where a low-resolution and noisy
input depth map is restored to a high-resolution version, with the help of a high-resolution
RGB guide image. Existing methods usually view this task as a generalized guided filtering
problem that relies on designing explicit filters and objective functions, or a dense
regression problem that directly predicts the target image via deep neural networks.
These methods suffer from either limited model capability or limited interpretability. Inspired by
the recent progress in implicit neural representation, we propose to formulate the
guided super-resolution as a neural implicit image interpolation problem, where we
take the form of a general image interpolation but use a novel Joint Implicit Image
Function (JIIF) representation to learn both the interpolation weights and values.
JIIF represents the target image domain with spatially distributed local latent codes
extracted from the input image and the guide image, and uses a graph attention mechanism
to learn the interpolation weights at the same time in one unified deep implicit function.
We demonstrate the effectiveness of our JIIF representation on the guided depth super-resolution
task, significantly outperforming state-of-the-art methods on three public benchmarks.
Code can be found at https://git.io/JC2sU

Transformer-based Feature Reconstruction Network for Robust Multimodal Sentiment Analysis

Ziqi Yuan
Wei Li
Hua Xu
Wenmeng Yu

Improving robustness against missing data has become one of the core challenges in
Multimodal Sentiment Analysis (MSA), which aims to judge speaker sentiments from
language, visual, and acoustic signals. In current research, translation-based
methods and tensor regularization methods have been proposed for MSA with incomplete modality
features. However, both fail to cope with randomly missing modality features
in non-aligned sequences. In this paper, a transformer-based feature reconstruction
network (TFR-Net) is proposed to improve the robustness of models to randomly missing
features in non-aligned modality sequences. First, intra-modal and inter-modal attention-based
extractors are adopted to learn robust representations for each element in modality
sequences. Then, a reconstruction module is proposed to generate the missing modality
features. With the supervision of SmoothL1Loss between generated and complete sequences,
TFR-Net is expected to learn semantic-level features corresponding to missing features.
Extensive experiments on two public benchmark datasets show that our model achieves
good results under missing data across various missing-modality combinations and
various degrees of missingness.
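
The reconstruction supervision mentioned above uses a Smooth L1 loss between generated and complete sequences. A minimal sketch of the standard Smooth L1 definition (quadratic below a threshold `beta`, linear above it, averaged over elements) is given below; the function name, the flat-list inputs and the default `beta=1.0` are assumptions for illustration, and how TFR-Net restricts or weights the loss over missing positions is not specified here.

```python
def smooth_l1(generated, complete, beta=1.0):
    """Mean Smooth L1 loss between two equal-length sequences of floats.

    Per element, with diff = |generated - complete|:
      0.5 * diff**2 / beta   if diff < beta
      diff - 0.5 * beta      otherwise
    """
    total = 0.0
    for g, c in zip(generated, complete):
        diff = abs(g - c)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(generated)


# Small errors are penalized quadratically, large ones only linearly.
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))
```

Compared with a squared error, the linear tail makes the reconstruction objective less sensitive to a few badly reconstructed features.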

Self-feature Learning: An Efficient Deep Lightweight Network for Image Super-resolution

Jun Xiao
Qian Ye
Rui Zhao
Kin-Man Lam
Kao Wan

Deep learning-based models have achieved unprecedented performance in single image
super-resolution (SISR). However, existing deep learning-based models usually require
high computational complexity to generate high-quality images, which limits their
applications in edge devices, e.g., mobile phones. To address this issue, we propose
a dynamic, channel-agnostic filtering method in this paper. The proposed method not
only adaptively generates convolutional kernels based on the local information of
each position, but also can significantly reduce the cost of computing the inter-channel
redundancy. Based on this, we further propose a simple, yet effective, deep lightweight
model for SISR. Experimental results show that our proposed model outperforms other
state-of-the-art deep lightweight SISR models, leading to the best trade-off between
the performance and the number of model parameters.

DAWN: Dynamic Adversarial Watermarking of Neural Networks

Sebastian Szyller
Buse Gul Atli
Samuel Marchal
N. Asokan

Training machine learning (ML) models is expensive in terms of computational power,
amounts of labeled data and human expertise. Thus, ML models constitute business value
for their owners. Embedding digital watermarks during model training allows a model
owner to later identify their models in case of theft or misuse. However, model functionality
can also be stolen via model extraction, where an adversary trains a surrogate model
using results returned from a prediction API of the original model. Recent work has
shown that model extraction is a realistic threat. Existing watermarking schemes are
ineffective against model extraction since it is the adversary who trains the surrogate
model. In this paper, we introduce DAWN (Dynamic Adversarial Watermarking of Neural
Networks), the first approach to use watermarking to deter model extraction theft.
Unlike prior watermarking schemes, DAWN does not impose changes to the training process
but operates at the prediction API of the protected model, by dynamically changing
the responses for a small subset of queries (e.g., 0.5%) from API clients. This set
is a watermark that will be embedded in case a client uses its queries to train a
surrogate model. We show that DAWN is resilient against two state-of-the-art model
extraction attacks, effectively watermarking all extracted surrogate models, allowing
model owners to reliably demonstrate ownership (with confidence greater than 1 - 2^{-64}),
incurring negligible loss of prediction accuracy (0.03-0.5%).
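
The idea of changing the responses for a small, fixed subset of queries at the prediction API can be sketched as below. This is an illustration under explicit assumptions, not the DAWN implementation: the keyed HMAC-SHA256 selection, the function names, and the way the altered label is chosen are all hypothetical choices made so that the same query always receives the same answer.

```python
import hashlib
import hmac


def is_watermarked(query: bytes, key: bytes, rate: float = 0.005) -> bool:
    """Deterministically select roughly `rate` of all queries using a keyed hash."""
    digest = hmac.new(key, query, hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return value < int(rate * 2 ** 64)


def respond(query: bytes, true_label: int, num_classes: int,
            key: bytes, rate: float = 0.005) -> int:
    """Return the model's label, or a keyed wrong label for watermarked queries."""
    if not is_watermarked(query, key, rate):
        return true_label
    digest = hmac.new(key, b"label:" + query, hashlib.sha256).digest()
    # Offset in [1, num_classes - 1] guarantees the altered label differs.
    offset = 1 + int.from_bytes(digest[:8], "big") % (num_classes - 1)
    return (true_label + offset) % num_classes


secret = b"hypothetical-owner-key"
answers = [respond(str(i).encode(), true_label=3, num_classes=10, key=secret)
           for i in range(2000)]
print(sum(1 for a in answers if a != 3), "of", len(answers), "responses altered")
```

The (query, altered label) pairs form the watermark set: a surrogate model trained on the API's answers would tend to reproduce them, which the owner can later check.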

Visible Watermark Removal via Self-calibrated Localization and Background Refinement

Jing Liang
Li Niu
Fengjun Guo
Teng Long
Liqing Zhang

Superimposing visible watermarks on images provides a powerful weapon to cope with
the copyright issue. Watermark removal techniques, which can strengthen the robustness
of visible watermarks in an adversarial way, have attracted increasing research interest.
Modern watermark removal methods perform watermark localization and background restoration
simultaneously, which could be viewed as a multi-task learning problem. However, existing
approaches suffer from incompletely detected watermarks and degraded texture quality
in the restored background. Therefore, we design a two-stage multi-task network to address
the above issues. The coarse stage consists of a watermark branch and a background
branch, in which the watermark branch self-calibrates the roughly estimated mask and
passes the calibrated mask to the background branch to reconstruct the watermarked area.
In the refinement stage, we integrate multi-level features to improve the texture
quality of the watermarked area. Extensive experiments on two datasets demonstrate the
effectiveness of our proposed method.

Learning to Decode Contextual Information for Efficient Contour Detection

Ruoxi Deng
Shengjun Liu
Jinxin Wang
Huibing Wang
Hanli Zhao
Xiaoqin Zhang

Contour detection plays an important role in both academic research and real-world
applications. As the basic building block of many applications, its accuracy and efficiency
highly influence the subsequent stages. In this work, we propose a novel lightweight
system for contour detection that achieves state-of-the-art performance while keeping
an ultra-slim model size. The proposed method is built on an efficient encoder in a bottom-up/top-down
fashion. Specifically, we propose a novel decoder that compresses side features from
an encoder and effectively decodes compact contextual information for highly accurate
boundary localization. Besides, we propose a novel loss function that is able to assist
a model to produce crisp object boundaries.

We conduct extensive experiments to demonstrate the effectiveness of the proposed
system on the widely adopted benchmarks BSDS500 and Multi-Cue. The results show that
our system matches the best performance while consuming only 3.3% of the computational
cost (16.45 GFLOPs vs. 499.15 GFLOPs) and 2.35% of the model size (1.94M vs. 82.43M) of the
SOTA detector RCF-ResNet101. Meanwhile, our method outperforms a large portion
of the recent top edge detectors by a clear margin.
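
The two percentages quoted above follow directly from the reported figures; the short check below reproduces them. The helper name is hypothetical.

```python
def percent_of(ours, reference):
    """Express `ours` as a percentage of `reference`."""
    return 100.0 * ours / reference


flops_percent = percent_of(16.45, 499.15)  # computational cost, in GFLOPs
size_percent = percent_of(1.94, 82.43)     # model size, in millions

print(f"computational cost: {flops_percent:.1f}% of RCF-ResNet101")
print(f"model size: {size_percent:.2f}% of RCF-ResNet101")
```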

Fast, High-Quality Hierarchical Depth-Map Super-Resolution

Yiguo Qiao
Licheng Jiao
Wenbin Li
Christian Richardt
Darren Cosker

The low spatial resolution of acquired depth maps is a major drawback of most RGBD
sensors. However, there are many scenarios in which fast acquisition of high-resolution
and high-quality depth maps would be desirable. One approach to achieve higher quality
depth maps is through super-resolution. However, edge preservation is challenging,
and artifacts such as depth confusion and blurring are easily introduced near boundaries.
In view of this, we propose a method for fast, high-quality hierarchical depth-map
super-resolution (HDS). In our method, a high-resolution RGB image is degraded layer
by layer to guide the bilateral filtering of the depth map. To improve the upsampled
depth map quality, we construct a feature-based bilateral filter (FBF) for the interpolation,
by using the extracted RGB shallow and multi-layer features. To accelerate the process,
we perform filtering only near depth boundaries and through matrix operations. We
also propose an extension of our HDS model to a Classification-based Hierarchical
Depth-map Super-resolution (C-HDS) model, where a context-aware trilateral filter
reduces the contributions of unreliable neighbors to the current missing depth location.
Experimental results show that the proposed method is significantly faster than existing
methods for generating high-resolution depth maps, while also significantly improving
depth quality compared to the current state-of-the-art approaches, especially for
large-scale 16x super-resolution.

TsFPS: An Accurate and Flexible 6DoF Tracking System with Fiducial Platonic Solids

Nan Xiang
Xiaosong Yang
Jian J Zhang

We present a vision-based system for real-time pose tracking of rigid objects.
It can not only estimate a single pose in six degrees of freedom (6DoF), but is also
suitable for recovering compound movements. The system comprises a monocular
camera and a series of 3D-printed platonic solids with square fiducial markers attached
to each face. It is easy to set up and extend, and extra cameras can be
incorporated into the pipeline to meet different requirements. The system realizes
object tracking by estimating the pose of the fiducial platonic solid (FPS), which
can be fixed onto the surface of the target object. Platonic solids of different sizes
and shapes can be combined to adapt to different application
scenarios, which gives the system considerable flexibility and applicability.
In order to track the motion of the fiducial platonic solid accurately, a robust algorithm
that combines the fiducial constraint and the statistical constraint is introduced,
which is able to handle illumination changes, motion blur and partial occlusion. We
evaluate the performance of the proposed approach with qualitative and quantitative
experiments. In addition, a couple of mixed reality (MR) applications are developed
to demonstrate the effectiveness of the system.

Consistency-Constancy Bi-Knowledge Learning for Pedestrian Detection in Night Surveillance

Xiao Wang
Zheng Wang
Wu Liu
Xin Xu
Jing Chen
Chia-Wen Lin

Pedestrian detection in night surveillance is a challenging yet largely unexplored
task. Given the success of detectors in daytime surveillance and the convenient
acquisition of all-weather data, we learn knowledge from these data to benefit pedestrian
detection in night surveillance. We find two key properties of surveillance: distribution
cross-time consistency and background cross-frame constancy. This paper proposes
consistency-constancy bi-knowledge learning (CCBL) for pedestrian detection in night
surveillance, which simultaneously acquires knowledge useful for night pedestrian
detection from both day and night surveillance. First, relying on the robustness
of the existing detector in day surveillance, we obtain the pedestrians' distribution
in the daytime scene from the detector's detection results.
Based on the consistency of pedestrians' distribution during the day and night in
the same scene, the pedestrian distribution from daytime is used as the consistency-knowledge
for pedestrian detection in night surveillance. Second, the background, as constant
knowledge of the surveillance scene, is extractable and contributes to the separation
of the foreground, which contains most of the pedestrian regions and helps pedestrian
detection in night surveillance. Finally, we let the two knowledge representations promote
each other and merge them into the final pedestrian representation. Extensive
experiments show that our CCBL significantly outperforms the state-of-the-art methods
on public pedestrian detection datasets. On the NightSurveillance dataset, CCBL reduces
the average missed detection rate by 3.04% compared to the existing best method.

SSconv: Explicit Spectral-to-Spatial Convolution for Pansharpening

Yudong Wang
Liang-Jian Deng
Tian-Jing Zhang
Xiao Wu

Pansharpening aims to fuse a high spatial resolution panchromatic (PAN) image and
a low resolution multispectral (LR-MS) image to obtain a multispectral image with
the same spatial resolution as the PAN image. Thanks to their flexible structure,
convolutional neural networks (CNNs) have been successfully applied to the problem
of pansharpening. However, most existing methods simply feed the up-sampled
LR-MS image into the CNN and ignore the spatial distortion caused by direct up-sampling.
In this paper, we propose an explicit spectral-to-spatial convolution (SSconv) that
aggregates spectral features into the spatial domain to perform the up-sampling operation,
which achieves better performance than direct up-sampling. Furthermore, SSconv
is embedded into a multiscale U-shaped convolutional neural network (MUCNN) to fully
utilize the multispectral information of the involved images. In particular, a multiscale
injection branch and a mixed loss on cross-scale levels are employed to fuse pixel-wise
image information. Benefiting from the distortion-free property of SSconv, the proposed
MUCNN can generate state-of-the-art performance with a simple structure, both on reduced-resolution
and full-resolution datasets acquired from WorldView-3 and GaoFen-2. Please find the
code from the project page.

TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network

Zhengyi Liu
Yuan Wang
Zhengzheng Tu
Yun Xiao
Bin Tang

Salient object detection is a pixel-level dense prediction task that highlights
the prominent object in a scene. Recently, the U-Net framework has been widely used, in which
successive convolution and pooling operations generate multi-level features that are complementary
to each other. Since high-level features contribute more to performance,
we propose a triplet transformer embedding module to enhance them by learning long-range
dependencies across layers. It is the first to use three transformer encoders with
shared weights to enhance multi-level features. By further designing a scale adjustment
module to process the input, devising a three-stream decoder to process the output, and
attaching depth features to color features for multi-modal fusion, the proposed
triplet transformer embedding network (TriTransNet) achieves state-of-the-art
performance in RGB-D salient object detection and pushes the performance to a new
level. Experimental results demonstrate the effectiveness of the proposed modules
and the competitiveness of TriTransNet.

Learning Sample-Specific Policies for Sequential Image Augmentation

Pu Li
Xiaobai Liu
Xiaohui Xie

This paper presents a policy-driven sequential image augmentation approach for image-related
tasks. Our approach applies a sequence of image transformations (e.g., translation,
rotation) over a training image, one transformation at a time, with the augmented
image from the previous time step treated as the input for the next transformation.
This sequential data augmentation substantially improves sample diversity, leading
to improved test performance, especially for data-hungry models (e.g., deep neural
networks). However, the search for the optimal transformation of each image at each
time step of the sequence has high complexity due to its combinatorial nature. To address
this challenge, we formulate the search task as a sequential decision process and
introduce a deep policy network that learns to produce transformations based on image
content. We also develop an iterative algorithm to jointly train a classifier and
the policy network in the reinforcement learning setting. The immediate reward of
a potential transformation is defined to encourage transformations producing hard
samples for the current classifier. At each iteration, we employ the policy network
to augment the training dataset, train a classifier with the augmented data, and train
the policy net with the aid of the classifier. We apply the above approach to both
public image classification benchmarks and a newly collected image dataset for material
recognition. Comparisons to alternative augmentation approaches show that our policy-driven
approach achieves comparable or improved classification performance while using significantly
fewer augmented images. The code is available at https://github.com/Paul-LiPu/rl_autoaug.
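
The sequential augmentation loop described above (one transformation per step, each applied to the output of the previous step) can be sketched as follows. This is a toy illustration, not the released code: images are small lists of integer rows, the three transformations are simple stand-ins, and `toy_policy` is a hypothetical placeholder for the learned policy network.

```python
def flip_horizontal(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def shift_right(image):
    """Translate the image one pixel to the right, padding with zeros."""
    return [[0] + row[:-1] for row in image]


TRANSFORMS = {"flip": flip_horizontal, "rotate": rotate_90, "shift": shift_right}


def toy_policy(image, step):
    """Hypothetical stand-in for the policy network: picks a transformation
    from the image content (here, just the pixel sum) and the step index."""
    names = sorted(TRANSFORMS)
    pixel_sum = sum(sum(row) for row in image)
    return names[(pixel_sum + step) % len(names)]


def augment_sequentially(image, num_steps, policy=toy_policy):
    """Apply one policy-chosen transformation per step to the running image."""
    current = [list(row) for row in image]
    history = []
    for step in range(num_steps):
        name = policy(current, step)
        current = TRANSFORMS[name](current)
        history.append(name)
    return current, history


augmented, applied = augment_sequentially([[1, 2], [3, 4]], num_steps=3)
print(applied, augmented)
```

With K candidate transformations and T steps there are K**T possible sequences per image, which is the combinatorial search space that the policy is meant to avoid enumerating.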

Image Quality Caption with Attentive and Recurrent Semantic Attractor Network

Wen Yang
Jinjian Wu
Leida Li
Weisheng Dong
Guangming Shi

In this paper, a novel quality caption model is inventively developed to assess the
image quality with hierarchical semantics. Existing image quality assessment (IQA)
methods usually represent image quality with a quantitative value, resulting in inconsistency
with human cognition. Generally, human beings are good at perceiving image quality
in terms of semantic description rather than quantitative value. Moreover, cognition
is a needs-oriented task where hierarchical semantics are extracted. A mere
quality value fails to reflect degradations on hierarchical semantics. Therefore,
a new IQA framework is proposed to describe the quality for needs-oriented cognition.
A novel quality caption procedure is firstly introduced, in which the quality is represented
as patterns of activation distributed across the diverse degradations on hierarchical
semantics. Then, an attentive and recurrent semantic attractor network (ARSANet) is
designed to activate the distributed patterns for image quality description. Experiments
demonstrate that our method achieves superior performance and is highly compliant
with human cognition.

Triangle-Reward Reinforcement Learning: A Visual-Linguistic Semantic Alignment for Image Captioning

Weizhi Nie
Jiesi Li
Ning Xu
An-An Liu
Xuanya Li
Yongdong Zhang

Image captioning aims to generate a sentence consisting of sequential linguistic words,
to describe visual units (i.e., objects, relationships, and attributes) in a given
image. Most existing methods rely on the prevalent supervised learning with a cross-entropy
(XE) function to transfer visual units into a sequence of linguistic words. However,
we argue that the XE objective is not sensitive to visual-linguistic alignment and
cannot discriminately penalize the semantic inconsistency and shrink the context gap.
To solve these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL)
method. TRRL uses the scene graph (G), with objects as nodes and relationships as edges, to
represent images, generated sentences, and ground truth sentences individually, and
mutually align them during the training process. Specifically, TRRL formulates
image captioning as cooperative agents, where the first agent aims to extract the visual
scene graph (G_img) from the image (I) and the second agent translates this graph into
a sentence (S). To discriminately penalize the visual-linguistic inconsistency, TRRL
proposes the novel triangle-reward function: 1) the generated sentence and its corresponding
ground truth are decomposed into the linguistic scene graph (G_sen) and ground-truth
scene graph (G_gt), respectively; 2) G_img, G_sen, and G_gt are paired to calculate the
semantic similarity scores which are proportionally assigned to reward each agent.
Meanwhile, to make the training objective sensitive to context changes, we propose
the node-level and triplet-level scoring methods to jointly measure the visual-linguistic
graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority
of TRRL. Additional ablation studies further validate its effectiveness.
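
As a rough illustration of node-level and triplet-level scoring between two scene graphs, the sketch below measures set overlap with an F1 score. The abstract does not give TRRL's actual scoring formulas, so the F1 overlap, the function names f1_overlap and graph_similarity, and the node_weight parameter are assumptions made purely for illustration.

```python
def f1_overlap(a, b):
    """F1 overlap between two collections treated as sets (0.0 if nothing is shared)."""
    a, b = set(a), set(b)
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(b)
    return 2 * precision * recall / (precision + recall)


def graph_similarity(graph_a, graph_b, node_weight=0.5):
    """Blend a node-level and a triplet-level score.

    Each graph is a set of (subject, relation, object) triplets.
    node_weight is a hypothetical mixing parameter, not taken from the paper.
    """
    nodes_a = {n for s, _, o in graph_a for n in (s, o)}
    nodes_b = {n for s, _, o in graph_b for n in (s, o)}
    node_score = f1_overlap(nodes_a, nodes_b)        # do the graphs mention the same objects?
    triplet_score = f1_overlap(graph_a, graph_b)     # do they agree on whole relations?
    return node_weight * node_score + (1 - node_weight) * triplet_score


# Toy graphs for a generated sentence and its ground truth.
g_sen = {("man", "rides", "horse")}
g_gt = {("man", "rides", "horse"), ("horse", "on", "beach")}
print(graph_similarity(g_sen, g_gt))
```

In a three-way setting, the same function could be applied to each pair among the image, sentence, and ground-truth graphs; how the pairwise scores are turned into per-agent rewards is left to the paper.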

Stacked Semantically-Guided Learning for Image De-distortion

Huiyuan Fu
Changhao Tian
Xin Wang
Huadong Ma

Image de-distortion is very important because distortions will degrade the image quality
significantly. It can benefit many computational visual media applications that are
primarily designed for high-quality images. In order to address this challenging issue,
we propose a stacked semantically-guided network, which is the first attempt at this task.
It can capture and restore the distortions around the humans and the adjacent background
effectively with the stacked network architecture and the semantically-guided scheme.
In addition, a discriminative restoration loss function is proposed to recover different
distorted regions in the images discriminatively. As another important effort, we
construct a large-scale dataset for image de-distortion. Extensive qualitative and
quantitative experiments show that our proposed method achieves a superior performance
compared with the state-of-the-art approaches.

Focal and Composed Vision-semantic Modeling for Visual Question Answering

Yudong Han
Yangyang Guo
Jianhua Yin
Meng Liu
Yupeng Hu
Liqiang Nie

Visual Question Answering (VQA) is a vital yet challenging task in the field of multimedia
comprehension. In order to correctly answer questions about an image, a VQA model
must sufficiently understand the visual scene, especially the vision-semantic
reasoning between the two modalities. Traditional relation-based methods
encode the pairwise relations of objects to boost the VQA model performance. However,
this simple strategy is insufficient to exploit the abundant concepts expressed by the
composition of diverse image objects, leading to sub-optimal performance. In this
paper, we propose a focal and composed vision-semantic modeling method, which is a
trainable end-to-end model, for better vision-semantic redundancy removal and compositionality
modeling. Concretely, we first introduce the LENA cell, a plug-and-play reasoning
module, which removes redundant semantics by a focal mechanism in the first step, followed
by the vision-semantic compositionality modeling for better visual reasoning. We then
incorporate the cell into a full LENA network, which progressively refines multimodal
composed representations, and can be leveraged to infer the high-order vision-semantic
in a multi-step learning way. Extensive experiments on two benchmark datasets, i.e.,
VQA v2 and VQA-CP v2, verify the superiority of our model as compared with several
state-of-the-art baselines.

Pose-Guided Feature Learning with Knowledge Distillation for Occluded Person Re-Identification

Kecheng Zheng
Cuiling Lan
Wenjun Zeng
Jiawei Liu
Zhizheng Zhang
Zheng-Jun Zha

Occluded person re-identification (ReID) aims to match person images with occlusion.
It is fundamentally challenging because of the serious occlusion which aggravates
the misalignment problem between images. At the cost of incorporating a pose estimator,
many works introduce pose information to alleviate the misalignment in both training
and testing. To achieve high accuracy while preserving low inference complexity, we
propose a network named Pose-Guided Feature Learning with Knowledge Distillation (PGFL-KD),
where the pose information is exploited to regularize the learning of semantics aligned
features but is discarded in testing. PGFL-KD consists of a main branch (MB), and
two pose-guided branches, i.e., a foreground-enhanced branch (FEB), and a body part
semantics aligned branch (SAB). The FEB intends to emphasise the features of visible
body parts while excluding the interference of obstructions and background (i.e.,
foreground feature alignment). The SAB encourages different channel groups to focus
on different body parts to have body part semantics aligned representation. To get
rid of the dependency on pose information when testing, we regularize the MB to learn
the merits of the FEB and SAB through knowledge distillation and interaction-based
training. Extensive experiments on occluded, partial, and holistic ReID tasks show
the effectiveness of our proposed network.
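
To make the distillation step concrete, here is a minimal sketch of a classic temperature-softened distillation loss between a teacher (standing in for a pose-guided branch) and a student (standing in for the main branch). The abstract does not state which distillation objective PGFL-KD uses, so the KL form, the temperature value, and the function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)   # teacher: fixed target distribution
    q = softmax(student_logits, temperature)   # student: distribution being trained
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Because only the student is needed at test time, the teacher branches (and the pose estimator feeding them) can be dropped after training, which is the property the abstract emphasizes.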

Multiple Objects-Aware Visual Question Generation

Jiayuan Xie
Yi Cai
Qingbao Huang
Tao Wang

The visual question generation task aims to generate meaningful questions about an image
according to a target answer. Existing studies mainly focus on merely one object related
to the target answer in an image to generate a question. However, a target answer
is often related to multiple key objects in an image, so focusing on only one object
may mislead the model to generate questions that are only related to partial fragments
of the answer. To address this problem, we propose a multi-objects aware generation
model to capture all key objects related to an answer and generate the corresponding
question. We first introduce a co-attention network to capture the relationship between
each object in an image and the answer, and then extract the key objects that are
related to the answer. Then, a graph network is introduced to capture the relationships
between the key objects and other objects in the image that are not related to the
answer, which helps generate questions that involve more visual content. Finally,
the learned information from the graph network is fed into a standard decoder module
to produce questions. Extensive experiments on the VQA v2.0 dataset show that the
proposed model outperforms the state-of-the-art models.

VASTile: Viewport Adaptive Scalable 360-Degree Video Frame Tiling

Chamara Madarasingha
Kanchana Thilakarathna

360° videos, a.k.a. spherical videos, are becoming popular among users. Nevertheless,
the omnidirectional view of these videos demands high bandwidth and processing power at
the end devices. Recently proposed viewport-aware streaming mechanisms can reduce
the amount of data transmitted by streaming a limited portion of the frame covering
the current user viewport (VP). However, they still suffer from sending a high amount
of redundant data, as the fixed tile mechanisms cannot provide a finer granularity
to the user VP. Although making the tiles smaller can provide a finer granularity for
the user viewport, it significantly increases encoding-decoding overhead. To overcome
this trade-off, in this paper, we present an adaptive tiling mechanism based on
computational geometry, named VASTile, which takes visual attention information
on a 360° video frame as the input and provides a suitable non-overlapping variable-size
tile cover on the frame. Experimental results show that VASTile can save up to
31.1% of pixel redundancy before compression and 35.4% of bandwidth compared
to recently proposed fixed tile configurations, providing tile schemes within 0.98
(±0.11) seconds.
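
The trade-off with fixed tiles can be seen in a small calculation. The sketch below counts the pixels sent when every fixed tile that touches a rectangular viewport is streamed, and reports the fraction that falls outside the viewport. It is not the VASTile algorithm: the flat rectangular viewport (no spherical wrap-around), the frame size, and the function names are assumptions chosen for illustration.

```python
def tiles_covering(viewport, frame_w, frame_h, cols, rows):
    """Return the fixed-grid tiles (x0, y0, x1, y1) that intersect the viewport."""
    vx0, vy0, vx1, vy1 = viewport
    tile_w, tile_h = frame_w / cols, frame_h / rows
    selected = []
    for r in range(rows):
        for c in range(cols):
            tx0, ty0 = c * tile_w, r * tile_h
            tx1, ty1 = tx0 + tile_w, ty0 + tile_h
            # Keep the tile only if it has a non-empty overlap with the viewport.
            if tx0 < vx1 and tx1 > vx0 and ty0 < vy1 and ty1 > vy0:
                selected.append((tx0, ty0, tx1, ty1))
    return selected


def pixel_redundancy(viewport, frame_w, frame_h, cols, rows):
    """Fraction of transmitted pixels that lie outside the viewport."""
    tiles = tiles_covering(viewport, frame_w, frame_h, cols, rows)
    sent = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in tiles)
    vx0, vy0, vx1, vy1 = viewport
    needed = (vx1 - vx0) * (vy1 - vy0)
    return (sent - needed) / sent


viewport = (1000, 500, 2000, 1100)  # hypothetical viewport on a 3840x1920 frame
for cols, rows in [(4, 2), (16, 8)]:
    n_tiles = len(tiles_covering(viewport, 3840, 1920, cols, rows))
    print(cols, rows, n_tiles, round(pixel_redundancy(viewport, 3840, 1920, cols, rows), 3))
```

The finer grid wastes fewer pixels but needs more tiles per viewport, each of which must be encoded and decoded separately; a variable-size tile cover aims to get low redundancy without that tile count.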

Delving into Deep Image Prior for Adversarial Defense: A Novel Reconstruction-based Defense Framework

Li Ding
Yongwei Wang
Xin Ding
Kaiwen Yuan
Ping Wang
Hua Huang
Z. Jane Wang

Deep learning based image classification models have been shown to be vulnerable to adversarial
attacks that inject deliberately crafted noise into clean images. To defend against
adversarial attacks in a training-free and attack-agnostic manner, this work proposes
a novel and effective reconstruction-based defense framework by delving into deep
image prior (DIP). Fundamentally different from existing reconstruction-based defenses,
the proposed method analyzes and explicitly incorporates the model decision process
into our defense. Given an adversarial image, we first map its reconstructed images
during DIP optimization to the model decision space, where cross-boundary images can
be detected and on-boundary images can be further localized. Then, adversarial noise
is purified by perturbing on-boundary images along the reverse direction to the adversarial
image. Finally, on-manifold images are stitched to construct an image that can be
correctly predicted by the victim classifier. Extensive experiments demonstrate that
the proposed method outperforms existing state-of-the-art reconstruction-based methods
in defending against both white-box attacks and defense-aware attacks. Moreover, the proposed
method can maintain a high visual quality during adversarial image reconstruction.

Fine-Grained Language Identification in Scene Text Images

Yongrui Li
Shilian Wu
Jun Yu
Zengfu Wang

Identifying the language of the text in scene images is crucial for various applications.
Studies that focus on identifying the script, which is a set of letters used for writing
in a given language, in scene text images already exist. However, these works do not
distinguish between different languages written in the same script and are thus unable
to meet the needs of many applications. To address this challenge, we study a novel
task: fine-grained language identification in scene text images, which aims to distinguish
languages that share the same script. We construct datasets that include samples in seven
languages: Dutch, English, French, Italian, German, Spanish, and Portuguese.
Furthermore, well-designed end-to-end trainable neural networks are proposed for fine-grained
language identification, where semantic information concerning the text is mined and
utilized to assist the language identification. We train the networks on the synthetic
dataset and evaluate them with the collected real dataset. The experimental results
demonstrate that the proposed frameworks are effective.

CARE: Cloudified Android OSes on the Cloud Rendering

Dongjie Tang
Cathy Bao
Yong Yao
Chao Xie
Qiming Shi
Marc Mao
Randy Xu
Linsheng Li
Mohammad R. Haghighat
Zhengwei Qi
Haibing Guan

GPUs have become ubiquitous in Cloud rendering due to their outstanding rendering
performance. However, many existing Cloud-rendering systems suffer from low GPU utilization
caused by the CPU bottleneck. Recent proposals (e.g., API-forwarding and c-GPU) for
GPU-usage optimization are promising but fail to address the system-resource redundancy
issues (i.e., each instance tends to occupy all the system resources exceeding their
requirements), leading to unnecessary CPU consumption and lowering GPU utilization.
We conducted an experiment by testing real-world applications on the percentage of
unused resources to demonstrate the severity of this issue. Nearly 50% of resources
are unused.

To solve this problem, we present CARE, the first framework intended to reduce the
system-level redundancy by cloudifying the system from monolithic to Cloud-native.
To allow users to configure their own required services, CARE puts forward a functional
unit called Configurable Android (CA). To allow multiple instances to share certain
types of resources, CARE innovates Sharing Resource (SR). To reduce the unused services,
CARE introduces Pruning Resources (PR). Last but not least, to further reduce CPU
consumption, CARE proposes a new isolation unit called CiC. So far, CARE primarily
focuses on Android systems due to the great popularity of Android Cloud-gaming frameworks.
Furthermore, CARE can handle 60 heavyweight instances (e.g., KOG (King of Glory))
on Intel® SG1.

Context-Aware Selective Label Smoothing for Calibrating Sequence Recognition Model

Shuangping Huang
Yu Luo
Zhenzhou Zhuang
Jin-Gang Yu
Mengchao He
Yongpan Wang

Despite the success of deep neural networks (DNNs) on sequential data (e.g., scene text
and speech) recognition, they suffer from the over-confidence problem mainly due to
overfitting in training with the cross-entropy loss, which may make the decision-making
less reliable. Confidence calibration has been recently proposed as one effective
solution to this problem. Nevertheless, the majority of existing confidence calibration
methods aim at non-sequential data and are limited if directly applied to sequential
data, since the intrinsic contextual dependency in sequences or the class-specific
statistical prior is seldom exploited. To this end, we propose a Context-Aware Selective
Label Smoothing (CASLS) method for calibrating sequential data. The proposed CASLS
fully leverages the contextual dependency in sequences to construct confusion matrices
of contextual prediction statistics over different classes. Class-specific error rates
are then used to adjust the weights of smoothing strength in order to achieve adaptive
calibration. Experimental results on sequence recognition tasks, including scene text
recognition and speech recognition, demonstrate that our method can achieve state-of-the-art
performance.
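
The idea of letting class-specific error rates set the smoothing strength can be sketched as follows. This is not the CASLS algorithm itself: the abstract gives no formulas, so scaling the smoothing mass by the class error rate, spreading it in proportion to the confusion-matrix row, and the names smoothed_target and max_eps are all assumptions for illustration.

```python
def smoothed_target(true_class, confusion, max_eps=0.5):
    """Build a soft target for one class from a confusion matrix.

    confusion[i][j] counts how often true class i was predicted as class j.
    max_eps (hypothetical) caps the total probability mass moved off the true class.
    """
    row = confusion[true_class]
    total = sum(row)
    errors = total - row[true_class]
    target = [0.0] * len(row)
    if total == 0 or errors == 0:
        # A class that is never confused keeps a hard one-hot target.
        target[true_class] = 1.0
        return target
    eps = max_eps * errors / total  # more errors -> stronger smoothing
    for j, count in enumerate(row):
        if j != true_class:
            # Give more mass to the classes this one is actually confused with.
            target[j] = eps * count / errors
    target[true_class] = 1.0 - eps
    return target


confusion = [
    [8, 2, 0],   # class 0 is sometimes mistaken for class 1
    [1, 9, 0],
    [0, 0, 10],  # class 2 is never confused
]
for cls in range(3):
    print(cls, smoothed_target(cls, confusion))
```

For sequence recognition, the confusion counts would be gathered per context (for example, conditioned on neighboring tokens) rather than globally; that contextual part is what the paper adds and is not modeled here.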

Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization

Chunbin Gu
Jiajun Bu
Zhen Zhang
Zhi Yu
Dongfang Ma
Wei Wang

Image retrieval with text feedback is an emerging research topic with the objective
of integrating inputs from multiple modalities as queries. In this paper, queries
contain a reference image plus text feedback that describes modifications between
this image and the desired image. The existing work for this task mainly focuses on
designing a new fusion network to compose the image and text. Still, little research
pays attention to the modality gap caused by the inconsistent distribution of features
from different modalities, which dramatically influences the feature fusion and similarity
learning between queries and the desired image. We propose a Distribution-Aligned
Text-based Image Retrieval (DATIR) model, which consists of attention mutual information
maximization and hierarchical mutual information maximization, to bridge this gap
by increasing non-linear statistical dependencies between representations of different
modalities. More specifically, attention mutual information maximization narrows the
modality gap between different input modalities by maximizing mutual information between
the text representation and its semantically consistent representation captured from
the reference image and the desired image by the difference transformer. Hierarchical
mutual information maximization aligns distributions of features from the image
modality and the fusion modality by estimating mutual information between a single-layer
representation in the fusion network and the multi-level representations in the desired
image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate
that we can bridge the modality gap between different modalities and achieve state-of-the-art
retrieval performance.
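
Mutual information between two representations is usually maximized through a tractable lower bound. The sketch below uses the InfoNCE contrastive bound as one common choice; the abstract does not say which estimator DATIR uses, so the estimator, the temperature, and the function names are assumptions for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(text_embs, image_embs, temperature=0.1):
    """Average InfoNCE loss over a batch of paired embeddings.

    Pair i of (text_embs[i], image_embs[i]) is the positive; all other
    images in the batch act as negatives. Lower loss means a tighter
    lower bound (log N minus the loss) on the mutual information.
    """
    n = len(text_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(text_embs[i], image_embs[j]) / temperature for j in range(n)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n


text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each text matches its own image
shuffled = [[0.0, 1.0], [1.0, 0.0]]    # each text matches the wrong image
print(info_nce(text, aligned), info_nce(text, shuffled))
```

Minimizing this loss pulls each text representation toward its paired image representation and away from the others, which is one way to narrow a modality gap.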

SESSION: Panel 2

Social Signals and Multimedia: Past, Present, Future

Hayley Hung
Cathal Gurrin
Martha Larson
Hatice Gunes
Fabien Ringeval
Elisabeth Andre
Louis-Philippe Morency

The rising popularity of Artificial Intelligence (AI) has brought considerable public
interest as well as faster and more direct transfer of research ideas into practice.
One of the aspects of AI that still trails behind considerably is the role of machines
in interpreting, enhancing, modeling, generating, and influencing social behavior.
Such behavior is captured as social signals, usually by sensors recording multiple
modalities, making it classic multimedia data. Such behavior can also be generated
by an AI system when interacting with humans. AI techniques in combination with
multimedia data can be used to pursue multiple goals, two of which are highlighted
here. First, supporting people during social interactions and helping them to fulfil
their social needs either actively or passively. Second, improving our understanding
of how people collaborate, build relationships, and process self-identity. Despite
the rise of fields such as Social Signal Processing, a similar panel organised at
ACM Multimedia 2014, and an area on social and emotional signals at ACM MM since
2014, we argue that we have yet to truly fulfil the potential of combining social
signals and multimedia. This panel asks where we have come far enough and what remaining
challenges there are in light of recent global events.

SESSION: Session 31: Multimedia Telepresence and Virtual/Augmented Reality

Learning Spatial-angular Fusion for Compressive Light Field Imaging in a Cycle-consistent Framework

Xianqiang Lyu
Zhiyu Zhu
Mantang Guo
Jing Jin
Junhui Hou
Huanqiang Zeng

This paper investigates the 4-D light field (LF) reconstruction from 2-D measurements
captured by the coded aperture camera. To tackle such an ill-posed inverse problem,
we propose a cycle-consistent reconstruction network (CR-Net). To be specific, based
on the intrinsic linear imaging model of the coded aperture, CR-Net reconstructs an
LF through progressively eliminating the residuals between the projected measurements
from the reconstructed LF and input measurements. Moreover, to address the crucial
issue of extracting representative features from high-dimensional LF data efficiently
and effectively, we formulate the problem in a probability space and propose to approximate
a posterior distribution of a set of carefully-defined LF processing events, including
both layer-wise spatial-angular feature extraction and network-level feature aggregation.
Through droppath from a densely-connected template network, we derive an adaptively
learned spatial-angular fusion strategy, which contrasts sharply with existing
methods that combine spatial and angular features empirically. Extensive experiments
on both simulated measurements and measurements by a real coded aperture camera demonstrate
the significant advantage of our method over state-of-the-art ones, i.e., our method
improves the reconstruction quality by 4.5 dB.
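
The linear imaging model and the idea of eliminating residuals between re-projected and observed measurements can be illustrated with a classical analogue. The sketch below is not CR-Net, which is a learned network: it uses plain Landweber iterations on a toy problem with as many measurements as views, and the aperture codes, step size, and function names are assumptions for illustration.

```python
def project(codes, views):
    """Linear coded-aperture model: each measurement is a coded sum of the views."""
    n_pixels = len(views[0])
    return [
        [sum(code[u] * views[u][p] for u in range(len(views))) for p in range(n_pixels)]
        for code in codes
    ]


def reconstruct(codes, measurements, n_views, steps=200, step_size=0.5):
    """Refine the views by repeatedly back-projecting the measurement residual."""
    n_pixels = len(measurements[0])
    views = [[0.0] * n_pixels for _ in range(n_views)]
    for _ in range(steps):
        projected = project(codes, views)
        # Residual between the observed and the re-projected measurements.
        residual = [
            [m - q for m, q in zip(m_row, q_row)]
            for m_row, q_row in zip(measurements, projected)
        ]
        for u in range(n_views):
            for p in range(n_pixels):
                views[u][p] += step_size * sum(
                    codes[k][u] * residual[k][p] for k in range(len(codes))
                )
    return views


codes = [[1.0, 0.5], [0.5, 1.0]]            # hypothetical aperture transmittance per view
true_views = [[2.0, 4.0], [6.0, 8.0]]       # two views, two pixels each
measurements = project(codes, true_views)
print(reconstruct(codes, measurements, n_views=2))
```

With far fewer measurements than views, as in real compressive light field imaging, the residual alone cannot determine the light field, which is why a learned prior is needed on top of this kind of data-consistency loop.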

From Voxel to Point: IoU-guided 3D Object Detection for Point Cloud with Voxel-to-Point Decoder

Jiale Li
Hang Dai
Ling Shao
Yong Ding

In this paper, we present an Intersection-over-Union (IoU) guided two-stage 3D object
detector with a voxel-to-point decoder. To preserve the necessary information from
all raw points and maintain the high box recall in voxel based Region Proposal Network
(RPN), we propose a residual voxel-to-point decoder to extract the point features
in addition to the map-view features from the voxel based RPN. We use a 3D Region
of Interest (RoI) alignment to crop and align the features with the proposal boxes
for accurately perceiving the object position. The RoI-Aligned features are finally
aggregated with the corner geometry embeddings that can provide the potentially missing
corner information in the box refinement stage. We propose a simple and efficient
method to align the estimated IoUs to the refined proposal boxes as a more relevant
localization confidence. The comprehensive experiments on KITTI and Waymo Open Dataset
demonstrate that our method achieves significant improvements with novel architectures
against the existing methods. The code is available on GitHub at https://github.com/jialeli1/From-Voxel-to-Point.
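
For readers unfamiliar with the quantity that guides the detector, the sketch below computes Intersection-over-Union for two 3D boxes. It is a simplification: detectors on KITTI and Waymo use boxes rotated about the vertical axis, whereas this example assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), and the function name is hypothetical.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis means no overlap at all
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


proposal = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
ground_truth = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
print(iou_3d(proposal, ground_truth))
```

An IoU-guided detector predicts this overlap for each refined box and uses the prediction, rather than the classification score alone, as the localization confidence when ranking boxes.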

Extending 6-DoF VR Experience Via Multi-Sphere Images Interpolation

Jisheng Li
Yuze He
Jinghui Jiao
Yubin Hu
Yuxing Han
Jiangtao Wen

Three-degrees-of-freedom (3-DoF) omnidirectional imaging has been widely used in various
applications ranging from street maps to 3-DoF VR live broadcasting. Although it allows
navigating viewpoints rotationally inside a virtual world, it does not provide the
motion parallax that is key for human 3D perception. Recent research mitigates this problem
by introducing 3 translational degrees of freedom (6-DoF) using multi-sphere images
(MSI), which is beginning to show promise in handling occlusions and reflective objects.
However, the design of MSI naturally limits the range of authentic 6-DoF experiences,
as existing mechanisms for MSI rendering cannot fully utilize multi-layer information
when synthesizing novel views between multiple MSIs. To tackle this problem and extend
the 6-DoF range, we propose an MSI interpolation pipeline that utilizes adjacent MSIs'
3D information embedded inside their layers. In this work, we describe an MSI projection
scheme along with an MSI interpolation network to predict intermediate MSIs in order
to meet the need for extended range. We demonstrate that our system significantly
improves the range of 6-DoF experience compared with other MSI-based methods. With
extensive experiments, we show our algorithm outperforms state-of-the-art methods
both qualitatively and quantitatively in synthesizing novel view panoramas.

iButter: Neural Interactive Bullet Time Generator for Human Free-viewpoint Rendering

Liao Wang
Ziyu Wang
Pei Lin
Yuheng Jiang
Xin Suo
Minye Wu
Lan Xu
Jingyi Yu

Generating "bullet-time" effects of human free-viewpoint videos is critical for immersive
visual effects and VR/AR experience. Recent neural advances still lack the controllable
and interactive bullet-time design ability for human free-viewpoint rendering, especially
under the real-time, dynamic and general setting for our trajectory-aware task. To
fill this gap, in this paper we propose a neural interactive bullet-time generator
(iButter) for photo-realistic human free-viewpoint rendering from dense RGB streams,
which enables flexible and interactive design for human bullet-time visual effects.
Our iButter approach consists of a real-time preview and design stage as well as a
trajectory-aware refinement stage. During preview, we propose an interactive bullet-time
design approach by extending the NeRF rendering to a real-time and dynamic setting
and getting rid of the tedious per-scene training. To this end, our bullet-time design
stage utilizes a hybrid training set, light-weight network design and an efficient
silhouette-based sampling strategy. During refinement, we introduce an efficient trajectory-aware
scheme within 20 minutes, which jointly encodes the spatial, temporal consistency
and semantic cues along the designed trajectory, achieving photo-realistic bullet-time
viewing experience of human activities. Extensive experiments demonstrate the effectiveness
of our approach for convenient interactive bullet-time design and photo-realistic
human free-viewpoint video generation.

Neural Free-Viewpoint Performance Rendering under Complex Human-object Interactions

Guoxing Sun
Xin Chen
Yizhang Chen
Anqi Pang
Pei Lin
Yuheng Jiang
Lan Xu
Jingyi Yu
Jingya Wang

4D reconstruction of human-object interaction is critical for immersive VR/AR experience
and human activity understanding. Recent advances still fail to recover fine geometry
and texture results from sparse RGB inputs, especially under challenging human-object
interaction scenarios. In this paper, we propose a neural human performance capture
and rendering system to generate both high-quality geometry and photo-realistic texture
of both humans and objects under challenging interaction scenarios in arbitrary novel
views, from only sparse RGB streams. To deal with complex occlusions caused by human-object
interactions, we adopt a layer-wise scene decoupling strategy and perform volumetric
reconstruction and neural rendering of the human and object. Specifically, for geometry
reconstruction, we propose an interaction-aware human-object capture scheme that jointly
considers the human reconstruction and object reconstruction with their correlations.
Occlusion-aware human reconstruction and robust human-aware object tracking are proposed
for consistent 4D human-object dynamic reconstruction. For neural texture rendering,
we propose a layer-wise human-object rendering scheme, which combines direction-aware
neural blending weight learning and spatial-temporal texture completion to provide
high-resolution and photo-realistic texture results in the occluded scenarios. Extensive
experiments demonstrate the effectiveness of our approach to achieve high-quality
geometry and texture reconstruction in free viewpoints for challenging human-object
interactions.

Semi-supervised Learning via Improved Teacher-Student Network for Robust 3D Reconstruction of Stereo Endoscopic Image

Hongkuan Shi
Zhiwei Wang
Jinxin Lv
Yilang Wang
Peng Zhang
Fei Zhu
Qiang Li

3D reconstruction of stereo endoscope images, as an enabling technique for varied surgical
systems, e.g., medical droids, navigations, etc., suffers from severe overfitting
problems due to scarce labels. Semi-supervised learning based on Teacher-Student Network
(TSN) is a potential solution, which utilizes a supervised teacher model trained on
available labeled data to teach a student model on all images via assigning them pseudo
labels. However, TSN often faces a dilemma: if given only a few labeled endoscope images,
the teacher model will be trained to be defective and induce highly noisy pseudo labels,
degrading the student model significantly. To solve this, we propose an improved TSN
for a robust 3D reconstruction of stereo endoscope images. Specifically, two novel
modules are introduced: 1) a semi-supervised teacher model based on adversarial learning
to produce mostly correct pseudo labels by forcing a consistency in predictions for
both labeled and unlabeled data, and 2) a confidence network to further filter out
noisy pseudo labels by estimating a confidence for each prediction of the teacher
model. By doing so, the student model is able to distill knowledge from more accurate
and noiseless pseudo labels, thus achieving improved performance. Experimental results
on two public datasets show that our improved TSN outperforms
the state of the art, reducing the averaged disparity error by at least 13.5%.
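
The role of the confidence network can be sketched as a mask on the student's training loss: pseudo labels whose estimated confidence is low are simply ignored. The abstract does not specify the loss or the filtering rule, so the L1 disparity loss, the hard threshold, and the function name below are assumptions for illustration.

```python
def masked_l1(student_disparity, pseudo_disparity, confidence, threshold=0.8):
    """Mean absolute disparity error over pixels whose pseudo label is trusted.

    confidence holds one value in [0, 1] per pixel, as a confidence network
    might produce; threshold is a hypothetical cut-off.
    """
    kept = [
        (s, p)
        for s, p, c in zip(student_disparity, pseudo_disparity, confidence)
        if c >= threshold
    ]
    if not kept:
        return 0.0  # nothing trustworthy in this sample, so it contributes no loss
    return sum(abs(s - p) for s, p in kept) / len(kept)


student = [10.0, 12.0, 30.0]
pseudo = [11.0, 12.0, 50.0]       # the last pseudo label is badly wrong
confidence = [0.9, 0.95, 0.2]     # and is flagged as unreliable
print(masked_l1(student, pseudo, confidence))
```

Without the mask, the single noisy pseudo label would dominate the loss in this toy example; with it, the student learns only from the two reliable pixels.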

SESSION: Session 32: Social Multimedia

FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network

Qiang Hou
Weiqing Min
Jing Wang
Sujuan Hou
Yuanjie Zheng
Shuqiang Jiang

Food logo detection plays an important role in multimedia for its wide real-world
applications, such as food recommendation of the self-service shop and infringement
detection on e-commerce platforms. A large-scale food logo dataset is urgently needed
for developing advanced food logo detection algorithms. However, there are no available
food logo datasets with food brand information. To support efforts towards food logo
detection, we introduce the dataset FoodLogoDet-1500, a new large-scale publicly available
food logo dataset, which has 1,500 categories, about 100,000 images and about 150,000
manually annotated food logo objects. We describe the collection and annotation process
of FoodLogoDet-1500, analyze its scale and diversity, and compare it with other logo
datasets. To the best of our knowledge, FoodLogoDet-1500 is the first largest publicly
available high-quality dataset for food logo detection. The challenge of food logo
detection lies in the large-scale categories and similarities between food logo categories.
For that, we propose a novel food logo detection method Multi-scale Feature Decoupling
Network (MFDNet), which decouples classification and regression into two branches
and focuses on the classification branch to solve the problem of distinguishing multiple
food logo categories. Specifically, we introduce the feature offset module, which
utilizes the deformation-learning for optimal classification offset and can effectively
obtain the most representative features of classification in detection. In addition,
we adopt a balanced feature pyramid in MFDNet, which pays attention to global information,
balances the multi-scale feature maps, and enhances feature extraction capability.
Comprehensive experiments on FoodLogoDet-1500 and two other popular benchmark logo
datasets demonstrate the effectiveness of the proposed method. The code and FoodLogoDet-1500
can be found at https://github.com/hq03/FoodLogoDet-1500-Dataset.

Cross-View Representation Learning for Multi-View Logo Classification with Information Bottleneck

Jing Wang
Yuanjie Zheng
Jingqi Song
Sujuan Hou

Multi-view logo classification is a challenging task because logo images are misaligned
across viewpoints and logo appearance shows large intra-class and small inter-class
variation. Cross-view data can represent objects from different
views and thus provide complementary information for data analysis. However, most
existing multi-view algorithms usually maximize the correlation between different
views for consistency. Those methods ignore the interaction among different views
and may cause semantic bias during the process of common feature learning. In this
paper, we apply the information bottleneck (IB) to multi-view learning to extract
the features of one category that are common across views, in a method named Dual-View Information
Bottleneck representation (Dual-view IB). To the best of our knowledge, this is the
first cross-view learning method for logo classification. Specifically, we maximize
the mutual information between the representations of the two views to achieve the
preservation of key features in the classification task, while eliminating the redundant
information that is not shared between the two views. In addition, due to the imbalance
of samples and limited computing resources, we further introduce a novel Pair Batch
Data Augmentation (PB) algorithm for the Dual-view IB model, which applies augmentations
from a learned policy based on replicated instances of two samples within the same
batch. Comprehensive experiments on three existing benchmark datasets demonstrate
the effectiveness of the proposed method, which outperforms state-of-the-art methods.
The proposed method is expected to further the development of cross-view
representation learning.

Parametric Reshaping of Portraits in Videos

Xiangjun Tang
WenXin Sun
Yong-Liang Yang
Xiaogang Jin

Sharing short personalized videos to various social media networks has become quite
popular in recent years. This raises the need for digital retouching of portraits
in videos. However, applying portrait image editing directly on portrait video frames
cannot generate smooth and stable video sequences. To this end, we present a robust
and easy-to-use parametric method to reshape the portrait in a video to produce smooth
retouched results. Given an input portrait video, our method consists of two main
stages: stabilized face reconstruction, and continuous video reshaping. In the first
stage, we start by estimating face rigid pose transformations across video frames.
Then we jointly optimize multiple frames to reconstruct an accurate face identity,
followed by recovering face expressions over the entire video. In the second stage,
we first reshape the reconstructed 3D face using a parametric reshaping model reflecting
the weight change of the face, and then utilize the reshaped 3D face to guide the
warping of video frames. We develop a novel signed distance function based dense mapping
method for the warping between face contours before and after reshaping, resulting
in stable warped video frames with minimum distortions. In addition, we use the 3D
structure of the face to correct the dense mapping to achieve temporal consistency.
We generate the final result by minimizing the background distortion through optimizing
a content-aware warping mesh. Extensive experiments show that our method is able to
create visually pleasing results by adjusting a simple reshaping parameter, which
facilitates portrait video editing for social media and visual effects.

Human Attributes Prediction under Privacy-preserving Conditions

Anshu Singh
Shaojing Fan
Mohan Kankanhalli

Human attributes prediction in visual media is a well-researched topic with a major
focus on human faces. However, face images are often of high privacy concern as they
can reveal an individual's identity. How to balance this trade-off between privacy
and utility is a key problem among researchers and practitioners. In this study, we
make one of the first attempts to investigate the human attributes (emotion, age,
and gender) prediction under the different de-identification (eyes, lower-face, face,
and head obfuscation) privacy scenarios. We first constructed the Diversity in People
and Context Dataset (DPaC). We then performed a human study with eye-tracking on how
humans recognize facial attributes without the presence of face and context. Results
show that in an image, situational context is informative of a target's attributes.
Motivated by our human study, we proposed a multi-tasking deep learning model - Context-Guided
Human Attributes Prediction (CHAPNet), for human attributes prediction under privacy-preserving
conditions. Extensive experiments on DPaC and three commonly used benchmark datasets
demonstrate the superiority of CHAPNet in leveraging the situational context for a
better interpretation of a target's attributes without the full presence of the target's
face. Our research demonstrates the feasibility of visual analytics under de-identification
for privacy.

Multi-Modal Sarcasm Detection with Interactive In-Modal and Cross-Modal Graphs

Bin Liang
Chenwei Lou
Xiang Li
Lin Gui
Min Yang
Ruifeng Xu

Sarcasm is a peculiar and sophisticated linguistic act that expresses the incongruity
of someone's implied sentiment, and it is a pervasive phenomenon on social
media platforms. Compared with sarcasm detection purely on texts, multi-modal sarcasm
detection is better suited to the rapidly growing social media platforms, where people
are interested in creating multi-modal messages. When focusing on multi-modal
sarcasm detection for tweets consisting of texts and images on Twitter, the key
to improving the performance of multi-modal sarcasm detection becomes how
to determine the incongruity relations between texts and images. In this paper, we
investigate multi-modal sarcasm detection from a novel perspective, so as to determine
the sentiment inconsistencies within a certain modality and across different modalities
by constructing heterogeneous in-modal and cross-modal graphs (InCrossMGs) for each
multi-modal example. Based on these graphs, we explore an interactive graph convolution network
(GCN) structure to jointly and interactively learn the incongruity relations of in-modal
and cross-modal graphs for determining the significant clues in sarcasm detection.
Experimental results demonstrate that our proposed model achieves state-of-the-art
performance in multi-modal sarcasm detection.

Linking the Characters: Video-oriented Social Graph Generation via Hierarchical-cumulative GCN

Shiwei Wu
Joya Chen
Tong Xu
Liyi Chen
Lingfei Wu
Yao Hu
Enhong Chen

Recent years have witnessed the booming of online video platforms. Along this line,
a graph illustrating the social relations among characters has long been expected not
only to help audiences better understand the story, but also to support the
fine-grained video analysis task in a semantic way. Unfortunately, though we humans
can easily infer the social relations among characters, it is still an extremely
challenging task for intelligent systems to automatically capture social relations
by absorbing multi-modal cues. Besides, such systems fail to describe the relations among
multiple characters from a graph-generation perspective. To that end, inspired by the
human ability to infer social relationships, we propose a novel Hierarchical-Cumulative
Graph Convolutional Network (HC-GCN) to generate the social relation graph for multiple
characters in the video. Specifically, we first integrate the short-term multi-modal
cues, including visual, textual and audio information, to generate the frame-level
graphs for a subset of characters via a multimodal graph convolution technique. When dealing
with the video-level aggregation task, we design an end-to-end framework to aggregate
all frame-level subgraphs along the temporal trajectory, which results in a global
video-level social graph with various social relationships among multiple characters.
Extensive validations on two real-world large-scale datasets demonstrate the effectiveness
of our proposed method compared with SOTA baselines.

SESSION: Session 33: Multimedia Grand Challenge

Overview of Tencent Multi-modal Ads Video Understanding

Zhenzhi Wang
Zhimin Li
Liyu Wu
Jiangfeng Xiong
Qinglin Lu

Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming
to comprehensively understand ads videos. Our challenge includes two tasks: video
structuring and multi-label classification. Video structuring asks the participants
to accurately predict both the scene boundaries and the multi-label categories of
each scene based on a fine-grained and ads-related category hierarchy. This task will
advance the foundation of comprehensive ads video understanding, which has a significant
impact on many applications in ads, such as video recommendation and user behavior
analysis. This paper presents an overview of the video structuring task in our grand
challenge, including the background of ads videos, an elaborate description of this
task, our proposed dataset, the evaluation protocol, and our baseline model. By ablating
the key components of our baseline, we would like to reveal the main challenges of
this task and provide useful guidance for future research in this area.

Better Learning Shot Boundary Detection via Multi-task

Haoxin Zhang
Zhimin Li
Qinglin Lu

Shot boundary detection (SBD) plays an important role in video understanding, since
most recent works take the shot as minimal granularity instead of frames for upstream
tasks. However, the large variations of hard-cut and gradual-change transitions within
shots significantly limit the performance of SBD. To deal with the variations, we
propose a multi-task architecture called Transnet++. Transnet++ disentangles the two
types of transition and adopts two separate branches to predict them respectively.
The two branches share the same video knowledge space and their results are fused for
final prediction. Moreover, we propose a spatial attention module (SAM) to enhance
the feature representations, which suffer from redundant padding regions. Meanwhile,
a temporal attention module (TAM) is applied to capture the long-term information
of the video for alleviating the over-segmentation problem. Experimental results (91.16%
F1-score) on the Tencent AVS Dataset demonstrate the effectiveness and superiority of
Transnet++ for SBD.

Facial Micro-Expression Generation based on Deep Motion Retargeting and Transfer Learning

Xinqi Fan
Ali Raza Shahid
Hong Yan

Facial micro-expression (FME) refers to a brief spontaneous facial movement that can
reveal a person's genuine emotion. One challenge in facial micro-expression analysis is the
lack of data. Fortunately, generative deep neural network models can assist in the
creation of desired images. However, the issues for micro-expressions are that the facial
variations are too subtle to capture and that the limited training data may make feature
extraction difficult. To address these issues, we developed a deep motion retargeting
and transfer learning based facial micro-expression generation model (DMT-FMEG). First,
to capture subtle variations, we employed a deep motion retargeting (DMR) network
that can learn keypoints in an unsupervised manner, estimate motions, and generate
desired images. Second, to enhance the feature extraction ability, we applied deep
transfer learning (DTL) by borrowing knowledge from macro-expression images. We evaluated
our method on three datasets, CASME II, SMIC, and SAMM, and found that it showed satisfactory
results on all of them. With this method, we won second place
in the generation task of the FME 2021 challenge.

Deadline and Priority-aware Congestion Control for Delay-sensitive Multimedia Streaming

Chao Zhou
Wenjun Wu
Dan Yang
Tianchi Huang
Liang Guo
Bing Yu

Most applications of interactive multimedia require the data to arrive within a
specific acceptable end-to-end latency (i.e., to meet a deadline). To avoid wasted
effort, the content must reach the destination before the deadline. In our work,
we propose DAP (Deadline And Priority-aware congestion control) to achieve high throughput
within acceptable end-to-end latency, especially to send high-priority packets while
meeting deadline requirements. DAP is mainly composed of two modules: i) the scheduler
decides which packet should be sent first w.r.t. the reward function, fully
considering the packets' priority, deadline, and current network conditions; ii) the
deadline-sensitive congestion control module transmits packets with high efficiency
while guaranteeing the end-to-end latency. Specifically, we propose an improved packet-pair
scheme to set the best congestion window corresponding to the Bandwidth-Delay Product
and to update the instantaneous sending rate according to the current queue length. Experimental results
demonstrate the strong performance of our scheme, and DAP ranks first in both
the training phase and final phase of the ACM MM 2021 Grand Challenge: Meet Deadline
Requirements.
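
The packet-pair and Bandwidth-Delay Product ideas named in this abstract can be illustrated with a minimal sketch. This is not the authors' DAP implementation: it assumes the classic packet-pair estimate (packet size divided by the arrival gap of two back-to-back packets), and the function names, the 1500-byte segment size, and the use of a minimum RTT are hypothetical choices made for illustration.

```python
def packet_pair_bandwidth(packet_bytes: int, gap_s: float) -> float:
    """Estimate bottleneck bandwidth (bits/s) from two back-to-back packets.

    gap_s is the spacing between their arrivals at the receiver
    (classic packet-pair assumption, not taken from the paper).
    """
    if gap_s <= 0:
        raise ValueError("arrival gap must be positive")
    return packet_bytes * 8 / gap_s


def bdp_window(bandwidth_bps: float, min_rtt_s: float, mss_bytes: int = 1500) -> int:
    """Congestion window, in segments, that matches the bandwidth-delay product."""
    bdp_bytes = bandwidth_bps / 8 * min_rtt_s
    # Always allow at least one segment in flight.
    return max(1, int(bdp_bytes // mss_bytes))


# Hypothetical usage: a 1500-byte pair arriving 1 ms apart, 125 ms minimum RTT.
estimate = packet_pair_bandwidth(1500, 0.001)
window = bdp_window(estimate, 0.125)
```

A window sized to the bandwidth-delay product keeps the link busy without building a standing queue, which is why a deadline-sensitive controller would aim for it.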

LSSNet: A Two-stream Convolutional Neural Network for Spotting Macro- and Micro-expression
in Long Videos

Wang-Wang Yu
Jingwen Jiang
Yong-Jie Li

Macro- and micro-expression spotting, which aims to locate the occurrence intervals
of expressions in long face videos, is a very challenging task. In this paper, we propose an efficient two-stream network
named location suppression based spotting network (LSSNet), which includes three parts.
First, the optical flow is extracted using the traditional TV-L1 algorithm which captures
subtle facial movements while adding temporal information to alleviate the problem
of insufficient samples. Then, fixed length features are extracted from the sampled
optical flow and raw images by an I3D model, which is used to set sliding windows.
Finally, location suppression modules (LSMs) are added to the pyramidal convolutional
neural network (CNN) to suppress proposals whose intervals are too long or too short.
In addition, we use two different methods, named top_k and top_threshold, for validation.
We adopt leave-one-subject-out (LOSO) to train our model on CAS(ME)2 and SAMM-LV.
Experimental results show that our LSSNet achieves the state-of-the-art result with
top_threshold, especially on the CAS(ME)2 dataset. The code is available at https://github.com/williamlee91/mer_spot.

Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning

Chengbo Dong
Xinru Chen
Aozhu Chen
Fan Hu
Zihan Wang
Xirong Li

This paper describes our bronze-medal solution for the video captioning task of the
ACMMM2021 Pre-Training for Video Understanding Challenge. We start from the Bottom-Up-Top-Down
model, with technical improvements on both video content encoding and caption decoding.
For encoding, we propose to extract multi-level video features that describe holistic
scenes and fine-grained key objects, respectively. The scene-level and object-level
features are enhanced separately by multi-head self-attention mechanisms before feeding
them into the decoding module. Towards generating content-relevant and human-like
captions, we train our network end-to-end by semantic-reinforced learning. Finally,
in order to select the best caption from captions produced by distinct models, we
perform caption reranking by cross-modal matching between a given video and each candidate
caption. Both internal experiments on the MSR-VTT test set and external evaluations
by the challenge organizers justify the viability of the proposed solution.
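
The caption reranking step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: it presumes that the video and each candidate caption have already been embedded into a shared vector space by some cross-modal model, and the names `cosine` and `rerank_captions` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_captions(video_vec, candidates):
    """Sort candidate captions by similarity to the video, best first.

    candidates: list of (caption_text, caption_embedding) pairs, e.g. one
    caption per captioning model (assumed input format).
    """
    ranked = sorted(candidates, key=lambda item: cosine(video_vec, item[1]), reverse=True)
    return [caption for caption, _ in ranked]


# Hypothetical usage with toy 2-D embeddings.
best_first = rerank_captions(
    [1.0, 0.0],
    [("a", [0.0, 1.0]), ("b", [1.0, 0.1]), ("c", [-1.0, 0.0])],
)
```

The first element of the returned list is the caption that would be submitted.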

Facial Prior Based First Order Motion Model for Micro-expression Generation

Yi Zhang
Youjun Zhao
Yuhang Wen
Zixuan Tang
Xinhua Xu
Mengyuan Liu

Spotting facial micro-expressions in videos has various potential applications
in fields including clinical diagnosis and interrogation, but this task remains
difficult due to the limited scale of training data. To solve this problem, this paper
tries to formulate a new task called micro-expression generation and then presents
a strong baseline which combines the first order motion model with facial prior knowledge.
Given a target face, we intend to drive the face to generate micro-expression videos
according to the motion patterns of source videos. Specifically, our new model involves
three modules. First, we extract facial prior features from a region focusing module.
Second, we estimate facial motion using key points and local affine transformations
with a motion prediction module. Third, an expression generation module is used to drive
the target face to generate videos. We train our model on public CASME II, SAMM and
SMIC datasets and then use the model to generate new micro-expression videos for evaluation.
Our model achieves first place in the Facial Micro-Expression Challenge 2021 (MEGC2021),
where our superior performance is verified by three experts with Facial Action Coding
System certification. Source code is available at https://github.com/Necolizer/Facial-Prior-Based-FOMM.

Rethinking the Impacts of Overfitting and Feature Quality on Small-scale Video Classification

Xuansheng Wu
Feichi Yang
Tong Zhou
Xinyue Lin

While Transformers have yielded impressive results for video classification on large
datasets recently, simpler models without the transformer architecture can be promising
for small datasets. In this paper, we propose three major techniques to improve feature
quality and another three to alleviate overfitting in an attempt to make lightweight
models achieve higher performance. In particular, we enhance features of Image Flow
by combining temporal information, multi-level features of CNNs, and Text embedding.
We alleviate overfitting by removing redundant modalities, fine-tuning the dropout rate, and
augmenting data. In the 2021 Tencent Advertisement Algorithm Competition, the baseline
model achieved a GAP score of 0.8019 offline with our strategies. It's worth mentioning
that our design works well with the 10-fold method, which produces our final submitted
model with a GAP score of 0.8210 online, ranking 5th among 287 teams. In addition,
our solution is among the fastest within the top 10 teams.
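
For readers unfamiliar with the GAP score quoted above, the sketch below computes Global Average Precision in the pooled-ranking form popularized by the YouTube-8M benchmark: all (confidence, correct?) predictions are ranked together, and precision is accumulated at each correct prediction. The competition's exact protocol (for example, how many predictions are kept per video) is not stated in the abstract, so this is an assumption, and the function name `global_average_precision` is hypothetical.

```python
def global_average_precision(predictions, total_positives=None):
    """GAP over pooled predictions.

    predictions: iterable of (confidence, is_correct) pairs from all videos.
    total_positives: number of ground-truth labels; defaults to the number
    of correct predictions present in the list (an assumption).
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    if total_positives is None:
        total_positives = sum(1 for _, ok in ranked if ok)
    if total_positives == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, (_, ok) in enumerate(ranked, start=1):
        if ok:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / total_positives


# Hypothetical usage: three pooled predictions, two of them correct.
toy_gap = global_average_precision([(0.9, True), (0.8, False), (0.7, True)])
```

Because confident wrong predictions push correct ones down the ranking, the metric rewards well-calibrated scores as well as correct labels.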

A Gradient Balancing Approach for Robust Logo Detection

Fuxing Leng

This paper presents the 1st place solution to the Grand Challenge of ACM MM2021 Robust
Logo Detection. We build our end-to-end solution on top of Cascade RCNN (using Res2Net101
as backbone). Through careful observation during training, we find that the model
performance is limited by imbalanced gradients from different classes of the long-tailed
dataset. We adopt a gradient balancing approach to tackle this problem. Our approach
reweighs the gradients of each class to guide the training process towards a balance
between all classes. Moreover, we design a series of data augmentation policies and
propose a progressive data augmentation strategy to train our model to deal with adversarial
samples. We demonstrate the accuracy and robustness of our method by achieving 70.2448
mAP on leaderboard A, and 63.8793 mAP on leaderboard B, which contains adversarial
images.

Multi-modal Representation Learning for Video Advertisement Content Structuring

Daya Guo
Zhaoyang Zeng

Video advertisement content structuring aims to segment a given video advertisement
and label each segment on various dimensions, such as presentation form, scene, and
style. Different from real-life videos, video advertisements contain sufficient and
useful multi-modal content like caption and speech, which provides crucial video semantics
and would enhance the structuring process. In this paper, we propose a multi-modal
encoder to learn multi-modal representation from video advertisements by interacting
between video-audio and text. Based on multi-modal representation, we then apply Boundary-Matching
Network to generate temporal proposals. To make the proposals more accurate, we refine
generated proposals by scene-guided alignment and re-ranking. Finally, we incorporate
proposal located embeddings into the introduced multi-modal encoder to capture temporal
relationships between local features of each proposal and global features of the whole
video for classification. Experimental results show that our method achieves significant
improvement compared with several baselines and ranks first on the task of Multi-modal
Ads Video Understanding in the ACM Multimedia 2021 Grand Challenge. An ablation study further
shows that leveraging multi-modal content like caption and speech in video advertisements
significantly improves the performance.

Phoenix: Combining Highest-Profit First Scheduling and Responsive Congestion Control for Delay-sensitive
Multimedia Transmission

Haozhe Li

The rapid development of real-time interactive applications has brought new challenges
to ensuring users' quality of experience (QoE). These applications have deadline requirements
and the characteristics of block transmission. This not only requires high throughput
and low latency, but also requires considering the transmission order of data
blocks to ensure that the data blocks arrive before the deadline. However, the existing
congestion control algorithms and scheduling algorithms do not fit well with the characteristics
of real-time interactive applications. Therefore, a high-performance hybrid control
algorithm is urgently needed to ensure the user's QoE. In response to this problem,
this paper proposes Phoenix, which combines a scheduling algorithm based on transmission profit with a responsive
congestion control algorithm, and evaluates it in simulations across a variety of scenarios. Experimental
results show that Phoenix performs very well in a variety of scenarios, and the average
QoE is 33.7% higher than that of BBR+EDF.

VidVRD 2021: The Third Grand Challenge on Video Relation Detection

Wei Ji
Yicong Li
Meng Wei
Xindi Shang
Junbin Xiao
Tongwei Ren
Tat-Seng Chua

The ACM Multimedia 2021 Video Relation Understanding Challenge is the third grand challenge
that aims at exploring the relationships between subjects and objects appearing in videos
for fine-grained and high-level video understanding. Given a video, the video relation
detection model should output a series of relation triplets <subject, predicate, object>
and the corresponding trajectories of the subject and object. The goal of this task is
to promote research on developing video semantic understanding models, so as to perform
complex inference and mining of visual knowledge in videos. In this paper, we give
a comprehensive and detailed introduction to this task, summarize the algorithms proposed
in the last few years, and propose future directions for research on this task.

A Simple and Effective Baseline for Robust Logo Detection

Weipeng Xu
Ye Liu
Daquan Lin

This article introduces the solution of the third-place team, green hand, for the ACM Multimedia
2021 Security AI Challenger Phase 7: Robust defense competition for e-commerce logo
detection. In this work, we use the DetectoRS model and 4 strategies: resampling,
equalization loss v2, data augmentation, and weighted boxes fusion. They aim to solve
the three main problems in the competition: small target detection, long-tail
distribution, and adversarial examples. The final results show that our model achieves
an evaluation score of 0.654611 in the semi-final, ranking first in the semi-final
and third in the final among all 36,489 teams.
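
Of the strategies listed above, weighted boxes fusion is the easiest to illustrate in a few lines. The sketch below shows only its core step, assuming a cluster of overlapping boxes for one object has already been formed: coordinates are averaged with confidence weights and the fused confidence is the mean. It is a simplification for illustration, not the team's pipeline; the helper names `iou` and `fuse_cluster` and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(boxes, scores):
    """Fuse boxes that refer to the same object.

    Each coordinate is the confidence-weighted mean of the inputs, so
    confident detections pull the fused box toward themselves.
    """
    total = sum(scores)
    fused = tuple(
        sum(s * box[i] for box, s in zip(boxes, scores)) / total
        for i in range(4)
    )
    return fused, total / len(scores)


# Hypothetical usage: two equally confident detections of one logo.
fused_box, fused_score = fuse_cluster([(0, 0, 10, 10), (2, 2, 12, 12)], [0.5, 0.5])
```

Unlike non-maximum suppression, which discards all but one box per cluster, this averaging uses every detection, which is why it is often applied when ensembling detectors.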

Robust Logo Detection in E-Commerce Images by Data Augmentation

Hang Chen
Xiao Li
Zefan Wang
Xiaolin Hu

Logo detection is an important task in intellectual property protection in e-commerce.
In this paper, we introduce our solution for the ACM MM2021 Robust Logo Detection Grand
Challenge. The competition requires the detection of logos (515 categories) in e-commerce
images. It is made challenging by a long-tail distribution, small objects, and
different types of noise. To overcome these challenges, we built a highly optimized
and robust detector. We first tested many effective techniques for general object
detection and then focused on data augmentation. We found that data augmentation was
effective in improving the performance and robustness of logo detection. Based on
the combination of these techniques, we achieved APs of 64.6% and 61.3% on the clean
and noisy datasets respectively, which were improved by 8.1% and 19.5% relative to
the official baseline. We ranked 5th among 36,489 teams in the competition.

Facial Action Unit-based Deep Learning Framework for Spotting Macro- and Micro-expressions
in Long Video Sequences

Bo Yang
Jianming Wu
Zhiguang Zhou
Megumi Komiya
Koki Kishimoto
Jianfeng Xu
Keisuke Nonaka
Toshiharu Horiuchi
Satoshi Komorita
Gen Hattori
Sei Naito
Yasuhiro Takishima

In this paper, we utilize facial action unit (AU) detection to construct an end-to-end
deep learning framework for the macro- and micro-expression spotting task in long
video sequences. The proposed framework focuses on individual components of facial
muscle movement rather than processing the whole image, which eliminates the influence
of image changes caused by noise, such as body or head movement. Compared with existing
models deploying deep learning methods with classical Convolutional Neural Network
(CNN) models, the proposed framework utilizes Gated Recurrent Unit (GRU) or Long Short-term
Memory (LSTM) or our proposed Concat-CNN models to learn the characteristic correlation
between AUs of distinctive frames. The Concat-CNN uses three convolutional kernels
with different sizes to observe features of different duration and emphasizes both
local and global mutation features by changing dimensionality (max-pooling size) of
the output space. Our proposal achieves state-of-the-art performance in terms
of overall F1-scores: 0.2019 on CAS(ME)2-cropped, 0.2736 on SAMM Long Video, and 0.2118
on CAS(ME)2, which not only outperforms the baseline but also ranks 3rd in the
FME Challenge 2021 for the combined datasets of CAS(ME)2-cropped and SAMM-LV.
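
To make the F1-scores above concrete, the following sketch scores spotted intervals against ground-truth intervals. It assumes that a predicted interval counts as a true positive when its intersection over union with an unmatched ground-truth interval is at least 0.5, and that frame indices are inclusive; both are assumptions about the evaluation protocol rather than details given in the abstract, and the function names are hypothetical.

```python
def interval_iou(a, b):
    """IoU of two inclusive frame intervals given as (onset, offset)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def spotting_f1(predicted, ground_truth, iou_threshold=0.5):
    """F1-score where each ground-truth interval can be matched at most once."""
    matched = set()
    true_pos = 0
    for pred in predicted:
        for idx, truth in enumerate(ground_truth):
            if idx not in matched and interval_iou(pred, truth) >= iou_threshold:
                matched.add(idx)
                true_pos += 1
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage: three spotted intervals, two annotated expressions.
toy_f1 = spotting_f1([(10, 20), (50, 60), (100, 110)], [(12, 22), (200, 210)])
```

Both spurious detections and missed expressions lower the score, which helps explain why F1 values on long videos tend to be small in absolute terms.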

NJU MCG - Sensetime Team Submission to Pre-training for Video Understanding Challenge
Track II

Liwei Jin
Haoyue Cheng
Su Xu
Wayne Wu
Limin Wang

This paper presents the method that underlies our submission to the Pre-training for
Video Understanding Challenge Track II. We follow the basic pipeline of temporal segment
networks [20] and further improve its performance in several aspects. Specifically,
we use the latest transformer-based architectures, e.g., Swin Transformer, DeiT, CLIP-ViT,
to enhance the representation power. We analyze different pre-training proxy tasks
on the official pre-training datasets and other open-source video datasets. With these
techniques, we derive an ensemble of deep models that attains a high classification accuracy
(Top-1 accuracy 62.28%) on the testing set and secures first place in Track II of
this challenge.

Research on Micro-Expression Spotting Method Based on Optical Flow Features

He Yuhong

This paper proposes an automatic micro-expression spotting method with high accuracy
and high robustness. Because micro-expressions have small amplitude and short duration,
accurately capturing their subtle movements is a complex problem.
The optical flow method is applied to estimate the motion trend of the facial regions.
Because head shaking is a major cause of the high false-positive rate of
micro-expression spotting, a reliable face alignment method becomes crucial. According
to the optical flow of the nose tip region, the cropping box is adjusted several times
to keep the relative position between the face and the cropping box stable. On
this basis, the optical flow features from the 14 regions of interest on the face
are used to build a feature matrix, and a wave peak location technique is proposed
to accurately locate the moment when the micro-expression occurs on the time-domain
curve of the features. The experimental results on the CAS(ME)2-cropped and the SAMM
Long Videos datasets show that our method performs significantly better than the baseline
method and has high application value in various scenarios.
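
The idea of locating peaks on a time-domain feature curve can be sketched generically. The code below is not the paper's wave peak location technique, whose details are not given in the abstract; it is a plain local-maximum search with a height threshold and a minimum spacing between peaks, and the parameter names are hypothetical.

```python
def find_peaks(curve, min_height, min_distance=1):
    """Return indices of local maxima that are high enough and well separated.

    curve: per-frame feature magnitude (e.g. optical-flow strength).
    min_height: peaks below this value are ignored.
    min_distance: minimum index gap between two kept peaks.
    """
    candidates = [
        i for i in range(1, len(curve) - 1)
        if curve[i] >= min_height
        and curve[i] > curve[i - 1]
        and curve[i] >= curve[i + 1]
    ]
    kept = []
    # Keep the strongest peaks first and drop weaker ones that are too close.
    for i in sorted(candidates, key=lambda idx: curve[idx], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


# Hypothetical usage on a toy feature curve.
toy_curve = [0, 1, 0, 0, 5, 4, 6, 0, 2, 0]
sparse_peaks = find_peaks(toy_curve, min_height=2, min_distance=3)
dense_peaks = find_peaks(toy_curve, min_height=2, min_distance=2)
```

The spacing constraint matters for micro-expressions because a single brief movement can otherwise produce several neighbouring maxima that would each be counted as a detection.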

A Solution to Multi-modal Ads Video Tagging Challenge

Hao Wu
Jiajie Wang
Yuanzhe Gu
Peisen Zhao
Zhonglin Zu

In this paper, we present our solution to the Multi-modal Ads Video Tagging Challenge
of Tencent Advertising Algorithm Competition in ACM Multimedia 2021 Grand Challenges.
We extend the baseline model by redesigning the visual feature extraction procedure
and we modify the loss function to cope with sparse positive targets. Moreover, we
propose Semi-supervised Learning with Negative Masking to leverage both labeled data
and unlabeled data from the preliminary contest which effectively enhances the training
process. We further utilize Cross-Class Relevance Learning to boost the performance.
We achieve a GAP score of 0.8237 via model ensemble and rank second among all
submissions in the challenge.

FAMGAN: Fine-grained AUs Modulation based Generative Adversarial Network for Micro-Expression
Generation

Yifan Xu
Sirui Zhao
Huaying Tang
Xinglong Mao
Tong Xu
Enhong Chen

Micro-expressions (MEs) are significant and effective clues to reveal the true feelings
and emotions of human beings, and thus ME analysis is widely used in different fields
such as medical diagnosis, interrogation and security. However, it is extremely difficult
to elicit and label MEs, resulting in a lack of sufficient ME data for ME analysis.
To address this challenge and inspired by current face generation technology,
in this paper we introduce a Generative Adversarial Network based on fine-grained Action
Unit (AU) modulation to generate ME sequences (FAMGAN). Specifically, after comprehensively
analyzing the factors that lead to inaccurate AU value detection, we perform fine-grained
AU modulation, which includes carefully eliminating the various kinds of noise and dealing
with the asymmetry of AU intensity. Additionally, we incorporate super-resolution
into our model to enhance the quality of the generated images. Through experiments,
we show that the system achieves very competitive results on the Micro-Expression
Grand Challenge (MEGC2021).

Semantic Tag Augmented XlanV Model for Video Captioning

Yiqing Huang
Hongwei Xue
Jiansheng Chen
Huimin Ma
Hongbing Ma

The key to video captioning is to leverage the cross-modal information from both vision
and language perspectives. We propose to leverage semantic tags to bridge the
gap between these modalities rather than directly concatenating or attending to the
visual and linguistic features as in previous works. The semantic tags are the object
tags and the action tags detected in videos, which can be viewed as partial captions
for the input video. To effectively exploit the semantic tags, we design a Semantic
Tag augmented XlanV (ST-XlanV) model which encodes 4 kinds of visual and semantic
features with X-Linear Attention based cross-attention modules. Moreover, tag-related
tasks are also designed in the pre-training stage to help the model exploit
the cross-modal information more fruitfully. The proposed model reaches 5th place in
the pre-training for video captioning challenge with the help of the semantic tags.
Our code will be available at: https://github.com/RubickH/ST-XlanV.

Automated Multi-Modal Video Editing for Ads Video

Qin Lin
Nuo Pang
Zhiying Hong

Video advertising is one of the most effective forms of advertisements because videos
are more attractive, more persuasive, and more informative than images or texts. The increasing
amount of video advertisements requires faster and more intelligent technologies
for video generation. We have developed a multi-modal video editing approach that
can automatically generate advertising video clips from any source videos. We conduct
chronological boundary segmentation of the original video and construct a weighted
directed graph to assemble different segments. Experiments on our video editing datasets
validate the success of the proposed method in producing compelling and consistent advertising
videos.

Rate Adaptation and Block Scheduling for Delay-sensitive Multimedia Applications

Dongyuan Su
Laizhong Cui
Lei Zhang
Yanyan Suo
Yan Qiu

Emerging multimedia applications like VR, AR, etc., exhibit unique transmission features,
such as block-based transmission, dynamic prioritization for different contents, and
deadline-aware delivery, which should be carefully managed but fail to be considered
in the design of existing transmission control algorithms. In this work, we propose
a delay-sensitive congestion control algorithm with a hybrid of coarse-grained and
fine-grained control to improve the QoE scores. The coarse-grained control scheme
maintains a low queuing delay and avoids missing the deadline in the steady state.
The fine-grained control scheme rapidly reacts to the network dynamics based on our
bandwidth estimation model. For the block scheduling, we heuristically model the realistic
priority of each block by examining the trade-off among the remaining time, the remaining
size, and the priority score of each block. Extensive experiments conducted to
evaluate the performance of our solution show that it significantly
outperforms other baseline algorithms.
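
The block scheduling trade-off described above (remaining time, remaining size, and priority score) can be illustrated with a small heuristic. The scoring formula below is an assumption made for illustration and is not the authors' model: blocks that can no longer meet their deadline are deprioritized, and among feasible blocks the score grows with priority and with how little slack is left. The `Block` fields and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    priority: int         # assumption: 0 is the most important class
    remaining_bytes: int
    deadline_s: float     # absolute time by which the block must arrive


def urgency(block: Block, now_s: float, rate_bps: float) -> float:
    """Heuristic score: higher means the block should be sent sooner."""
    remaining_time = block.deadline_s - now_s
    send_time = block.remaining_bytes * 8 / rate_bps
    if remaining_time <= 0 or send_time > remaining_time:
        return float("-inf")  # cannot arrive in time at the current rate
    weight = 1.0 / (1 + block.priority)
    # send_time / remaining_time approaches 1 as the slack disappears.
    return weight * send_time / remaining_time


def next_block(blocks, now_s: float, rate_bps: float) -> Block:
    """Pick the block with the highest urgency (blocks must be non-empty)."""
    return max(blocks, key=lambda b: urgency(b, now_s, rate_bps))


# Hypothetical usage at t = 0 with an estimated rate of 8 Mbit/s.
queue = [
    Block(0, priority=0, remaining_bytes=1_000, deadline_s=1.0),
    Block(1, priority=1, remaining_bytes=100_000, deadline_s=0.5),
    Block(2, priority=0, remaining_bytes=10_000_000, deadline_s=0.2),
]
chosen = next_block(queue, now_s=0.0, rate_bps=8_000_000)
```

In this toy queue the large high-priority block is skipped because it cannot finish before its deadline, so bandwidth goes to a block that can still be delivered on time.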

Video Relation Detection via Tracklet based Visual Transformer

Kaifeng Gao
Long Chen
Yifeng Huang
Jun Xiao

Video Visual Relation Detection (VidVRD) has received significant attention from our
community over recent years. In this paper, we apply the state-of-the-art video object
tracklet detection pipeline, MEGA [7] and deepSORT [27], to generate tracklet proposals.
Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations.
Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware
decoder which performs feature interactions between the tracklets and learnable predicate
query embeddings, and finally predicts the relations. Experimental results strongly
demonstrate the superiority of our method, which outperforms other methods by a large
margin on the Video Relation Understanding (VRU) Grand Challenge in ACM Multimedia
2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.

Group-Level Focus of Visual Attention for Improved Next Speaker Prediction

Chris Birmingham
Kalin Stefanov
Maja J. Mataric

In this work we address the Next Speaker Prediction sub-challenge of the ACM '21 MultiMediate
Grand Challenge. This challenge poses the problem of turn-taking prediction in physically
situated multiparty interaction. Solving this problem is essential for enabling fluent
real-time multiparty human-machine interaction. This problem is made more difficult
by the need for a robust solution that can perform effectively across a wide variety
of settings and contexts. Prior work has shown that current state-of-the-art methods
rely on machine learning approaches that do not generalize well to new settings and
feature distributions. To address this problem, we propose the use of group-level
focus of visual attention as additional information. We show that a simple combination
of group-level focus of visual attention features and publicly available audio-video
synchronizer models is competitive with state-of-the-art methods fine-tuned for the
challenge dataset.

A Multimodal Framework for Video Ads Understanding

Zejia Weng
Lingchen Meng
Rui Wang
Zuxuan Wu
Yu-Gang Jiang

There is a growing trend in placing video advertisements on social platforms for online
marketing, which demands automatic approaches to understand the contents of advertisements
effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal
system to improve the ability of structured analysis of advertising video content.
In our framework, we break down the video structuring analysis problem into two tasks,
i.e., scene segmentation and multi-modal tagging. In scene segmentation, we build
upon a temporal convolution module for temporal modeling to predict whether adjacent
frames belong to the same scene. In multi-modal tagging, we first compute clip-level
visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual
features are further complemented with textual features that are derived using a global-local
attention mechanism to extract useful information from OCR (Optical Character Recognition)
and ASR (Automatic Speech Recognition) outputs. Our solution achieved a score of 0.2470
measured in consideration of localization and prediction accuracy, ranking fourth
in the 2021 TAAC final leaderboard.
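
As a rough illustration of the scene segmentation step, the sketch below turns per-boundary
"same scene" predictions for adjacent frames into scene segments. The function name, the
0.5 threshold, and the input probabilities are hypothetical and are not taken from the paper.

```python
def segments_from_same_scene_probs(same_scene_probs, threshold=0.5):
    """Group frame indices into scenes.

    same_scene_probs[i] is an assumed model output: the probability that
    frame i and frame i + 1 belong to the same scene.
    Returns a list of (start, end) frame index pairs, with end inclusive.
    """
    num_frames = len(same_scene_probs) + 1
    segments = []
    start = 0
    for i, prob in enumerate(same_scene_probs):
        if prob < threshold:  # predicted scene cut between frame i and i + 1
            segments.append((start, i))
            start = i + 1
    segments.append((start, num_frames - 1))
    return segments
```

For six frames with predictions [0.9, 0.2, 0.8, 0.7, 0.1], this yields the segments
(0, 1), (2, 4) and (5, 5).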

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal
Feature Fusion

Beibei Zhang
Fan Yu
Yanxin Gao
Tongwei Ren
Gangshan Wu

To comprehend long-duration videos, the deep video understanding (DVU) task is proposed
to recognize interactions at the scene level and relationships at the movie level, and to answer
questions at these two levels. In this paper, we propose a solution to the DVU task
which applies joint learning of interaction and relationship prediction and multimodal
feature fusion. Our solution handles the DVU task with three joint learning sub-tasks:
scene sentiment classification, scene interaction recognition and super-scene video
relationship recognition, all of which utilize text features, visual features and
audio features, and predict representations in semantic space. Since sentiment, interaction
and relationship are related to each other, we train a unified framework with joint
learning. Then, we answer questions for video analysis in DVU according to the results
of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the
effectiveness of our method.

MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques

Sihan Chen
Xinxin Zhu
Dongze Hao
Wei Liu
Jiawei Liu
Zijia Zhao
Longteng Guo
Jing Liu

The quality of video representation directly decides the performance of video-related
tasks, for both understanding and generation. In this paper, we propose a single-modality
pretrained feature fusion technique, which is composed of a multi-view feature
extraction method and a dedicated multi-modality feature fusion strategy. We conduct
comprehensive ablation studies on the MSR-VTT dataset to demonstrate the effectiveness
of the proposed method, which surpasses the state-of-the-art methods on both the MSR-VTT and
VATEX datasets. We further propose a multi-modality pretrained model finetuning
technique and a dataset augmentation scheme to improve the model's generalization capability.
Based on these two proposed pretraining techniques and the dataset augmentation scheme,
we won first place in the video captioning track of the MM21 pretraining for video
understanding challenge.

CLIP4Caption: CLIP for Video Caption

Mingkang Tang
Zhanyu Wang
Zhenhua LIU
Fengyun Rao
Dian Li
Xiu Li

Video captioning is a challenging task since it requires generating sentences describing
diverse and complex videos. Existing video captioning models lack adequate
visual representation because they neglect the gap between videos and
texts. To bridge this gap, in this paper, we propose a CLIP4Caption framework that
improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language
and forces the model to learn strongly text-correlated video features for text
generation. Besides, unlike most existing models using LSTM or GRU as the sentence
decoder, we adopt a Transformer structured decoder network to effectively learn the
long-range visual and language dependency. Additionally, we introduce a novel ensemble
strategy for captioning tasks. Experimental results demonstrate the effectiveness
of our method on two datasets: 1) on the MSR-VTT dataset, our method achieved a new state-of-the-art
result with a significant gain of up to 10% in CIDEr; 2) on the private test data,
our method ranked 2nd in the ACM MM multimedia grand challenge 2021: Pre-training
for Video Understanding Challenge. Note that our model is trained only on the
MSR-VTT dataset.

The ACM Multimedia 2021 Meet Deadline Requirements Grand Challenge

Jie Zhang
Junjie Deng
Mowei Wang
Yong Cui
Wei Tsang Ooi
Jiangchuan Liu
Xinyu Zhang
Kai Zheng
Yi Li

Delay-sensitive multimedia streaming applications require their data to be delivered
before a deadline to be useful. The data transmitted by these applications can usually
be partitioned into blocks with different priorities, assigned based on the impact
of a block on the Quality of Experience (QoE) if it misses its delivery deadline.
Meeting their deadline requirements is challenging due to the dynamics of the network
and these applications' high demand on network resources. To encourage the research
community to address this challenge, we organize the "Meet Deadline Requirements"
Grand Challenge at ACM Multimedia 2021. This grand challenge provides a simulation
platform onto which the participants can implement their block scheduler and bandwidth
estimator and then benchmark against each other using a common set of application
traces and network traces.
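
To make the block scheduling problem concrete, the following is a minimal sketch of one
possible scheduler: it discards blocks that can no longer meet their deadline under an
estimated bandwidth and then prefers the most important remaining block. The Block fields,
the "smaller value is more important" priority convention, and the function name are
assumptions made for illustration; they are not the challenge platform's API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    priority: int         # assumed convention: smaller value = more important
    remaining_bytes: int
    deadline_s: float     # absolute deadline, in seconds


def pick_next_block(blocks, now_s, est_bandwidth_bps):
    """Return the block to send next, or None if no block can meet its deadline."""
    feasible = []
    for block in blocks:
        time_left = block.deadline_s - now_s
        time_needed = block.remaining_bytes * 8 / est_bandwidth_bps
        if time_left > 0 and time_needed <= time_left:
            feasible.append(block)
    if not feasible:
        return None
    # Most important block first; ties are broken by the earliest deadline.
    return min(feasible, key=lambda b: (b.priority, b.deadline_s))
```

A real scheduler would also need to account for blocks already in flight and for errors
in the bandwidth estimate, which this sketch ignores.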

MultiModal Language Modelling on Knowledge Graphs for Deep Video Understanding

Vishal Anand
Raksha Ramesh
Boshen Jin
Ziyin Wang
Xiaoxiao Lei
Ching-Yung Lin

The natural language processing community has had a major interest in auto-regressive
[4, 13] and span-prediction based language models [7] recently, while knowledge graphs
are often referenced for common-sense based reasoning and fact-checking models. In
this paper, we present an equivalence representation of span-prediction based language
models and knowledge-graphs to better leverage recent developments of language modelling
for multi-modal problem statements. Our method performed well, especially with sentiment
understanding for multi-modal inputs, and discovered potential bias in naturally occurring
videos when compared with movie-data interaction-understanding. We also release a
dataset of an auto-generated questionnaire with ground-truths consisting of labels
spanning across 120 relationships, 99 sentiments, and 116 interactions, among other
labels for finer-grained analysis of model comparisons in the community.

Using Motion Histories for Eye Contact Detection in Multiperson Group Conversations

Eugene Yujun Fu
Michael W. Ngai

Eye contact detection in group conversations is the key to developing artificial mediators
that can understand and interact with a group. In this paper, we propose to model
a group's appearances and behavioral features to perform eye contact detection for
each participant in the conversation. Specifically, we extract the participants' appearance
features at the detection moment, and extract the participants' behavioral features
based on their motion history image, which is encoded with the participants' body
movements within a small time window before the detection moment. In order to attain
powerful representative features from these images, we propose to train a Convolutional
Neural Network (CNN) to model them. A set of relevant features are obtained from the
network, which achieves an accuracy of 0.60 on the validation set in the eye contact
detection challenge in ACM MM 2021. Furthermore, our experimental results also demonstrate
that using both the appearance and the behavioral features of participants can lead to
higher accuracy in eye contact detection than using only one of them.
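
For readers unfamiliar with motion history images, the sketch below implements the classic
frame-differencing formulation: a pixel is set to tau when motion is detected and otherwise
decays by one per frame. The function name, the difference threshold, and the plain
list-of-rows frame format are assumptions for illustration; the encoding used in the paper
may differ.

```python
def motion_history_image(frames, tau, diff_threshold=10):
    """Compute a motion history image from grayscale frames.

    frames is a list of frames, each a list of rows of pixel intensities.
    Recently moved pixels hold values close to tau; older motion fades to 0.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                if abs(cur[y][x] - prev[y][x]) > diff_threshold:
                    mhi[y][x] = tau                    # motion detected now
                else:
                    mhi[y][x] = max(0, mhi[y][x] - 1)  # older motion decays
    return mhi
```

The resulting single-channel image summarizes where and how recently movement occurred
within the time window, which is what makes it suitable as input to a CNN.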

MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation

Philipp Müller
Michael Dietz
Dominik Schiller
Dominike Thomas
Guanhua Zhang
Patrick Gebhard
Elisabeth André
Andreas Bulling

Artificial mediators are promising to support human group conversations but at present
their abilities are limited by insufficient progress in group behaviour analysis.
The MultiMediate challenge addresses, for the first time, two fundamental group behaviour
analysis tasks in well-defined conditions: eye contact detection and next speaker
prediction. For training and evaluation, MultiMediate makes use of the MPIIGroup Interaction
dataset consisting of 22 three- to four-person discussions as well as of an unpublished
test set of six additional discussions. This paper describes the MultiMediate challenge
and presents the challenge dataset including novel fine-grained speaking annotations
that were collected for the purpose of MultiMediate. Furthermore, we present baseline
approaches and ablation studies for both challenge tasks.

SESSION: Session 34: Summarization, Analytics, and Storytelling

MeshNet++: A Network with a Face

Vinit Veerendraveer Singh
Shivanand Venkanna Sheshappanavar
Chandra Kambhamettu

Polygon meshes are a popular representation in computer graphics. They efficiently
provide delineations of complex 3D shapes. However, their irregular structure hinders
mesh analysis efforts in deep learning frameworks; few neural networks exist to describe
meshes. MeshNet is a pioneer in this direction. In this paper, we propose a novel
neural network that is substantially deeper than its MeshNet predecessor. This increase
in depth is achieved through our specialized convolution and pooling blocks that operate
on mesh faces. Our network named MeshNet++ learns local structures at multiple scales
and is also robust to shortcomings of mesh decimation. We evaluated it on the shape
classification task on various data sets and observed results significantly higher than the
state of the art. In particular, results demonstrated that even a small number of examples
suffice for training MeshNet++. Our code is available at https://github.com/VimsLab/MeshNet2.

Latent Memory-augmented Graph Transformer for Visual Storytelling

Mengshi Qi
Jie Qin
Di Huang
Zhiqiang Shen
Yi Yang
Jiebo Luo

Visual storytelling aims to automatically generate a human-like short story given
an image stream. Most existing works utilize either scene-level or object-level representations,
neglecting the interaction among objects in each image and the sequential dependency
between consecutive images. In this paper, we present a novel Latent Memory-augmented
Graph Transformer (LMGT), a Transformer-based framework for visual story generation.
LMGT directly inherits the merits from the Transformer, which is further enhanced
with two carefully designed components, i.e., a graph encoding module and a latent
memory unit. Specifically, the graph encoding module exploits the semantic relationships
among image regions and attentively aggregates critical visual features based on the
parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and
topic consistency, we introduce an augmented latent memory unit that learns and records
highly summarized latent information as the story line from the image stream and the
sentence history. Experimental results on three widely-used datasets demonstrate the
superior performance of LMGT over the state-of-the-art methods.

TSA-Net: Tube Self-Attention Network for Action Quality Assessment

Shunli Wang
Dingkang Yang
Peng Zhai
Chixiao Chen
Lihua Zhang

In recent years, assessing action quality from videos has attracted growing attention
in the computer vision and human-computer interaction communities. Most existing approaches
usually tackle this problem by directly migrating the model from action recognition
tasks, which ignores the intrinsic differences within the feature map such as foreground
and background information. To address this issue, we propose a Tube Self-Attention
Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce
a single object tracker into AQA and propose the Tube Self-Attention Module (TSA),
which can efficiently generate rich spatio-temporal contextual information by adopting
sparse feature interactions. The TSA module is embedded in existing video networks
to form TSA-Net. Overall, our TSA-Net has the following merits: 1) high computational
efficiency, 2) high flexibility, and 3) state-of-the-art performance. Extensive
experiments are conducted on popular action quality assessment datasets including
AQA-7 and MTL-AQA. Besides, a dataset named Fall Recognition in Figure Skating (FR-FS)
is proposed to explore the basic action assessment in the figure skating scene. Our
TSA-Net achieves the Spearman's Rank Correlation of 0.8476 and 0.9393 on AQA-7 and
MTL-AQA, respectively, which are the new state-of-the-art results. The results on
FR-FS also verify the effectiveness of the TSA-Net. The code and FR-FS dataset are
publicly available at https://github.com/Shunli-Wang/TSA-Net.
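
As a reminder of how the reported metric is computed, the sketch below evaluates Spearman's
rank correlation as the Pearson correlation of rank-transformed scores, with tied values
receiving their average rank. The function names and example scores are hypothetical and
are not drawn from the paper's evaluation code.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rank_correlation(predicted, ground_truth):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(predicted), average_ranks(ground_truth)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)
```

Because only ranks matter, a model is rewarded for ordering performances correctly even
if its predicted scores are on a different scale from the judges' scores.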

Exploring Contextual-Aware Representation and Linguistic-Diverse Expression for Visual
Dialog

Xiangpeng Li
Lianli Gao
Lei Zhao
Jingkuan Song

Visual dialog is a fundamental vision-language task where an AI agent holds a meaningful
dialogue about visual content with humans in nature. However, this task remains challenging,
since there is still no consensus on how to capture the rich visual contextual information
contained in the environment rather than focusing only on visual objects. Furthermore,
conventional methods suffer from the single-answer learning strategy, which accepts only
one correct answer without considering the diverse expressions of the language
(i.e., one identical meaning but multiple expressions via rephrasing or adopting synonyms,
etc.). In this paper, we introduce Contextual-Aware Representation and linguistic-diverse
Expression (CARE), a novel plug-and-play framework with contextual-based graph embedding
and curriculum contrastive learning to solve the above two issues. Specifically, the
contextual-based graph embedding (CGE) module aims to integrate the environmental
context information with visual objects to improve the answer quality. In addition,
we propose a curriculum contrastive learning (CCL) paradigm to imitate the learning
habits of humans when facing a question with multiple correct answers sharing the
same meaning but with diverse expressions. To support CCL, a CCL loss is designed
to progressively strengthen the model's ability in identifying the answers with correct
semantics. Extensive experiments are conducted on two benchmark datasets, and our
proposed method outperforms the state of the art by a considerable margin on VisDial
V1.0 (4.63% NDCG) and VisDial V0.9 (1.27% MRR, 1.74% R@1, 0.87% R@5, 1.28% R@10, 0.26
Mean).
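
The retrieval metrics quoted above can be computed from the rank that a model assigns to
the ground-truth answer in each dialog round. The sketch below assumes 1-based ranks and
takes "Mean" to be the mean rank (lower is better); the function names and sample ranks
are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the ground-truth answer (ranks are 1-based)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of rounds where the ground-truth answer is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_rank(ranks):
    """Average rank of the ground-truth answer; lower is better."""
    return sum(ranks) / len(ranks)
```

For example, with ground-truth ranks [1, 2, 4, 10], MRR is 0.4625, R@1 is 0.25, R@5 is
0.75, R@10 is 1.0, and the mean rank is 4.25.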

Automated Playtesting with a Cognitive Model of Sensorimotor Coordination

Injung Lee
Hyunchul Kim
Byungjoo Lee

Playtesting is widely performed in the game industry to gauge the difficulty of a
game. A large number of test participants with different skills must be recruited
for reliable test results, resulting in high costs. Automated playtesting based on
player simulation is expected to reduce playtesting costs. Still, it has not yet been
widely applied due to the lack of a method that realistically simulates players' gameplays
with different skills. Based on a cognitive model of sensorimotor coordination that
explains the human button input process, we propose a novel automated playtesting
technique that predicts the game difficulty experienced by players with different
skills in moving-target acquisition (MTA) games. The model has free parameters representing
the inherent skills of players. Once the parameters are obtained for a specific population
(e.g., seniors), it is possible to estimate the game difficulty at the population
level in multiple games. We applied the technique to two simple MTA games and showed
that it could predict the relative difference in game difficulties experienced by
players with different skills.

CAA: Candidate-Aware Aggregation for Temporal Action Detection

Yifan Ren
Xing Xu
Fumin Shen
Yazhou Yao
Huimin Lu

Temporal action detection aims to locate specific segments of action instances in
an untrimmed video. Most existing approaches commonly extract the features of all
candidate video segments and then classify them separately. However, they may overlook
the underlying relationships among candidates. In this paper, we propose
a novel model termed Candidate-Aware Aggregation (CAA) to tackle this problem. In
CAA, we design the Global Awareness (GA) module to exploit long-range relations among
all candidates from a global perspective, which enhances the features of action instances.
The GA module is then embedded into a multi-level hierarchical network named FENet,
to aggregate local features in adjacent candidates to suppress background noise. As
a result, the relationship among candidates is explicitly captured from both local
and global perspectives, which ensures more accurate prediction results for the candidates.
Extensive experiments conducted on two popular benchmarks ActivityNet-1.3 and THUMOS-14
demonstrate the superiority of CAA compared to the state-of-the-art methods.
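
Temporal action detection results are commonly scored by the temporal intersection over
union (tIoU) between a candidate segment and a ground-truth action instance. The sketch
below shows that computation; the function name and the (start, end) tuple format in
seconds are assumptions, not part of CAA itself.

```python
def temporal_iou(segment_a, segment_b):
    """Temporal IoU of two (start, end) segments, with end >= start."""
    start_a, end_a = segment_a
    start_b, end_b = segment_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0
```

A candidate spanning 0 to 10 seconds and an action spanning 5 to 15 seconds overlap for
5 seconds out of a 15-second union, giving a tIoU of one third.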

SESSION: Session 35: Vision and Language-I

Disentangle Your Dense Object Detector

Zehui Chen
Chenhongyi Yang
Qiaofei Li
Feng Zhao
Zheng-Jun Zha
Feng Wu

Deep learning-based dense object detectors have achieved great success in the past
few years and have been applied to numerous multimedia applications such as video
understanding. However, the current training pipeline for dense detectors is compromised
by many conjunctions that may not hold. In this paper, we investigate three such
important conjunctions: 1) only samples assigned as positive in the classification head
are used to train the regression head; 2) classification and regression share the
same input feature and computational fields defined by the parallel head architecture;
and 3) samples distributed in different feature pyramid layers are treated equally
when computing the loss. We first carry out a series of pilot experiments to show
disentangling such conjunctions can lead to persistent performance improvement. Then,
based on these findings, we propose Disentangled Dense Object Detector (DDOD), in
which simple and effective disentanglement mechanisms are designed and integrated
into the current state-of-the-art dense object detectors. Extensive experiments on
the MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP
absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra
overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5
AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on
these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.

Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

Lei Hu
Shaoli Huang
Shilei Wang
Wei Liu
Jifeng Ning

There has been an increasing emphasis on building large-scale datasets as the driver
of deep learning-based trackers' success. However, accurately annotating tracking
data is highly labor-intensive and expensive, making it infeasible in real-world applications.
In this study, we investigate the necessity of large-scale training data to ensure
tracking algorithms' performance. To this end, we introduce a FAT (Few-Annotation
Tracking) benchmark constructed by sampling one or a few frames per video from some
existing tracking datasets. The proposed dataset can be used to evaluate the effectiveness
of tracking algorithms considering data efficiency and new data augmentation approaches
for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change),
a data augmentation strategy that enables learning high-performing trackers using
small-scale datasets. AMMC first cuts out the tracked targets and performs a sequence
of transformations to simulate the possible change by object motion. Then the transformed
targets are pasted on the inpainted background images and further conjointly augmented
to mimic variability caused by camera motion. Compared with standard augmentation
methods, AMMC explicitly considers tracking data characteristics, which synthesizes
more valid data for object tracking. We extensively evaluate our approach with two
popular trackers on the FAT datasets. Experiments show that our method allows these
trackers, even when trained on a dataset requiring much less annotation, to achieve comparable
or even better performance than those trained on the full-annotation dataset. The results imply
that complete video annotation might not be necessary for object tracking when
motion-driven data augmentations are leveraged during training.

Video-to-Image Casting: A Flatting Method for Video Analysis

Xu Chen
Chenqiang Gao
Feng Yang
Xiaohan Wang
Yi Yang
Yahong Han

Previous mainstream video analysis methods, especially 3D CNN-based models, mainly
aim to transfer frameworks from the image domain to the video domain, and they follow
the regime that has succeeded in image processing, i.e., large-scale benchmarks
and deep networks. However, processing videos is still time-consuming due to the increased
computational cost. In this paper, we propose to flatten the video and construct a Spatio-temporal
Image (STI), i.e., squeezing the temporal dimension into a spatial plane. To pursue
video-level modeling and an efficient architecture, we devise a Collective Convolution
(CoConv) operation to replace the 2D convolution. With the holistic sampling strategy,
this novel operation can extract the video-level spatio-temporal representation. Moreover,
we ensure that each CoConv operation has the same number of parameters as the original
2D filter, thus we can utilize a 2D network equipped with CoConv to analyze videos
without additional computations. To verify the effectiveness of our method for the
general video analysis, we evaluate it on three typical tasks, i.e., supervised action
recognition, self-supervised action recognition, and dynamic texture recognition.
Extensive experimental results show that our method can achieve comparable or state-of-the-art
performance on these benchmarks while using far less computation than
its 3D counterpart.
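
One simple way to picture squeezing the temporal dimension into a spatial plane is to tile
the frames of a clip into a single grid image, as sketched below. The row-major grid layout,
the zero padding for unused cells, and the function name are assumptions for illustration;
the STI construction and sampling strategy in the paper may differ.

```python
def flatten_video_to_grid(frames, grid_cols):
    """Tile T frames (each a list of H rows of W pixels) into one 2D image.

    Frames are placed left to right, top to bottom; unused cells stay 0.
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    grid_rows = -(-num_frames // grid_cols)  # ceiling division
    image = [[0] * (grid_cols * width) for _ in range(grid_rows * height)]
    for t, frame in enumerate(frames):
        top = (t // grid_cols) * height
        left = (t % grid_cols) * width
        for y in range(height):
            for x in range(width):
                image[top + y][left + x] = frame[y][x]
    return image
```

Once the clip is a single image, an ordinary 2D network can process it, which is the
motivation for replacing 3D convolutions in this line of work.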

Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection

Zhirui Zhao
Changqun Xia
Chenxi Xie
Jia Li

Salient object detection (SOD) has made great progress, but most existing SOD methods
focus more on performance than efficiency. Besides, the U-shape structure has some
drawbacks and there is still a lot of room for improvement. Therefore, we propose
a novel framework to treat semantic context, spatial detail and boundary information
separately in the decoder part. Specifically, we propose an efficient and effective
Complementary Trilateral Decoder (CTD) for saliency detection with three branches:
Semantic Path, Spatial Path and Boundary Path. These three branches are designed to
solve the dilution of semantic information, loss of spatial information and absence
of boundary information, respectively. These three branches are complementary to each
other and we design three distinctive fusion modules to gradually merge them according
to a "coarse-fine-finer" strategy, which significantly improves the region accuracy
and boundary quality. To facilitate the practical application in different environments,
we provide two versions: CTDNet-18 (11.82M, 180FPS) and CTDNet-50 (24.63M, 110FPS).
Experiments show that our model performs better than state-of-the-art approaches on
five benchmarks and achieves a favorable balance between speed and accuracy.

Learning Human Motion Prediction via Stochastic Differential Equations

Kedi Lyu
Zhenguang Liu
Shuang Wu
Haipeng Chen
Xuhong Zhang
Yuyu Yin

Human motion understanding and prediction is an integral aspect in our pursuit of
machine intelligence and human-machine interaction systems. Current methods typically
pursue a kinematics modeling approach, relying heavily upon prior anatomical knowledge
and constraints. However, such an approach is hard to generalize to different skeletal
model representations, and also tends to be inadequate in accounting for the dynamic
range and complexity of motion, thus hindering predictive accuracy. In this work,
we propose a novel approach in modeling the motion prediction problem based on stochastic
differential equations and path integrals. The motion profile of each skeletal joint
is formulated as a basic stochastic variable and modeled with the Langevin equation.
We develop a strategy of employing GANs to simulate path integrals that amounts to
optimizing over possible future paths. We conduct experiments on two large benchmark
datasets, Human 3.6M and CMU MoCap. Our approach achieves a
12.48% accuracy improvement over current state-of-the-art methods on average.
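
For reference, the textbook form of the Langevin equation for a coordinate x(t) is given
below. The symbols (mass m, friction coefficient \gamma, potential U, random force \eta,
Boltzmann constant k_B, temperature T) are generic notation assumed here; the exact
formulation the paper uses for skeletal joints may differ.

```latex
m \ddot{x}(t) = -\gamma \dot{x}(t) - \nabla U\bigl(x(t)\bigr) + \eta(t),
\qquad
\langle \eta(t) \rangle = 0,
\qquad
\langle \eta(t)\, \eta(t') \rangle = 2 \gamma k_B T \, \delta(t - t')
```

The deterministic terms describe damping and a driving potential, while the random force
makes each future trajectory one sample from a distribution over possible paths.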

Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Ning Wang
Guangming Zhu
Liang Zhang
Peiyi Shen
Hongsheng Li
Cong Hua

For a given video-based Human-Object Interaction scene, modeling the spatio-temporal
relationship between humans and objects is an important cue for understanding the contextual
information presented in the video. With efficient spatio-temporal relationship
modeling, it is possible not only to uncover contextual information in each frame,
but to directly capture inter-frame dependencies as well. Capturing the position changes
of humans and objects over the spatio-temporal dimension is more critical when significant
changes in the appearance features may not occur over time. When utilizing appearance
features, the spatial location and the semantic information are also key to improving
the video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal
Interaction Graph Parsing Networks (STIGPN) are constructed, which encode the videos
with a graph composed of human and object nodes. These nodes are connected by two
types of relations: (i) intra-frame relations, which model the interactions between humans
and the objects they interact with within each frame; and (ii) inter-frame relations, which
capture the long-range dependencies between humans and those objects across frames.
With the graph, STIGPN learn spatio-temporal features directly from the whole video-based
Human-Object Interaction scenes. Multi-modal features and a multi-stream fusion strategy
are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction
video datasets, including CAD-120 and Something-Else, are used to evaluate the proposed
architectures, and the state-of-the-art performance demonstrates the superiority of
STIGPN. Code for STIGPN is available at https://github.com/GuangmingZhu/STIGPN.

SESSION: Session 36: Vision and Language-II

Learning Hierarchal Channel Attention for Fine-grained Visual Classification

Xiang Guan
Guoqing Wang
Xing Xu
Yi Bin

Learning delicate feature representation of object parts plays a critical role in
fine-grained visual classification tasks. However, advanced deep convolutional neural
networks trained for general visual classification tasks usually tend to focus on
the coarse-grained information while ignoring the fine-grained one, which is of great
significance for learning discriminative representation. In this work, we explore
the great merit of multi-modal data in introducing semantic knowledge and sequential
analysis techniques in learning hierarchical feature representation for generating
discriminative fine-grained features. To this end, we propose a novel approach, termed
Channel Cusum Attention ResNet (CCA-ResNet), for multi-modal joint learning of fine-grained
representation. Specifically, we use feature-level multi-modal alignment to connect
image and text classification models for joint multi-modal training. Through joint
training, image classification models trained with semantic level labels tend to focus
on the most discriminative parts, which enhances the cognitive ability of the model.
Then, we propose a Channel Cusum Attention (CCA) mechanism to equip feature maps
with hierarchical properties through unsupervised reconstruction of local and global
features. The benefits brought by the CCA are twofold: a) allowing fine-grained
features from early layers to be preserved in the forward propagation of deep networks;
b) leveraging the hierarchical properties to facilitate multi-modal feature alignment.
We conduct extensive experiments to verify that our proposed model can achieve state-of-the-art
performance on a series of fine-grained visual classification benchmarks.

Group-based Distinctive Image Captioning with Memory Attention

Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan

Describing images using natural language is widely known as image captioning, which
has made consistent progress due to the development of computer vision and natural
language generation techniques. Though conventional captioning models achieve high
accuracy based on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions
to distinguish the target image from other similar images is under-explored. To generate
distinctive captions, a few pioneers employ contrastive learning or re-weight the
ground-truth captions, focusing on a single input image. However, the relationships
between objects in a similar image group (e.g., items or properties within the same
album or fine-grained events) are neglected. In this paper, we improve the distinctiveness
of image captions using a Group-based Distinctive Captioning Model (GdisCap), which
compares each image with other images in one similar group and highlights the uniqueness
of each image. In particular, we propose a group-based memory attention (GMA) module,
which stores object features that are unique among the image group (i.e., with low
similarity to objects in other images). These unique object features are highlighted
when generating captions, resulting in more distinctive captions. Furthermore, the
distinctive words in the ground-truth captions are selected to supervise the language
decoder and GMA. Finally, we propose a new evaluation metric, distinctive word rate
(DisWordRate) to measure the distinctiveness of captions. Quantitative results indicate
that the proposed method significantly improves the distinctiveness of several baseline
models, and achieves the state-of-the-art performance on both accuracy and distinctiveness.
Results of a user study agree with the quantitative evaluation and demonstrate the
rationality of the new metric DisWordRate.

VQMG: Hierarchical Vector Quantised and Multi-hops Graph Reasoning for Explicit Representation
Learning

Lei Li
Chun Yuan

Vector Quantized Variational AutoEncoder (VQ-VAE) models realize fast image generation
by encoding and quantizing the raw input in a single-level or hierarchical compressed
latent space. However, the learned representations are not adept at capturing complex
relations, while one usually has to adopt domain-specific autoregressive models
to fit a prior distribution in a two-stage learning procedure. In this work, we propose VQMG,
a novel and unified framework for multi-hops relational reasoning and explicit representation
learning. By introducing Multi-hops Graph Convolution Networks (MGCN), complicated
relations from the hierarchical latent space are effectively captured by the Inner Graph,
while the fitting of the autoregressive prior is performed coherently by the Outer Graph
to promote the performance. Experiments on multimedia tasks including point cloud
segmentation, stroke-level text detection and image generation verify the efficiency
and applicability of our approach.

Structure-aware Mathematical Expression Recognition with Sequence-Level Modeling

Minli Li
Peilin Zhao
Yifan Zhang
Shuaicheng Niu
Qingyao Wu
Mingkui Tan

Mathematical expression recognition (MER) aims to convert an image of mathematical
expressions into a LaTeX sequence. In practice, the task of MER is challenging, since
1) the images of mathematical expressions often contain complex structural relationships,
e.g., fractions, matrices, and subscripts; 2) the generated LaTeX sequences can be
very complex and they have to satisfy strict syntax rules. Existing methods, however,
often ignore the complex dependence among image regions, resulting in poor feature
representation. In addition, they may fail to capture the rigorous relations among
different formula symbols as they consider MER as a common language generation task.
To address these issues, we propose a Structure-Aware Sequence-Level (SASL) model
for MER. First, to better represent and recognize the visual content of formula images,
we propose a structure-aware module to capture the relationship among different symbols.
Meanwhile, the sequence-level modeling helps the model to concentrate on the generation
of entire sequences. To make the problem feasible, we cast the generation problem
into a Markov decision process (MDP) and seek to learn a LaTeX sequence generating
policy. Based on MDP, we learn SASL by maximizing the matching score of each image-sequence
pair to obtain the generation policy. Extensive experiments on the IM2LATEX-100K dataset
verify the effectiveness and superiority of the proposed method.

Exploring Logical Reasoning for Referring Expression Comprehension

Ying Cheng
Ruize Wang
Jiashuo Yu
Rui-Wei Zhao
Yuejie Zhang
Rui Feng

Referring expression comprehension aims to localize the target object in an image
referred by a natural language expression. Most existing approaches neglect the implicit
logical correlations among fine-grained cues, e.g., categories, attributes, which
are beneficial for distinguishing objects. In this paper, we propose a logic-guided
approach to explore logical knowledge for referring expression comprehension in a
hierarchical modular-based framework. Specifically, we propose to extract fine-grained
cues in visual and textual domains and perform logical reasoning over them with explicit
logical expressions to regularize the matching process without extra parameters. Besides,
we propose to improve existing modular-based methods by introducing context information
of objects in the relationship module. Extensive experiments are conducted on three
referring expression datasets, and the results demonstrate that our model can produce
more consistent predictions and further achieve superior performance compared with
previous methods.

Direction Relation Transformer for Image Captioning

Zeliang Song
Xiaofei Zhou
Linhua Dong
Jianlong Tan
Li Guo

Image captioning is a challenging task that combines computer vision and natural language
processing for generating a textual description of the content within an image. Recently,
Transformer-based encoder-decoder architectures have shown great success in image
captioning, where the multi-head attention mechanism is utilized to capture the contextual
interactions between object regions. However, such methods regard region features
as a bag of tokens without considering the directional relationships between them,
making it hard to understand the relative position between objects in the image and
generate correct captions effectively. In this paper, we propose a novel Direction
Relation Transformer to improve the orientation perception between visual features
by incorporating the relative direction embedding into multi-head attention, termed
DRT. We first generate the relative direction matrix according to the positional information
of the object regions, and then explore three forms of direction-aware multi-head
attention to integrate the direction embedding into the Transformer architecture. We conduct
experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative
and qualitative results demonstrate that, by integrating the relative directional
relation, our proposed approach achieves significant improvements over all evaluation
metrics compared with the baseline model, e.g., DRT improves the task-specific metric CIDEr
score from 129.7% to 133.2% on the offline "Karpathy" test split.
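
A relative direction matrix built from the positions of object regions could look something like the sketch below. It is a hedged illustration only: the abstract does not say how directions are discretised, so the eight angular bins, the use of box centres, the extra "no direction" index, and the function name are assumptions made for this example.

```python
import math


def relative_direction_matrix(boxes, num_bins=8):
    """Quantise the direction from every box centre to every other box centre.

    boxes: list of (x1, y1, x2, y2) region boxes in image coordinates.
    Returns an n x n matrix of integer direction indices in [0, num_bins).
    Bin 0 is centred on the +x axis; with image coordinates (y grows downward)
    bin num_bins // 4 points down. The index num_bins means "no direction"
    (the diagonal, or two boxes sharing a centre). All of this is an assumed
    discretisation, not the one used by DRT.
    """
    centres = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    n = len(centres)
    full_turn = 2 * math.pi
    width = full_turn / num_bins
    matrix = [[num_bins] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(centres):
        for j, (xj, yj) in enumerate(centres):
            dx, dy = xj - xi, yj - yi
            if dx == 0 and dy == 0:
                continue
            angle = math.atan2(dy, dx) % full_turn
            # Shift by half a bin so that each bin is centred on its direction.
            matrix[i][j] = int(((angle + width / 2) % full_turn) // width) % num_bins
    return matrix
```

Each index could then be looked up in a learned embedding table and added to the attention logits or to the keys and values, which is one plausible reading of "direction-aware multi-head attention".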

SESSION: Session 37: Vision and Language-III

Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation

Tao Jin
Zhou Zhao

Sign language translation aims at directly translating a sign language video into
a natural sentence. The majority of existing methods take the video-sentence pairs
labeled by multiple specific signers as training and testing samples. However, such a
setting does not fit real-world applications. A practicable sign language
translation system is supposed to provide accurate translation results for unseen
signers. In this paper, we mainly attack the signer-independent setting and focus
on augmenting the generalization ability of the translation model. To adapt to the challenging
setting, we propose a novel framework called contrastive disentangled meta-learning
(CDM), which develops several improvements in both deep architecture and training
mode. Specifically, based on the minimax entropy objective, a disentangled module
with adaptive gated units is developed to decouple the signer-specific and task-specific
representation in the encoder. Besides, we facilitate the frame-word alignments by
leveraging contrastive constraints between the obtained task-specific representation
and the decoding output. The disentangled and contrastive modules could provide complementary
information for each other. As for the training mode, we encourage the model to perform
well in the simulated signer-independent scenarios by finding the generalized learning
directions in the meta-learning process. Considering that vanilla meta-learning methods
utilize the multiple specific signers insufficiently, we adopt a fine-grained learning
strategy that simultaneously conducts meta-learning in a variety of domain shift scenarios
in each iteration. Extensive experiments on the benchmark dataset RWTH-PHOENIX-Weather-2014T (PHOENIX14T)
show that CDM could achieve competitive results compared with the state-of-the-art
methods.

Scene Graph with 3D Information for Change Captioning

Zeming Liao
Qingbao Huang
Yu Liang
Mingyi Fu
Yi Cai
Qing Li

Change captioning aims to describe the differences in image pairs with natural language.
It is an interesting but under-explored task with two main challenges: describing the
relative position relationship between objects correctly and overcoming the disturbances
from viewpoint changes. To address these issues, we propose a three-dimensional (3D)
information aware Scene Graph based Change Captioning (SGCC) model. We extract the
semantic attributes of objects and the 3D information of images (i.e., depths of objects,
relative two-dimensional image plane distances, and relative angles between objects)
to construct the scene graphs for image pairs, then aggregate the node representations
with a graph convolutional network. Owing to the relative position relationships between
objects and the scene graphs, our model can help observers
locate the changed objects quickly and is, to some extent, robust to viewpoint
changes. Extensive experiments show that our SGCC model achieves competitive performance
with the state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, thus
verifying the effectiveness of our proposed model. Codes are available at https://github.com/VISLANG-Lab/SGCC.

Progressive Semantic Matching for Video-Text Retrieval

Hongying Liu
Ruyi Luo
Fanhua Shang
Mantang Niu
Yuanyuan Liu

Cross-modal retrieval between texts and videos is important yet challenging. Until
recently, works in this domain typically relied on learning a common space
to match the text and video, but matching is difficult due to the semantic gap
between videos and texts. Although some methods employ coarse-to-fine or multi-expert
networks to encode one or more common spaces for easier matching, they almost always directly
optimize one matching space, which is challenging because of the huge semantic gap
between different modalities. To address this issue, we aim at narrowing the semantic
gap by a progressive learning process with a coarse-to-fine architecture, and propose
a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel
encoding network for videos and texts, and design some auxiliary common spaces, which
are mapped by the outputs of encoders in different levels. Then all the common spaces
are jointly trained end to end. In this way, the model can effectively encode videos
and texts into a fusion common space by a progressive paradigm. Experimental results
on three video-text datasets (i.e., MSR-VTT, TGIF and MSVD) demonstrate the advantages
of our PSM, which achieves significant performance improvement compared with state-of-the-art
approaches.

Multimodal Asymmetric Dual Learning for Unsupervised Eyeglasses Removal

Qing Lin
Bo Yan
Weimin Tan

Glasses removal is a challenging task due to the diversity of glasses types and
the difficulty of obtaining paired datasets. Most existing methods need to build different
models for different glasses or require expensive paired datasets for supervised training,
and thus lack universality. In this paper, we propose a multimodal asymmetric dual learning
method for unsupervised glasses removal. This method uses large-scale face images
with and without glasses for dual feature learning, which does not require intensive
manual marking of the glasses. Given a face image with glasses, we aim to generate
a glasses-free image preserving the person identity. Thus, in order to make up for
the lack of semantic features in the glasses region, we introduce the text description
of the target image into the task, and propose a text-guided multimodal feature fusion
method. We adaptively select the glasses-free image closest to the target one for
better dual feature learning. We also propose an exchange residual loss to generate
a more precise mask of the glasses. Extensive experiments show that our method can generate
realistic glasses-free images and better retain the person identity, which can be useful
for face recognition.

Neighbor-view Enhanced Model for Vision and Language Navigation

Dong An
Yuankai Qi
Yan Huang
Qi Wu
Liang Wang
Tieniu Tan

Vision and Language Navigation (VLN) requires an agent to navigate to a target location
by following natural language instructions. Most existing works represent a navigation
candidate by the feature of the single view in which the candidate
lies. However, an instruction may mention landmarks outside that single view as references,
which might lead to failures of textual-visual matching of existing methods. In this
work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively
incorporate visual contexts from neighbor views for better textual-visual matching.
Specifically, our NvEM utilizes a subject module and a reference module to collect
contexts from neighbor views. The subject module fuses neighbor views at a global
level, and the reference module fuses neighbor objects at a local level. Subjects
and references are adaptively determined via attention mechanisms. Our model also
includes an action module to utilize the strong orientation guidance (e.g., "turn
left") in instructions. Each module predicts the navigation action separately and their
weighted sum is used for predicting the final action. Extensive experimental results
demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks
against several state-of-the-art navigators, and NvEM even beats some pre-training
ones. Our code is available at https://github.com/MarSaKi/NvEM.
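
The final step, in which each module scores the navigation candidates and a weighted sum gives the final prediction, can be illustrated with a small sketch. The module names, the fixed weights, and the softmax at the end are assumptions for illustration; the abstract does not specify how NvEM computes the weights.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_action_logits(module_logits, module_weights):
    """Combine per-module candidate scores by a weighted sum.

    module_logits:  dict mapping a module name to one logit per candidate.
    module_weights: dict mapping the same names to scalar weights
                    (assumed given here; they could also be predicted).
    Returns (fused_logits, action_probabilities).
    """
    names = list(module_logits)
    num_candidates = len(module_logits[names[0]])
    fused = [
        sum(module_weights[name] * module_logits[name][k] for name in names)
        for k in range(num_candidates)
    ]
    return fused, softmax(fused)
```

For example, with hypothetical subject, reference and action modules weighted 0.5, 0.3 and 0.2, the candidate with the largest fused logit would be chosen as the next action.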

Multi-Perspective Video Captioning

Yi Bin
Xindi Shang
Bo Peng
Yujuan Ding
Tat-Seng Chua

This work targets the problems of comprehensive video captioning and the generation
of multiple descriptions from different perspectives, termed Multi-Perspective Video
Captioning. We build and release a dataset named VidOR-MPVC, the first dataset for
multi-perspective video captioning, where each video is annotated with multiple descriptions
from different perspectives. We also propose a novel model, dubbed perspective-aware
captioner (PAC), which is capable of mining the various perspectives in a video and
generating a description from each perspective. More specifically, a perspective generator
is designed to perceive video content with perspective preferences, followed by
a language generator equipped with a perspective-aware attention mechanism. As our new
task expects to produce multiple descriptions for a video, existing evaluation metrics
fail to handle this situation. To address this problem, we devise the maximum
matching scores based on existing metrics for an overall evaluation which aims to
cover the aspects of semantic similarity, completeness and compactness. The experimental
results demonstrate that our model is able to describe videos with multiple descriptions
from different perspectives.
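
One way to read "maximum matching scores based on existing metrics" is as a best one-to-one assignment between generated and reference descriptions. The sketch below follows that reading and should be taken as an assumption, not the paper's definition: the token-level F1 stands in for an existing captioning metric, the brute-force search only suits small caption sets, and dividing by the larger set size is a hypothetical way to penalise missing or redundant descriptions.

```python
from itertools import permutations


def token_f1(candidate, reference):
    """Token-overlap F1, used here only as a stand-in for a captioning metric."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    common = sum(min(cand.count(tok), ref.count(tok)) for tok in set(cand))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_matching_score(candidates, references, metric=token_f1):
    """Best one-to-one matching between candidates and references.

    Every caption is matched at most once; the summed score of the best
    matching is divided by max(len(candidates), len(references)), so that
    unmatched captions on either side lower the result (an assumption).
    """
    if not candidates or not references:
        return 0.0
    n, m = len(candidates), len(references)
    scores = [[metric(c, r) for r in references] for c in candidates]
    best = 0.0
    if n <= m:
        for perm in permutations(range(m), n):
            best = max(best, sum(scores[i][perm[i]] for i in range(n)))
    else:
        for perm in permutations(range(n), m):
            best = max(best, sum(scores[perm[j]][j] for j in range(m)))
    return best / max(n, m)
```

For larger sets the exhaustive search could be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.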

SESSION: Poster Session 6

Pairwise VLAD Interaction Network for Video Question Answering

Hui Wang
Dan Guo
Xian-Sheng Hua
Meng Wang

Video Question Answering (VideoQA) is a challenging problem, as it requires a joint
understanding of video and natural language questions. Existing methods that perform correlation
learning between video and question have achieved great success. However, previous
methods merely model relations between individual video frames (or clips) and words,
which are not enough to correctly answer the question. From a human's perspective, answering
a video question should first summarize both visual and language information, and
then explore their correlations for answer reasoning. In this paper, we propose a
new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem.
Specifically, we develop a learnable clustering-based VLAD encoder to respectively
summarize video and question modalities into a small number of compact VLAD descriptors.
For correlation learning, a pairwise VLAD interaction mechanism is proposed to better
exploit complementary information for each pair of modality descriptors, avoiding
modeling uninformative individual relations (e.g., frame-word and clip-word relations),
and exploring both inter- and intra-modality relations simultaneously. Experimental
results show that our approach achieves state-of-the-art performance on three VideoQA
datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
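
To make the VLAD summarisation step concrete, here is a minimal sketch of classical VLAD encoding followed by a simple pairwise comparison of descriptors. It differs from what the abstract describes in ways that should be kept in mind: the cluster centres are fixed rather than learned, assignment is hard rather than soft, and the dot-product interaction is only an assumed placeholder for the pairwise interaction mechanism.

```python
import math


def vlad_encode(features, centres):
    """Summarise a set of feature vectors into one descriptor per centre.

    Each feature is assigned to its nearest centre (hard assignment), the
    residuals (feature - centre) are summed per centre, and each sum is
    L2-normalised. A centre with no residual mass keeps a zero descriptor.
    """
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    for feat in features:
        nearest = min(
            range(len(centres)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(feat, centres[k])),
        )
        for d in range(dim):
            sums[nearest][d] += feat[d] - centres[nearest][d]
    descriptors = []
    for vec in sums:
        norm = math.sqrt(sum(v * v for v in vec))
        descriptors.append([v / norm for v in vec] if norm > 0 else list(vec))
    return descriptors


def pairwise_interactions(video_descriptors, question_descriptors):
    """Dot product between every video descriptor and every question descriptor."""
    return [
        [sum(a * b for a, b in zip(v, q)) for q in question_descriptors]
        for v in video_descriptors
    ]
```

The point of the summarisation is that the interaction matrix has only (number of video descriptors) x (number of question descriptors) entries, instead of one entry per frame-word pair.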

Attention-guided Temporally Coherent Video Object Matting

Yunke Zhang
Chi Wang
Miaomiao Cui
Peiran Ren
Xuansong Xie
Xian-Sheng Hua
Hujun Bao
Qixing Huang
Weiwei Xu

This paper proposes a novel deep learning-based video object matting method that can
achieve temporally coherent matting results. Its key component is an attention-based
temporal aggregation module that maximizes image matting networks' strength for video
matting networks. This module computes temporal correlations for pixels adjacent to
each other along the time axis in feature space, which is robust against motion noises.
We also design a novel loss term to train the attention weights, which drastically
boosts the video matting performance. Besides, we show how to effectively solve the
trimap generation problem by fine-tuning a state-of-the-art video object segmentation
network with a sparse set of user-annotated keyframes. To facilitate video matting
and trimap generation networks' training, we construct a large-scale video matting
dataset with 80 training and 28 validation foreground video clips with ground-truth
alpha mattes. Experimental results show that our method can generate high-quality
alpha mattes for various videos featuring appearance change, occlusion, and fast motion.
Our code and dataset can be found at: https://github.com/yunkezhang/TCVOM

Disentangling Hate in Online Memes

Roy Ka-Wei Lee
Rui Cao
Ziqing Fan
Jing Jiang
Wen-Haw Chong

Hateful and offensive content detection has been extensively explored in a single
modality such as text. However, such toxic information could also be communicated
via multimodal content such as online memes. Therefore, detecting multimodal hateful
content has recently garnered much attention in academic and industry research communities.
This paper aims to contribute to this emerging research topic by proposing DisMultiHate,
which is a novel framework that performs the classification of multimodal hateful
content. Specifically, DisMultiHate is designed to disentangle target entities in
multimodal memes to improve the hateful content classification and explainability.
We conduct extensive experiments on two publicly available hateful and offensive memes
datasets. Our experiment results show that DisMultiHate is able to outperform state-of-the-art
unimodal and multimodal baselines in the hateful meme classification task. Empirical
case studies were also conducted to demonstrate DisMultiHate's ability to disentangle
target entities in memes and ultimately showcase DisMultiHate's explainability of
the multimodal hateful content classification task.

Robust Real-World Image Super-Resolution against Adversarial Attacks

Jiutao Yue
Haofeng Li
Pengxu Wei
Guanbin Li
Liang Lin

Recently deep neural networks (DNNs) have achieved significant success in real-world
image super-resolution (SR). However, adversarial image samples with quasi-imperceptible
noises could threaten deep learning SR models. In this paper, we propose a robust
deep learning framework for real-world SR that randomly erases potential adversarial
noises in the frequency domain of input images or features. The rationale is that
on the SR task clean images or features have a different pattern from the attacked
ones in the frequency domain. Observing that existing adversarial attacks usually
add high-frequency noises to input images, we introduce a novel random frequency mask
module that blocks out high-frequency components possibly containing the harmful perturbations
in a stochastic manner. Since the frequency masking may not only destroy the adversarial
perturbations but also affect the sharp details in a clean image, we further develop
an adversarial sample classifier based on the frequency domain of images to determine
whether to apply the proposed mask module. Based on the above ideas, we devise a novel
real-world image SR framework that combines the proposed frequency mask modules and
the proposed adversarial classifier with an existing super-resolution backbone network.
Experiments show that our proposed method is less sensitive to adversarial attacks
and presents more stable SR results than existing models and defenses.

Towards Robust Deep Hiding Under Non-Differentiable Distortions for Practical Blind
Watermarking

Chaoning Zhang
Adil Karjauv
Philipp Benz
In So Kweon

Data hiding is one widely used approach for proving ownership through blind watermarking.
Deep learning has been widely used in data hiding, for which inserting an attack simulation
layer (ASL) after the watermarked image has been widely recognized as the most effective
approach for improving the pipeline robustness against distortions. Despite its wide
usage, the gain of enhanced robustness from ASL is usually interpreted through the
lens of augmentation, while our work explores this gain from a new perspective by
disentangling the forward and backward propagation of such ASL. We find that the main
influential component is forward propagation instead of backward propagation. This
observation motivates us to use forward ASL to make the pipeline compatible with non-differentiable
and/or black-box distortion, such as lossy (JPEG) compression and Photoshop effects.
Extensive experiments demonstrate the efficacy of our simple approach.

Bottom-Up and Bidirectional Alignment for Referring Expression Comprehension

Liuwu Li
Yuqi Bu
Yi Cai

In this paper, we propose a one-stage approach to improve referring expression comprehension
(REC) which aims at grounding the referent according to a natural language expression.
We observe that humans understand referring expressions through a fine-to-coarse bottom-up
way, and bidirectionally obtain vision-language information between image and text.
Inspired by this, we define the language granularity and the vision granularity. However,
existing methods do not follow this way of human understanding of referring
expressions. Motivated by our observation and to address the limitations of existing
methods, we propose a bottom-up and bidirectional alignment (BBA) framework. Our method
constructs the cross-modal alignment starting from fine-grained representation to
coarse-grained representation and bidirectionally obtains vision-language information
between image and text. Based on the structure of BBA, we further propose a progressive
visual attribute decomposing approach to decompose visual proposals into several independent
spaces to enhance the bottom-up alignment framework. Experiments on five benchmark
datasets of RefCOCO, RefCOCO+, ReferItGame, RefCOCOg and Flickr30K show that our approach
obtains +2.16%, +4.47%, +2.85%, +3.44%, and +2.91% improvements over the one-stage
SOTA approaches, which validates the effectiveness of our approach.

SalS-GAN: Spatially-Adaptive Latent Space in StyleGAN for Real Image Embedding

Lingyun Zhang
Xiuxiu Bai
Yao Gao

Many GAN inversion methods have emerged to embed a given real image into the latent
space of GAN for real image editing. These methods usually use a latent space composed
of a series of one-dimensional vectors as an optimization space to reconstruct real
images, such as the W+ latent space. However, the images reconstructed by these methods
usually fail to maintain the rich detailed information in the real image.
How to better preserve details in the real image is still a challenge. To solve this
problem, we propose a spatially-adaptive latent space, called SA latent space, and
adopt it as the optimization latent space in GAN inversion task. In particular, we
use the affine transformation parameters of each convolutional layer in the generator
to form the SA latent space and change affine transformation parameters from a one-dimensional
vector to a spatially-adaptive three-dimensional tensor. With the more expressive
latent space, we can better reconstruct the details of the real image. Extensive experiments
suggest that the image reconstruction quality can be significantly improved while
maintaining the semantic disentanglement ability of latent code. The code is available
at https://github.com/zhang-lingyun/SalS-GAN.

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Xuri Ge
Fuhai Chen
Joemon M. Jose
Zhilong Ji
Zhongqin Wu
Xiao Liu

The current state-of-the-art image-sentence retrieval methods implicitly align the
visual-textual fragments, like regions in images and words in sentences, and adopt
attention modules to highlight the relevance of cross-modal semantic correspondences.
However, the retrieval performance remains unsatisfactory due to a lack of consistent
representation in both semantics and structural spaces. In this work, we propose to
address the above issue from two aspects: (i) constructing intrinsic structure (along
with relations) among the fragments of respective modalities, e.g., "dog → play →
ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural
and semantic correspondence between the visual and textual modalities.

In this paper, we propose a novel Structured Multi-modal Feature Embedding and Alignment
(SMFEA) model for image-sentence retrieval. In order to jointly and explicitly learn
the visual-textual embedding and the cross-modal alignment, SMFEA creates a novel
multi-modal structured module with a shared context-aware referral tree. In particular,
the relations of the visual and textual fragments are modeled by constructing Visual
Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured
Tree encoder (TCS-Tree) with shared labels, from which visual and textual features
can be jointly learned and optimized. We utilize the multi-modal tree structure to
explicitly align the heterogeneous image-sentence data by maximizing the semantic
and structural similarity between corresponding inter-modal tree nodes. Extensive
experiments on Microsoft COCO and Flickr30K benchmarks demonstrate the superiority
of the proposed model in comparison to the state-of-the-art methods.

Keyframe Extraction from Motion Capture Sequences with Graph based Deep Reinforcement
Learning

Clinton Mo
Kun Hu
Shaohui Mei
Zebin Chen
Zhiyong Wang

Animation production workflows centred around motion capture techniques often require
animators to edit the motion for various artistic and technical reasons. This process
generally uses a set of keyframes. Unsupervised keyframe selection methods for motion
capture sequences are in high demand to reduce the laborious annotations. However,
most existing methods are optimization-based, which limits flexibility
and efficiency and eventually constrains the interactions and controls with animators.
To address these limitations, we propose a novel graph based deep reinforcement learning
method for efficient unsupervised keyframe selection. First, a reward function is
devised in terms of reconstruction difference by comparing the original sequence and
the interpolated sequence produced by the keyframes. The reward complies with the
requirements of the animation pipeline satisfying: 1) incremental reward to evaluate
the interpolated keyframes immediately; 2) order insensitivity for consistent evaluation;
and 3) non-diminishing return for comparable rewards between optimal and sub-optimal
solutions. Then by representing each skeleton frame as a graph, a graph-based deep
agent is guided to heuristically select keyframes to maximize the reward. During the
inference it is no longer necessary to estimate the reconstruction difference, and
the evaluation time can be reduced significantly. The experimental results on the
CMU Mocap dataset demonstrate that our proposed method is able to select keyframes
at a high efficiency without clearly compromising the quality in comparison with the
state-of-the-art methods.
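
A reward defined through the reconstruction difference between the original sequence and the sequence interpolated from the keyframes might be sketched as below. The linear interpolation, the mean squared error, and the decision to always keep the first and last frames are assumptions made for this illustration; the sketch covers only the incremental aspect of the reward and, because keyframes are treated as a set, is insensitive to the order in which they were chosen.

```python
def interpolate(sequence, keyframes):
    """Rebuild a sequence by linear interpolation between keyframes.

    sequence:  list of pose vectors (one list of floats per frame).
    keyframes: iterable of frame indices; the first and last frames are
               always added so that every frame lies between two keys.
    """
    if len(sequence) < 2:
        return [list(frame) for frame in sequence]
    keys = sorted(set(keyframes) | {0, len(sequence) - 1})
    rebuilt = [None] * len(sequence)
    for start, end in zip(keys, keys[1:]):
        for t in range(start, end + 1):
            w = (t - start) / (end - start)
            rebuilt[t] = [
                (1 - w) * a + w * b for a, b in zip(sequence[start], sequence[end])
            ]
    return rebuilt


def reconstruction_error(sequence, keyframes):
    """Mean over frames of the squared distance to the interpolated sequence."""
    rebuilt = interpolate(sequence, keyframes)
    total = sum(
        sum((a - b) ** 2 for a, b in zip(frame, approx))
        for frame, approx in zip(sequence, rebuilt)
    )
    return total / len(sequence)


def incremental_reward(sequence, keyframes, new_keyframe):
    """Reduction in reconstruction error obtained by adding one keyframe."""
    before = reconstruction_error(sequence, keyframes)
    after = reconstruction_error(sequence, list(keyframes) + [new_keyframe])
    return before - after
```

An agent trained with such a reward would need the reconstruction error only during training; at inference it would simply pick frames, which is consistent with the reduced evaluation time mentioned in the abstract.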

Dense Contrastive Visual-Linguistic Pretraining

Lei Shi
Kai Shuang
Shijie Geng
Peng Gao
Zuohui Fu
Gerard de Melo
Yunpeng Chen
Sen Su

Inspired by the success of BERT, several multimodal representation learning approaches
have been proposed that jointly represent image and text. These approaches achieve
superior performance by capturing high-level semantic information from large-scale
multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature
regression and label classification as pretext tasks. However, they tend to suffer
from the problems of noisy labels and sparse semantic annotations, because the visual
features were pretrained on a crowdsourced dataset with limited and inconsistent
semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive
Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification
with cross-modality region contrastive learning that requires no annotations. Two
data augmentation strategies (Mask Perturbation and Intra-Inter-Adversarial Perturbation)
are developed to improve the quality of negative samples used in contrastive learning.
Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised
setting independent of any object annotations. We compare our method against prior
visual-linguistic pretraining frameworks to validate the superiority of dense contrastive
learning on multimodal representation learning.
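
For readers unfamiliar with contrastive objectives, the following sketch shows a generic InfoNCE-style loss for one query region against one positive and several negatives. It is not claimed to be the DCVLP loss: the abstract names neither the loss form nor the temperature, so both are assumptions here, and the perturbation strategies that generate the negatives are not modelled.

```python
import math


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss: negative log-probability of picking the positive.

    query, positive: feature vectors (lists of floats), ideally L2-normalised.
    negatives:       list of feature vectors treated as negative samples.
    temperature:     softmax temperature (0.07 is a common default, assumed).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]
```

The loss is small when the query is much closer to its positive than to any negative, and grows as negatives become harder to tell apart, which is why the quality of negative samples matters.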

Hybrid Reasoning Network for Video-based Commonsense Captioning

Weijiang Yu
Jian Liang
Lei Ji
Lu Li
Yuejian Fang
Nong Xiao
Nan Duan

The task of video-based commonsense captioning aims to generate event-wise captions
and meanwhile provide multiple commonsense descriptions (e.g., attribute, effect and
intention) about the underlying event in the video. Prior works explore the commonsense
captions by using separate networks for different commonsense types, which is time-consuming
and fails to exploit the interactions among different commonsense types. In this paper, we propose
a Hybrid Reasoning Network (HybridNet) to endow the neural networks with the capability
of semantic-level reasoning and word-level reasoning. Firstly, we develop multi-commonsense
learning for semantic-level reasoning by jointly training different commonsense types
in a unified network, which encourages the interaction between the clues of multiple
commonsense descriptions, event-wise captions and videos. Then, there are two steps
to achieve the word-level reasoning: (1) a memory module records the history predicted
sequence from the previous generation processes; (2) a memory-routed multi-head attention
(MMHA) module updates the word-level attention maps by incorporating the history information
from the memory module into the transformer decoder for word-level reasoning. Moreover,
the multimodal features are used to make full use of diverse knowledge for commonsense
reasoning. Experiments and abundant analysis on the large-scale Video-to-Commonsense
benchmark show that our HybridNet achieves state-of-the-art performance compared with
other methods.

Learning Regularizer for Monocular Depth Estimation with Adversarial Guidance

Guibao Shen
Yingkui Zhang
Jialu Li
Mingqiang Wei
Qiong Wang
Guangyong Chen
Pheng-Ann Heng

Monocular Depth Estimation (MDE) is a fundamental task in computer vision and multimedia.
With the wide applications of deep Convolutional Neural Networks (CNNs), learning-based
methods have achieved superior performance on MDE tasks in recent years. Because loss
functions are important to train an accurate CNN with good generalization performance,
nearly all previous efforts contribute to proposing powerful loss functions with careful
hand-crafted regularizers (e.g., gradient loss and normal loss) added to the basic
depth L1-Loss. However, the hand-crafted regularizers require rich domain knowledge,
while their performance can still not be guaranteed. In this paper, we learn a new
regularizer, approximated by a tiny CNN Regularizer-Net (RN), and train it in an adversarial
way. As demonstrated experimentally, our learned regularizer can notably outperform
the current state-of-the-art methods by both quantitative evaluation and qualitative
visualization on the benchmark NYU-Depth-v2 dataset, and generalizes well to the new
ScanNet dataset without any further training. Our code will be released soon.

Pixel-wise Graph Attention Networks for Person Re-identification

Wenyu Zhang
Qing Ding
Jian Hu
Yi Ma
Mingzhe Lu

Graph convolutional networks (GCNs) are widely used to handle irregular data since they
update node features by using the structure information of the graph. With the help of
iterated GCN, high-order information can be obtained to further enhance the representation
of nodes. However, how to apply GCN to structured data (such as pictures) has not
been deeply studied. In this paper, we explore the application of graph attention
networks (GAT) in image feature extraction. First of all, we propose a novel graph
generation algorithm to convert images into graphs through matrix transformation.
It is one order of magnitude faster than the algorithm based on K Nearest Neighbors (KNN).
Then, GAT is used on the generated graph to update the node features. Thus, a more
robust representation is obtained. These two steps are combined into a module called
pixel-wise graph attention module (PGA). Since the graph obtained by our graph generation
algorithm can still be transformed into a picture after processing, PGA can be well
combined with CNNs. Based on these two modules, and taking ResNet as a reference, we design
a pixel-wise graph attention network (PGANet). The PGANet is applied to the task of
person re-identification in the datasets Market1501, DukeMTMC-reID and Occluded-DukeMTMC
(outperforms state-of-the-art by 0.8%, 1.1% and 11% respectively, in mAP scores).
Experimental results show that it achieves state-of-the-art performance.

Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting

Xiaomeng Chu
Jiajun Deng
Yao Li
Zhenxun Yuan
Yanyong Zhang
Jianmin Ji
Yu Zhang

As cameras are increasingly deployed in new application domains such as autonomous
driving, performing 3D object detection on monocular images becomes an important task
for visual scene understanding. Recent advances on monocular 3D object detection mainly
rely on the "pseudo-LiDAR" generation, which performs monocular depth estimation
and lifts the 2D pixels to pseudo 3D points. However, depth estimation from monocular
images, due to its poor accuracy, leads to inevitable position shift of pseudo-LiDAR
points within the object. Therefore, the predicted bounding boxes may suffer from
inaccurate location and deformed shape. In this paper, we present a novel neighbor-voting
method that incorporates neighbor predictions to ameliorate object detection from
severely deformed pseudo-LiDAR point clouds. Specifically, each feature point around
the object forms its own prediction, and then the "consensus" is achieved through
voting. In this way, we can effectively combine the neighbors' predictions with local
prediction and achieve more accurate 3D detection. To further enlarge the difference
between the foreground region of interest (ROI) pseudo-LiDAR points and the background
points, we also encode the ROI prediction scores of 2D foreground pixels into the
corresponding pseudo-LiDAR points. We conduct extensive experiments on the KITTI benchmark
to validate the merits of our proposed method. Our results on the bird's eye view
detection outperform the state-of-the-art performance, especially for the "hard" level
detection. The code is available at https://github.com/cxmomo/Neighbor-Vote.

Remember and Reuse: Cross-Task Blind Image Quality Assessment via Relevance-aware Incremental Learning

Rui Ma
Hanxiao Luo
Qingbo Wu
King Ngi Ngan
Hongliang Li
Fanman Meng
Linfeng Xu

Existing blind image quality assessment (BIQA) methods have made great progress in
various task-specific applications, including the synthetic, authentic, or over-enhanced
distortion evaluations. However, limited by the static model and once-for-all learning
strategy, they failed to perform the cross-task evaluations in many practical applications,
where diverse evaluation criteria and distortion types are constantly emerging. To
address this issue, in this paper, we propose a dynamic Remember and Reuse (R&R) network,
which efficiently performs the cross-task BIQA based on a novel relevance-aware incremental
learning strategy. Given multiple evaluation tasks across different distortion types
or databases, our R&R network sequentially updates the parameters for every task one
by one. After each update step, part of task-specific parameters is settled, which
ensures R&R Remembers their dedicated evaluation preferences. The remaining parameters
are pruned for the dynamic usage of the subsequent tasks. To further exploit the correlation
between different tasks, we feed the training data of a new task to previously settled
parameters. Better prediction accuracy is considered as higher task relevance and
vice versa. Then, we selectively Reuse parts of previously settled parameters, whose
proportion is adaptively determined by the task relevance. Extensive experiments show
that the proposed method efficiently achieves the cross-task BIQA without catastrophic
forgetting, and significantly outperforms many state-of-the-art methods. Code is available
at https://github.com/maruiperfect/R-R-Net.

MSO: Multi-Feature Space Joint Optimization Network for RGB-Infrared Person Re-Identification

Yajun Gao
Tengfei Liang
Yi Jin
Xiaoyan Gu
Wu Liu
Yidong Li
Congyan Lang

The RGB-infrared cross-modality person re-identification (ReID) task aims to recognize
the images of the same identity between the visible modality and the infrared modality.
Existing methods mainly use a two-stream architecture to eliminate the discrepancy
between the two modalities in the final common feature space, which ignores the single
space of each modality in the shallow layers. To address this, in this paper, we present
a novel multi-feature space joint optimization (MSO) network, which can learn modality-sharable
features in both the single-modality space and the common space. Firstly, based on
the observation that edge information is modality-invariant, we propose an edge features
enhancement module to enhance the modality-sharable features in each single-modality
space. Specifically, we design a perceptual edge features (PEF) loss after the edge
fusion strategy analysis. To our knowledge, this is the first work that
proposes explicit optimization in the single-modality feature space on cross-modality
ReID task. Moreover, to increase the difference between cross-modality distance and
class distance, we introduce a novel cross-modality contrastive-center (CMCC) loss
into the modality-joint constraints in the common feature space. The PEF loss and
CMCC loss jointly optimize the model in an end-to-end manner, which markedly improves
the network's performance. Extensive experiments demonstrate that the proposed model
significantly outperforms state-of-the-art methods on both the SYSU-MM01 and RegDB
datasets.

Point Cloud Projection and Multi-Scale Feature Fusion Network Based Blind Quality Assessment for Colored Point Clouds

Wen-xu Tao
Gang-yi Jiang
Zhi-di Jiang
Mei Yu

With the wide application of colored point clouds (CPCs) in many fields, much attention
has been paid to CPC distortions caused by compression and reconstruction.
How to effectively evaluate the visual quality of CPC has become an urgent issue to
be resolved. In this paper, a Point cloud projection and Multi-scale feature fusion
network based Blind Visual Quality Assessment method (denoted as PM-BVQA) is proposed
for CPC. CPC in 3D space is first projected into 2D color projection map and geometric
projection map, then a multi-scale feature fusion network is designed to blindly evaluate
the visual quality of CPC. The proposed PM-BVQA method includes three modules, that
is, joint color-geometric feature extractor, two-stage multi-scale feature fusion,
and spatial pooling module. Considering the multi-channel characteristics of human
visual system (HVS), unimodal features of different scales are obtained by joint color-geometric
feature extractor from the color and geometric projection maps. The fusion of the
unimodal color and geometric features is carried out to capture the cross-modal complementary
information between these two types of information. By integrating cross-modal fused
features at different scales, the complementary relationships between different channels
of HVS are simulated. The spatial pooling module takes into account the attention
mechanism of HVS and realizes the weighted summation of local regional quality to
obtain the final global quality score of CPC. A subjective CPC database with coding
distortion is used to verify the effectiveness of the proposed method, and the experimental
results show that the proposed blind quality assessment method is more consistent
with the subjective visual perception than the existing quality assessment methods.

Multi-branch Channel-wise Enhancement Network for Fine-grained Visual Recognition

Guangjun Li
Yongxiong Wang
Fengting Zhu

The challenge in fine-grained visual classification (FGVC) is that the similarity
within intra-class may be larger than inter-class, where the discriminative details
require more attention than traditional classification tasks. To generate channel-wise
complementary and discriminative features in beneficial details of FGVC, we propose
a multi-branch channel-wise enhancement network (MCEN), which includes multi-pattern
spatial disruption mechanism, inter-channel complementarity module (ICM), and novel
soft target loss. The raw images are scrambled in multi-pattern and then the sub-images
with different degrees of confusion are combined into three pairs as inputs, where
the scrambled operation can force the channel to look for the discriminative details.
And ICM can measure the complementarity between key features and overall features
to restrain the redundancy of features. The soft target loss is designed for classification
and the semantic relationship between the blocks is learned to judge the degree of
the chaos of the image. Our designed multi-branched structure utilizes the shallow
visual and deep semantic features to judge the outcome jointly, where the image pairs
obtained by segmentation and rearrangement are input into the different branches to
extract more complementary features from different patterns of the raw image. Our
method is trained end-to-end with only class labels. Experimental results show that
our model outperforms the state-of-the-art performance on three fine-grained benchmarks.

General Approximate Cross Validation for Model Selection: Supervised, Semi-supervised and Pairwise Learning

Bowei Zhu
Yong Liu

Cross-validation (CV) is a ubiquitous model-agnostic tool for assessing the error
of machine learning. However, it has high complexity because the learner must be trained
multiple times, especially in multimedia tasks with huge amounts of data.
In this paper, we provide a unified framework to approximate the CV error for various
common multimedia tasks such as supervised, semi-supervised and pairwise learning
which requires training only once. Moreover, we study the theoretical performance
of the proposed approximate CV and provide an explicit finite-sample error bound.
Experimental results on several datasets demonstrate that our approximate CV has no
statistical discrepancy from the original CV, but can significantly improve the efficiency,
which is a great advantage in model selection.
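
To illustrate why estimating the CV error from a single training run is attractive, the sketch below uses a classical special case rather than the framework proposed in the paper: for ridge regression, the leave-one-out error can be computed exactly from one fit via the leverage identity. The single-feature, no-intercept model and all function names are assumptions chosen to keep the example small.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (single feature, no intercept)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def loo_error_brute_force(xs, ys, lam):
    """Leave-one-out mean squared error, refitting the model n times."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Train on every sample except the i-th, then test on the i-th.
        w = fit_ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        total += (ys[i] - w * xs[i]) ** 2
    return total / n


def loo_error_single_fit(xs, ys, lam):
    """Same quantity from one fit: residual_i / (1 - leverage_i)."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    w = fit_ridge_1d(xs, ys, lam)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = x * x / (sxx + lam)  # diagonal entry of the hat matrix
        total += ((y - w * x) / (1.0 - leverage)) ** 2
    return total / n
```

For this model the two functions agree up to floating-point error whenever `lam > 0`, while the second trains only once. For general learners no such exact identity exists, which is where an approximation with an error bound becomes necessary.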

Progressive and Selective Fusion Network for High Dynamic Range Imaging

Qian Ye
Jun Xiao
Kin-man Lam
Takayuki Okatani

This paper considers the problem of generating an HDR image of a scene from its LDR
images. Recent studies employ deep learning and solve the problem in an end-to-end
fashion, leading to significant performance improvements. However, it is still hard
to generate a good quality image from LDR images of a dynamic scene captured by a
hand-held camera, e.g., occlusion due to the large motion of foreground objects, causing
ghosting artifacts. The key to success relies on how well we can fuse the input images
in their feature space, where we wish to remove the factors leading to low-quality
image generation while performing the fundamental computations for HDR image generation,
e.g., selecting the best-exposed image/region. We propose a novel method that can
better fuse the features based on two ideas. One is multi-step feature fusion; our
network gradually fuses the features in a stack of blocks having the same structure.
The other is the design of the component block that effectively performs two operations
essential to the problem, i.e., comparing and selecting appropriate images/regions.
Experimental results show that the proposed method outperforms the previous state-of-the-art
methods on the standard benchmark tests.

Multimodal Relation Extraction with Efficient Graph Alignment

Changmeng Zheng
Junhao Feng
Ze Fu
Yi Cai
Qing Li
Tao Wang

Relation extraction (RE) is a fundamental process in constructing knowledge graphs.
However, previous methods on relation extraction suffer sharp performance decline
in short and noisy social media texts due to a lack of contexts. Fortunately, the
related visual contents (objects and their relations) in social media posts can supplement
the missing semantics and help to extract relations precisely. We introduce the multimodal
relation extraction (MRE), a task that identifies textual relations with visual clues.
To tackle this problem, we present a large-scale dataset which contains 15000+ sentences
with 23 pre-defined relation categories. Considering that the visual relations among
objects are corresponding to textual relations, we develop a dual graph alignment
method to capture this correlation for better performance. Experimental results demonstrate
that visual contents help to identify relations more precisely against the text-only
baselines. Besides, our alignment method can find the correlations between vision
and language, resulting in better performance. Our dataset and code are available
at https://github.com/thecharm/Mega.

Legitimate Adversarial Patches: Evading Human Eyes and Detection Models in the Physical World

Jia Tan
Nan Ji
Haidong Xie
Xueshuang Xiang

It is known that deep neural models are vulnerable to adversarial attacks. Digital
attacks can craft imperceptible perturbations but lack the ability to be applied in
physical environments. To address this issue, efforts have been made to study
physical patch attacks in the physical world, especially for object detection models.
Previous works mostly focus on evading the detection model itself but ignore the impact
of human observers. In this paper, we study legitimate adversarial attacks that evade
both human eyes and detection models in the physical world. To this end, we delve
into the issue of patch rationality, and propose some indicators for evaluating the
rationality of physical adversarial patches. Besides, we propose a novel framework
with a two-stage training strategy to generate our legitimate adversarial patches
(LAPs). Both in numerical simulations and physical experiments our LAPs have significant
attack effects and visual rationality.

Unsupervised Vehicle Search in the Wild: A New Benchmark

Xian Zhong
Shilei Zhao
Xiao Wang
Kui Jiang
Wenxuan Liu
Wenxin Huang
Zheng Wang

In urban surveillance systems, finding a specific vehicle in video frames efficiently
and accurately has always been an essential part of traffic supervision and criminal
investigation. Existing studies focus on vehicle re-identification (re-ID), but vehicle
search is still underexploited. These methods depend on the locations of many vehicles
(bounding boxes) that are not available in most real-world applications. Therefore,
the unsupervised joint study of vehicle location and identification for the observed
scene is a pressing need. Inspired by person search, we conduct a study on the vehicle
search while considering four main discrepancies among them, summarized as: 1) It
is challenging to select the candidate regions for the observed vehicle due to the
perspective differences (front or side); 2) The sides of the same type of vehicles
are almost the same, resulting in smaller inter-class differences; 3) There is a lack of
satisfactory datasets for vehicle search that meet practical scenarios; 4) Published supervised
search methods rely on datasets with expensive annotations. To address these issues, we have
established a new vehicle search dataset. We design an unsupervised framework on this
benchmark dataset to generate pseudo labels for further training existing vehicle
re-ID or person search models. Experimental results reveal that these methods become
less effective on vehicle search tasks. Therefore, the vehicle search task needs to
be further developed, and this dataset can advance the research of vehicle search.
https://github.com/zsl1997/VSW.

Meta-FDMixup: Cross-Domain Few-Shot Learning Guided by Labeled Target Data

Yuqian Fu
Yanwei Fu
Yu-Gang Jiang

A recent study [4] finds that existing few-shot learning methods, trained on the source
domain, fail to generalize to the novel target domain when a domain gap is observed.
This motivates the task of Cross-Domain Few-Shot Learning (CD-FSL). In this paper,
we realize that the labeled target data in CD-FSL has not been leveraged in any way
to help the learning process. Thus, we advocate utilizing few labeled target data
to guide the model learning. Technically, a novel meta-FDMixup network is proposed.
We tackle this problem mainly from two aspects. Firstly, to utilize the source and
the newly introduced target data of two different class sets, a mixup module is re-proposed
and integrated into the meta-learning mechanism. Secondly, a novel disentangle module
together with a domain classifier is proposed to extract the disentangled domain-irrelevant
and domain-specific features. These two modules together enable our model to narrow
the domain gap thus generalizing well to the target datasets. Additionally, a detailed
feasibility and pilot study is conducted to reflect the intuitive understanding of
CD-FSL under our new setting. Experimental results show the effectiveness of our new
setting and the proposed method. Codes and models are available at https://github.com/lovelyqian/Meta-FDMixup.
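
For readers unfamiliar with mixup, the standard formulation interpolates two samples and their label vectors with a coefficient between 0 and 1. The sketch below shows only that standard form; how the paper adapts it to source and target data with different class sets is not described in the abstract, and the function name and toy vectors are assumptions.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Standard mixup: convex combination of two samples and their labels.

    x_a, x_b are feature vectors; y_a, y_b are label vectors (e.g. one-hot);
    lam is the mixing coefficient in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y
```

With `lam = 1.0` the first sample is returned unchanged, and with `lam = 0.0` the second one is.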

Target-guided Adaptive Base Class Reweighting for Few-Shot Learning

Jiliang Yan
Deming Zhai
Junjun Jiang
Xianming Liu

For few-shot learning, minimizing the empirical risk cannot reach the optimal hypothesis
from image to its label due to the effect of overfitting. Therefore, most of the existing
work leverages a set of base classes with sufficient labeled samples to pre-train
a general encoder for feature representation, which is then applied for all few-shot
classification tasks without considering the uniqueness of the target task. We suppose
that different base classes help solve a target task in varying degrees, and some
classes even introduce a negative effect. To this end, we propose a Target-guided
Base Class Reweighting (TBR) approach, which uses a reweighting-in-the-loop optimization
algorithm to assign a set of weights for base classes adaptively given a target task.
Specifically, TBR learns the parameter of the encoder via minimizing weighted empirical
risk on base class data, then optimizes the weights according to the encoder's
performance on support set of the target task. Such an alternating optimization procedure
brings reweighting into the loop which makes the encoder more sensitive to the novel
classes of the target task. Extensive experiments demonstrate that the proposed method
can improve the performance of model-based approaches on two few-shot classification
benchmarks.

Deep Reasoning Network for Few-shot Semantic Segmentation

Yunzhi Zhuge
Chunhua Shen

Few-shot Semantic Segmentation (FSS) is a challenging problem in computer vision.
It aims at segmenting objects of the unseen categories given only one or several annotated
samples. The essence of FSS is to disseminate information from support images to query
images for segmenting the mutual object categories. In this paper, we propose a Dynamic
Reasoning Network (DRNet) to adaptively generate the parameters of predicting layers
and infer the segmentation mask for each unseen category. More specifically, an Attentional
Feature Integration Sub-network (AFIS) is first proposed to extract consistent features
from support images and query images. With shared weights, it stimulates the category
consistency of different data streams. Then a Pooling-based Guidance Module (PGM)
is used to correlate support features with query features progressively. To disseminate
information from support images to various query images, we further propose a Dynamic
Prediction Module (DPM) for generating the parameters of predicting layers. The proposed
modules are unified for the dynamic reasoning of each query image segmentation. Experiments
on two public benchmarks have demonstrated that our approach achieves superior performance
and outperforms the very recent state-of-the-art methods.

Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval

Gangjian Zhang
Shikui Wei
Huaxin Pang
Yao Zhao

Composed image retrieval aims at performing image retrieval task by giving a reference
image and a complementary text piece. Since composing both image and text information
can accurately model the users' search intent, composed image retrieval can perform
target-specific image retrieval task and be potentially applied to many scenarios
such as interactive product search. However, two key challenges must be addressed
in composed image retrieval. One of them is how to fuse heterogeneous image
and text piece in the query into a complementary feature space. The other is how to
bridge the heterogeneous gap between text pieces in the query and images in the database.
To address the issues, we propose an end-to-end framework for composed image retrieval,
which consists of three key components including Multi-modal Complementary Fusion
(MCF), Cross-modal Guided Pooling (CGP), and Relative Caption-aware Consistency (RCC).
By incorporating MCF and CGP modules, we can fully integrate the complementary information
of image and text piece in the query through multiple deep interactions and aggregate
obtained local features into an embedding vector. To bridge the heterogeneous gap,
we introduce the RCC constraint to align text pieces in the query and images in the
database. Extensive experiments on four public benchmark datasets show that the proposed
composed image retrieval framework achieves outstanding performance against the state-of-the-art
methods.

Similar Scenes Arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Guodun Li
Yuchen Zhai
Zehao Lin
Yin Zhang

Stylized image captioning systems aim to generate a caption not only semantically
related to a given image but also consistent with a given style description. One of
the biggest challenges with this task is the lack of sufficient paired stylized data.
Many studies focus on unsupervised approaches, without considering the perspective
of data augmentation. We begin with the observation that people may recall similar
emotions when they are in similar scenes, and often express similar emotions with
similar style phrases, which underpins our data augmentation idea. In this paper,
we propose a novel Extract-Retrieve-Generate data augmentation framework to extract
style phrases from small-scale stylized sentences and graft them to large-scale factual
captions. First, we design the emotional signal extractor to extract style phrases
from small-scale stylized sentences. Second, we construct the pluggable multi-modal
scene retriever to retrieve scenes represented with pairs of an image and its stylized
caption, which are similar to the query image or caption in the large-scale factual
data. In the end, based on the style phrases of similar scenes and the factual description
of the current scene, we build the emotion-aware caption generator to generate fluent
and diversified stylized captions for the current scene. Extensive experimental results
show that our framework can alleviate the data scarcity problem effectively. It also
significantly boosts the performance of several existing image captioning models in
both supervised and unsupervised settings, which outperforms the state-of-the-art
stylized image captioning methods in terms of both sentence relevance and stylishness
by a substantial margin.

Trajectory is not Enough: Hidden Following Detection

Danni Xu
Ruimin Hu
Zixiang Xiong
Zheng Wang
Linbo Luo
Dengshi Li

In outdoor crimes such as robbery and kidnapping, suspects generally secretly follow
their victims in public places and then look for opportunities to commit crimes. Video
anomaly detection (VAD) has achieved fruitful results through deep neural networks
(DNN). However, as an abnormal behavior without obvious abnormal physical features,
hidden following is highly similar to ordinary walking and accompanying behaviors,
so it is difficult to effectively detect hidden dangerous followers using video anomaly
detection methods or traditional trajectory analysis methods. We propose a "hidden follower"
detection (HFD) task and an HFD model based on gaze pattern extraction. It extracts
gaze pattern features of pedestrians from gaze-interval-series and introduces a time
series classification model to classify pedestrians with or without hidden following
purposes. Based on this model, we propose a hidden follower detection framework (HFDF)
to detect hidden followers from normal pedestrians, which utilizes the trajectories
and gaze patterns extracted from videos. To cope with the lack of test data, we construct
a dataset of 1200 pedestrians from the crowd simulation model to simulate scenes including
hidden followers, and we also collected a surveillance video dataset including the
hidden following behaviors. The experiments conducted on these two datasets show that
HFDF can consistently outperform the state-of-the-art method by a notable margin in
the HFD task on the commonly-used F1 benchmark.

Contrastive Learning for Cold-Start Recommendation

Yinwei Wei
Xiang Wang
Qi Li
Liqiang Nie
Yan Li
Xuanping Li
Tat-Seng Chua

Recommending purely cold-start items is a long-standing and fundamental challenge
in the recommender systems. Without any historical interaction on cold-start items,
the collaborative filtering (CF) scheme fails to leverage collaborative signals to
infer user preference on these items. To solve this problem, extensive studies have
been conducted to incorporate side information of items (e.g., content features) into
the CF scheme. Specifically, they employ modern neural network techniques (e.g., dropout,
consistency constraint) to discover and exploit the coalition effect of content features
and collaborative representations. However, we argue that these works underexplore
the mutual dependencies between content features and collaborative representations
and lack sufficient theoretical support, thus resulting in unsatisfactory performance
on cold-start recommendation.

In this work, we reformulate the cold-start item representation learning from an information-theoretic
standpoint. It aims to maximize the mutual dependencies between item content and collaborative
signals. Specifically, the representation learning is theoretically lower-bounded
by the integration of two terms: mutual information between collaborative embeddings
of users and items, and mutual information between collaborative embeddings and feature
representations of items. To model such a learning process, we devise a new objective
function founded upon contrastive learning and develop a simple yet efficient Contrastive
Learning-based Cold-start Recommendation framework (CLCRec). In particular, CLCRec
consists of three components: contrastive pair organization, contrastive embedding,
and contrastive optimization modules. It allows us to preserve collaborative signals
in the content representations for both warm and cold-start items. Through extensive
experiments on four publicly accessible datasets, we observe that CLCRec achieves
significant improvements over state-of-the-art approaches in both warm- and cold-start
scenarios.
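
As a rough illustration of an objective founded on contrastive learning, the sketch below implements the widely used InfoNCE loss for a single anchor. It is not taken from CLCRec: the abstract does not give the exact objective, and the dot-product score, the `temperature` parameter, and the function names are assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=1.0):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]
```

Minimizing this loss pulls the anchor toward its positive (for example, a collaborative embedding and the content representation of the same item) and pushes it away from the negatives. It is also a known lower-bound estimator of mutual information, which matches the information-theoretic framing in the abstract.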

CG-GAN: Class-Attribute Guided Generative Adversarial Network for Old Photo Restoration

Jixin Liu
Rui Chen
Shipeng An
Heng Zhang

Old photos are an important carrier to preserve the past. Usually, the degradation
of old photos is rather diverse and complex. Therefore, the existing methods to solve
conventional restoration tasks are difficult to generalize. To solve this problem,
we propose a novel method based on generative adversarial network. Our method utilizes
the class-attributes of old photos to complete restoration in latent space. Specifically,
we divide the process of restoring old photos into two stages: a global defect
restoration stage and a local detail restoration stage. In the global defect
restoration stage, we extract the latent representations of four classes of high-level
attributes that are smoothness, clarity, connectivity and completeness. We use latent
class-attribute information to restore global defects in latent space and we obtain
conditional control vector through a condition network to guide the subsequent local
detail restoration stage. In local detail restoration stage, we propose a dynamic
condition-guided restoration module that selects the most suitable combination of
features to further restore local details through a dynamic network. In addition,
we propose a dual discriminator to pay more attention to style and defect restoration.
We ignore the complex degradation of old photos to directly restore advanced class-attributes.
Therefore, our method has better generalization performance. Experiments show that
our method is superior to other existing methods of image restoration in terms of
visual quality and numerical metrics.

Get The Best of the Three Worlds: Real-Time Neural Image Compression in a Non-GPU Environment

Zekun Zheng
Xiaodong Wang
Xinye Lin
Shaohe Lv

Lossy image compression always faces a tradeoff between rate-distortion performance
and compression/decompression speed. With the advent of neural image compression,
hardware (GPU) becomes the new vertex in the tradeoff triangle. By resolving the high
GPU dependency and improving the low speed of neural models, this paper proposes two
non-GPU models that get the best of the three worlds. First, the CPU-friendly Independent
Separable Down-Sampling (ISD) and Up-Sampling (ISU) modules are proposed to lighten
the network while ensuring a large receptive field. Second, an asymmetric autoencoder
architecture is adopted to boost the decoding speed. At last, the Inverse Quantization
Residual (IQR) module is proposed to reduce the error caused by quantization. In terms
of rate-distortion performance, our network surpasses the state-of-the-art real-time
GPU neural compression work at medium and high bit rates. In terms of speed, our model's
compression and decompression speeds surpass all other traditional compression methods
except JPEG, using only CPUs. In terms of hardware, the proposed models are CPU friendly
and perform stably well in a non-GPU environment. The code is publicly available at
https://github.com/kengchikengchi/FasiNet.

Visual Language Based Succinct Zero-Shot Object Detection

Ye Zheng
Xi Huang
Li Cui

Because large-scale datasets need to be annotated to train modern deep-learning-based
object detection models, zero-shot object detection has become an important
research field which aims to simultaneously localize and recognize unseen objects
that are not observed during training. In order to improve the performance of zero-shot
object detection, recent state of the art methods tend to make complicated modifications
to the modern object detectors in terms of the model structure, loss function and
training process. They always take the simple modification as a baseline, and think
it is worse than more complicated methods. In contrast, we find that simple modification
can achieve better performance. Considering that the redundant modification may increase
the risk of over-fitting in seen classes and reduce generalization performance on
unseen classes, we propose a visual language based succinct zero-shot object detection
framework, which only replaces the classification branch in the modern object detector
with a lightweight visual-language network. Since zero-shot object detection is a
classic multi-modal learning protocol which consists of a visual feature space and
a language space, our visual-language network learns the visual language alignment
from the image and language data of seen classes and transfers this alignment to detect
unseen objects. In line with Occam's razor, "Entities should not be
multiplied unnecessarily", extensive experimental results show that our succinct framework
surpasses all existing zero-shot object detection methods on several benchmarks
and achieves a new state of the art.
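
The idea of replacing a fixed classification branch with visual-language alignment can be illustrated with a minimal sketch. It assumes that region features have already been projected into the language space, and it scores them against class word embeddings by cosine similarity. The vectors, the class names, and the function `classify_region` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify_region(region_embedding, class_embeddings):
    """Return the best-matching class name and the per-class similarity scores.

    class_embeddings maps a class name to its (hypothetical) word embedding.
    Unseen classes need only an embedding, not training images.
    """
    scores = {name: cosine(region_embedding, emb)
              for name, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy example: "zebra" stands in for a class that was unseen during training.
class_embeddings = {
    "cat": [1.0, 0.0, 0.2],
    "dog": [0.9, 0.3, 0.0],
    "zebra": [0.0, 1.0, 0.5],
}
region = [0.1, 0.9, 0.6]  # assumed output of the visual-to-language projection
best, scores = classify_region(region, class_embeddings)
```

Because classification reduces to a similarity lookup, adding a new class in this sketch only requires adding its embedding to the dictionary.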

GAMnet: Robust Feature Matching via Graph Adversarial-Matching Network

Bo Jiang
Pengfei Sun
Ziyan Zhang
Jin Tang
Bin Luo

Recently, deep graph matching (GM) methods have gained increasing attention. These
methods integrate graph node embedding, node/edge affinity learning, and the final
correspondence solver together in an end-to-end manner. For the deep graph matching problem,
one main issue is how to generate consensus node embeddings for both source and
target graphs that best serve graph matching tasks. In addition, it is also challenging
to incorporate the discrete one-to-one matching constraints into the differentiable
correspondence solver in a deep matching network. To address these issues, we propose
a novel Graph Adversarial Matching Network (GAMnet) for the graph matching problem. GAMnet
integrates graph adversarial embedding and graph matching simultaneously in a unified
end-to-end network, which aims to adaptively learn distribution-consistent and domain-invariant
embeddings for GM tasks. Also, GAMnet exploits sparse GM optimization as
the correspondence solver, which is differentiable and can naturally incorporate discrete one-to-one
matching constraints, approximately, in the final matching prediction. Experimental
results on three public benchmarks demonstrate the effectiveness and benefits of the
proposed GAMnet.

MCCN: Multimodal Coordinated Clustering Network for Large-Scale Cross-modal Retrieval

Zhixiong Zeng
Ying Sun
Wenji Mao

Cross-modal retrieval is an important multimedia research area which aims to take
one type of data as the query to retrieve relevant data of another type. Most of the
existing methods follow the paradigm of pair-wise learning and class-level learning
to generate a common embedding space, where the similarity of heterogeneous multimodal
samples can be calculated. However, in contrast to large-scale cross-modal retrieval
applications which often need to tackle multiple modalities, previous studies on cross-modal
retrieval mainly focus on two modalities (i.e., text-image or text-video). In addition,
for large-scale cross-modal retrieval with modality diversity, another important problem
is that the available training data are considerably modality-imbalanced. In this
paper, we focus on the challenging problem of modality-imbalanced cross-modal retrieval,
and propose a Multimodal Coordinated Clustering Network (MCCN) which consists of two
modules: a Multimodal Coordinated Embedding (MCE) module to alleviate the imbalanced
training data and a Multimodal Contrastive Clustering (MCC) module to tackle the imbalanced
optimization. The MCE module develops a data-driven approach to coordinate multiple
modalities via a multimodal semantic graph for the generation of modality-balanced training
samples. The MCC module learns class prototypes as anchors to preserve the pair-wise
and class-level similarities across modalities for intra-class compactness and inter-class
separation, and further introduces intra-class and inter-class margins to enhance
optimization flexibility. We conduct experiments on the benchmark multimodal datasets
to verify the effectiveness of our proposed method.

AFEC: Adaptive Feature Extraction Modules for Learned Image Compression

Yi Ma
Yongqi Zhai
Jiayu Yang
Chunhui Yang
Ronggang Wang

With the rapid development of various multimedia applications, research on image compression
technology has become particularly important. Learning-based compression methods have
developed rapidly and achieved excellent rate-distortion performance. Most existing
research has focused on designing a better entropy model to facilitate the probability
estimation without attaching importance to how to extract features from images more
effectively. However, information extracted by image compression networks is often
not realistic and complete enough, especially when the fixed-shape receptive field
of the compression network crosses the texture boundary of an image. In this paper,
we propose to extract high-fidelity image features adaptively with local textures
as the basic unit, which significantly improves the quality of the extracted information
and enhances the compactness of the latent representation of the image. Besides, a
cross-information-fusion gate is proposed to fuse the two features extracted from
the adaptive image feature extraction branch and the main compression branch for reducing
spatial redundancy in the latent representation. Experimental results demonstrate
our proposed method achieves superior performance compared to existing learned image
compression methods and traditional codecs and produces visually pleasing reconstructed
images with high-fidelity details.

How Video Super-Resolution and Frame Interpolation Mutually Benefit

Chengcheng Zhou
Zongqing Lu
Linge Li
Qiangyu Yan
Jing-Hao Xue

Video super-resolution (VSR) and video frame interpolation (VFI) are inter-dependent
for enhancing videos of low resolution and low frame rate. However, most studies treat
VSR and temporal VFI as independent tasks. In this work, we design a spatial-temporal
super-resolution network based on exploring the interaction between VSR and VFI. The
main idea is to improve the middle frame of VFI by the super-resolution (SR) frames
and feature maps from VSR. In the meantime, VFI also provides extra information for
VSR and thus, through interacting, the SR of consecutive frames of the original video
can also be improved by the feedback from the generated middle frame. Drawing on this,
our approach leverages a simple interaction of VSR and VFI and achieves state-of-the-art
performance on various datasets. Due to such a simple strategy, our approach is universally
applicable to any existing VSR or VFI networks for effectively improving their video
enhancement performance.

FOCAS: Practical Video Super Resolution using Foveated Rendering

Lingdong Wang
Mohammad Hajiesmaili
Ramesh K. Sitaraman

Super-resolution (SR) is a well-studied technique for reconstructing high-resolution
(HR) images from low-resolution (LR) ones. SR holds great promise for video streaming
since an LR video segment can be transmitted from the video server to the client that
then reconstructs the HR version using SR, resulting in a significant reduction in
network bandwidth. However, SR is seldom used in practice for real-time video streaming,
because the computational overhead of frame reconstruction results in large latency
and low frame rate.

To reduce the computational overhead and make SR practical, we propose a deep-learning-based
SR method called Foveated Cascaded Video Super Resolution (FOCAS). FOCAS relies
on the fact that human eyes only have high acuity in a tiny central foveal region
of the retina. FOCAS uses more neural network blocks in the foveal region to provide
higher video quality, while using fewer blocks in the periphery as lower quality is
sufficient. To optimize the computational resources and reduce reconstruction latency,
FOCAS formulates and solves a convex optimization problem to decide the number of
neural network blocks to use in each region of the frame. Using extensive experiments,
we show that FOCAS reduces the latency by 50%-70% while maintaining visual
quality comparable to traditional (non-foveated) SR. Further, FOCAS provides a 12-16x reduction
in the client-to-server network bandwidth in comparison with sending the full HR video
segments.
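
The foveated allocation described above can be sketched as a lookup from a pixel's distance to the gaze point to a number of network blocks. This is only an illustration: the region radii, the block counts, and the function `blocks_for_pixel` are hypothetical, and the block counts are taken as given here rather than obtained from the convex optimization mentioned in the abstract.

```python
import math

def blocks_for_pixel(x, y, gaze, radii, blocks):
    """Number of SR blocks to apply at pixel (x, y).

    gaze   -- (gx, gy) gaze position in pixels
    radii  -- ascending outer radii of the nested regions; the outermost
              region (periphery) is unbounded
    blocks -- blocks per region, with len(blocks) == len(radii) + 1
    """
    if len(blocks) != len(radii) + 1:
        raise ValueError("need exactly one more block count than radii")
    distance = math.hypot(x - gaze[0], y - gaze[1])
    for radius, count in zip(radii, blocks):
        if distance <= radius:
            return count
    return blocks[-1]  # periphery: cheapest setting

# Hypothetical configuration: fovea, mid-periphery, far periphery.
gaze = (100, 100)
radii = [50, 200]
blocks = [10, 5, 1]
fovea = blocks_for_pixel(100, 100, gaze, radii, blocks)
middle = blocks_for_pixel(200, 100, gaze, radii, blocks)
far = blocks_for_pixel(500, 100, gaze, radii, blocks)
```

In this sketch the computational cost of a frame is roughly the sum of block counts over pixels, which is the quantity a latency budget would constrain.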

Adaptive Affinity Loss and Erroneous Pseudo-Label Refinement for Weakly Supervised
Semantic Segmentation

Xiangrong Zhang
Zelin Peng
Peng Zhu
Tianyang Zhang
Chen Li
Huiyu Zhou
Licheng Jiao

Semantic segmentation has been continuously investigated in the last ten years, and
the majority of the established technologies are based on supervised models. In recent
years, image-level weakly supervised semantic segmentation (WSSS), including single-
and multi-stage processes, has attracted considerable attention due to its data labeling efficiency.
In this paper, we propose to embed affinity learning of multi-stage approaches in
a single-stage model. To be specific, we introduce an adaptive affinity loss to thoroughly
learn the local pairwise affinity. As such, a deep neural network is used to deliver
comprehensive semantic information in the training phase, whilst improving the performance
of the final prediction module. On the other hand, considering the existence of errors
in the pseudo labels, we propose a novel label reassign loss to mitigate over-fitting.
Extensive experiments are conducted on the PASCAL VOC 2012 dataset to evaluate the
effectiveness of our proposed approach, which outperforms other standard single-stage
methods and achieves comparable performance against several multi-stage methods.

Relationship-Preserving Knowledge Distillation for Zero-Shot Sketch Based Image Retrieval

Jialin Tian
Xing Xu
Zheng Wang
Fumin Shen
Xin Liu

Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging because of the modal gap between distributions
of sketches and images and the inconsistency of label spaces during training and testing.
Previous methods mitigate the modal gap by projecting sketches and images into a joint
embedding space. Most of them also bridge seen and unseen classes by leveraging semantic
embeddings, i.e., word vectors and hierarchical similarities. In this paper, we propose
Relationship-Preserving Knowledge Distillation (RPKD) to study generalizable embeddings
from the perspective of knowledge distillation, bypassing the use of semantic embeddings.
In particular, we first distill the instance-level knowledge to preserve inter-class
relationships without semantic similarities that require extra effort to collect.
We also reconcile the contrastive relationships among instances between different
embedding spaces, which is complementary to instance-level relationships. Furthermore,
embedding-induced supervision, which measures the similarities of an instance to partial
class embedding centers from the teacher, is developed to align the student's classification
confidences. Extensive experiments conducted on three benchmark ZS-SBIR datasets,
i.e., Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of our proposed
RPKD approach compared to the state-of-the-art methods.

Partially Fake it Till you Make It: Mixing Real and Fake Thermal Images for Improved Object Detection

Francesco Bongini
Lorenzo Berlincioni
Marco Bertini
Alberto Del Bimbo

In this paper we propose a novel data augmentation approach for visual content domains
that have scarce training datasets, compositing synthetic 3D objects within real scenes.
We show the performance of the proposed system in the context of object detection
in thermal videos, a domain where i) training datasets are very limited compared to
visible spectrum datasets and ii) creating full realistic synthetic scenes is extremely
cumbersome and expensive due to the difficulty in modeling the thermal properties
of the materials of the scene. We compare different augmentation strategies, including
state of the art approaches obtained through RL techniques, the injection of simulated
data and the employment of a generative model, and study how to best combine our proposed
augmentation with these other techniques. Experimental results demonstrate the effectiveness
of our approach, and our single-modality detector achieves state-of-the-art results
on the FLIR ADAS dataset.

CDP: Towards Optimal Filter Pruning via Class-wise Discriminative Power

Tianshuo Xu
Yuhang Wu
Xiawu Zheng
Teng Xi
Gang Zhang
Errui Ding
Fei Chao
Rongrong Ji

Neural network pruning has shown promising performance in reducing computational complexity
and facilitating the deployment of deep neural networks on resource-limited edge devices.
Most existing pruning methods focus on the indicators of the filter's weight, gradient,
or feature map and regard the weak or similar filters as network redundancy. In contrast,
the representation of discriminative power is also a fundamental attribute that enables
neural networks to have extraordinary performance in various tasks. However, such
representation is neglected in existing works. Alternatively, we propose a novel filter
pruning strategy via class-wise discriminative power (CDP). Unlike the previous methods,
CDP treats the filters that always yield large or small activation values as redundant
and reserves the filters that show different magnitudes in activations as they yield
high discriminative power. We further propose to obtain such discriminative power
by employing the widely-used Term Frequency-Inverse Document Frequency (TF-IDF) on
feature representations across classes. Specifically, the output of a filter is considered
as a word, and the whole feature map is considered as a document. Then, TF-IDF is
used to generate the relevance score between words and all documents. A filter with
low TF-IDF scores is less discriminative and can be pruned. Thus, the filters with high
TF-IDF scores are reserved. To the best of our knowledge, this is the first work that prunes
neural networks through class-wise discriminative power and measures such power by
introducing TF-IDF in feature representation among different classes. Without any
iterative process, CDP achieves better compression trade-offs compared to the state-of-the-art
compression algorithms. For instance, on VGG-16, we achieve a 68.05% FLOPs reduction
with 94.86% Top-1 accuracy on CIFAR-10. Moreover, a VGG-16 compressed with a 90.12% FLOPs
reduction still retains 93.30% Top-1 accuracy on CIFAR-10. The code is available
at https://github.com/Tianshuo-Xu/CDP-Towards-Optimal-Filter-Pruning-via-Cl...
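
The word/document analogy above can be made concrete with a small sketch. The TF and IDF definitions used here (activation share within a class, and a smoothed count of classes in which the filter fires above the overall mean) are illustrative assumptions, as are the names `tfidf_filter_scores` and `filters_to_prune`; the paper's exact formulation may differ.

```python
import math

def tfidf_filter_scores(mean_act):
    """Score each filter by its largest TF-IDF value over all classes.

    mean_act[c][f] is the non-negative mean activation of filter f on
    samples of class c (an assumed data layout).
    """
    num_classes = len(mean_act)
    num_filters = len(mean_act[0])
    # A filter "occurs" in a class when it fires above the overall mean.
    overall = sum(sum(row) for row in mean_act) / (num_classes * num_filters)
    row_totals = [sum(row) for row in mean_act]

    scores = []
    for f in range(num_filters):
        doc_freq = sum(1 for c in range(num_classes) if mean_act[c][f] > overall)
        # Smoothed IDF: exactly zero when the filter fires strongly for every class.
        idf = math.log((1 + num_classes) / (1 + doc_freq))
        best = 0.0
        for c in range(num_classes):
            tf = mean_act[c][f] / row_totals[c] if row_totals[c] > 0 else 0.0
            best = max(best, tf * idf)
        scores.append(best)
    return scores

def filters_to_prune(scores, k):
    """Indices of the k lowest-scoring filters."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]

# Toy input: filter 0 is always high, filter 1 always low, filter 2 class-selective.
mean_act = [
    [5.0, 0.1, 4.0],
    [5.0, 0.1, 0.2],
    [5.0, 0.1, 0.1],
]
scores = tfidf_filter_scores(mean_act)
pruned = filters_to_prune(scores, 2)
```

In this toy case the always-high and always-low filters receive the lowest scores and are selected for pruning, while the class-selective filter is kept, which matches the behavior the abstract describes.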

Face Hallucination via Split-Attention in Split-Attention Network

Tao Lu
Yuanzhi Wang
Yanduo Zhang
Yu Wang
Liu Wei
Zhongyuan Wang
Junjun Jiang

Recently, convolutional neural networks (CNNs) have been widely employed to advance
face hallucination due to their ability to predict high-frequency details from a
large number of samples. However, most of them fail to take into account the overall
facial profile and fine texture details simultaneously, resulting in reduced naturalness
and fidelity of the reconstructed face, and further impairing the performance of downstream
tasks (e.g., face detection, facial recognition). To tackle this issue, we propose
a novel external-internal split attention group (ESAG), which encompasses two paths
responsible for facial structure information and facial texture details, respectively.
By fusing the features from these two paths, the consistency of facial structure and
the fidelity of facial details are strengthened at the same time. Then, we propose
a split-attention in split-attention network (SISN) to reconstruct photorealistic
high-resolution facial images by cascading several ESAGs. Experimental results on
face hallucination and face recognition show that the proposed method not only significantly
improves the clarity of hallucinated faces, but also substantially improves subsequent face
recognition performance. The code has been released at https://github.com/mdswyz/SISN-Face-Hallucination.

Fake Gradient: A Security and Privacy Protection Framework for DNN-based Image Classification

Xianglong Feng
Yi Xie
Mengmei Ye
Zhongze Tang
Bo Yuan
Sheng Wei

Deep neural networks (DNNs) have demonstrated phenomenal success in image classification
applications and are widely adopted in multimedia internet of things (IoT) use cases,
such as smart home systems. To compensate for the limited resources on the IoT devices,
the computation-intensive image classification tasks are often offloaded to remote
cloud services. However, the offloading-based image classification could pose significant
security and privacy concerns to the user data and the DNN model, leading to effective
adversarial attacks that compromise the classification accuracy. The existing defense
methods either impact the original functionality or result in high computation or
model re-training overhead. In this paper, we develop a novel defense approach, namely
Fake Gradient, to protect the privacy of the data and defend against adversarial attacks
based on encryption of the output. Fake Gradient can hide the real output information
by generating fake classes and further mislead the adversarial perturbation generation
based on fake gradient knowledge, which helps maintain a high classification accuracy
on the perturbed data. Our evaluations using ImageNet and 7 popular DNN models indicate
that Fake Gradient is effective in protecting the privacy and defending against adversarial
attacks targeting image classification applications.

Integrating Semantic and Temporal Relationships in Facial Action Unit Detection

Zhihua Li
Xiang Deng
Xiaotian Li
Lijun Yin

Facial action unit (AU) detection is a challenging task due to the variety and subtlety
of individuals' facial behavior. Facial muscle characteristics such as temporal dependencies
and action correlations make AU detection differ from general multi-label classification
tasks, and capturing these two characteristics is the key to accurate AU detection.
However, there is little work to date taking both of them into consideration concurrently.
To capture the AU correlations in an image, we first disentangle the global (image)
feature into multiple AU-specific features with an AU contrastive loss, and then we
compute the feature for each AU by aggregating the features from the other AUs with
a self-attention based transformer. Different from the original transformer, we embed
the AU semantic dependency matrix into it to weakly guide the attention learning.
We then perform a weighted fusion of the AU-wise features to obtain the frame-wise features. We further
capture the temporal dependencies among frames by using another attention-based transformer,
which achieves information aggregation from the prior frames. Extensive experiments
on two benchmark datasets (i.e., BP4D and DISFA) demonstrate that the proposed framework
outperforms the state-of-the-art approaches.

Sparse to Dense Depth Completion using a Generative Adversarial Network with Intelligent
Sampling Strategies

Md Fahim Faysal Khan
Nelson Daniel Troncoso Aldas
Abhishek Kumar
Siddharth Advani
Vijaykrishnan Narayanan

Predicting dense depth accurately is essential for 3D scene understanding applications
such as autonomous driving and robotics. However, the depth obtained from commercially
available LiDAR and Time-of-Flight sensors is very sparse. With RGB color guidance,
modern convolutional neural network (CNN) based approaches can recover the missing
depth information. However, there could be scenarios such as low-light environments
where it might be difficult to get an associated RGB image with the sparse depth.
In this work, we propose a Generative Adversarial Network (GAN) that can accurately
predict the dense depth using only sparse samples without any RGB inputs. Generally,
the sparsity in the depth samples is uniformly distributed and cannot guarantee capturing
all intricate details. In this study, we also explore different variants of sparse
sampling strategies from uniform to feature based directed sampling. We find that
feature based intelligent sampling achieves a better compression ratio without sacrificing
intricate details, saving data communication bandwidth. Compared to uniform sampling,
depending on how aggressively the directed sampling is done, we observe about 3% to
25% reduction in size. We can easily reduce the size by 8% with directed sampling
without sacrificing the reconstruction accuracy. Although such directed sampling strategies
are not readily available with commercially viable depth sensors, we believe that
our study paves the way for future intelligent sensing and sampling strategies. To
further investigate data reduction and reconstruction accuracy trade-offs, we deploy
our GAN to generate higher resolution dense depth from 4 times smaller sparse samples.
With a slight decrease in accuracy, our GAN is able to recover the depth successfully,
which shows great promise for edge Internet of Things (IoT) applications where we have
very tight constraints on data transmission bandwidth. Our source code along with examples
is available at: https://github.com/kocchop/depth-completion-gan

How does Color Constancy Affect Target Recognition and Instance Segmentation?

Siyan Xue
Shaobing Gao
Minjie Tan
Zhen He
Liangtian He

Previous work has demonstrated that incorrect white balance (WB) in the camera image
signal processing pipeline has a negative impact on the performance of deep neural
networks (DNNs) in high-level vision tasks, and traditional image augmentation approaches
are not well suited for modeling WB errors. However, it is still unclear when this
impact occurs and for which kinds of images and objects. In this paper, we manually
labeled 2304 images from the RECommended dataset and NUS dataset and discovered that
the effect of WB on DNNs is greatly associated with object size and occlusion level
among objects. In images with incorrect WB, small objects and objects with heavily
occluded backgrounds are the main factors resulting in the bad performance of DNNs,
indicating that the effect of WB is clearly associated with the shape of objects.
Our findings may support that the functional role of some neurons in the visual cortex
(e.g., V1 or V4 areas) realizing color constancy (CC) and encoding object attributes
such as color and shape dependently is to contribute to high-level vision. Furthermore,
based on this scientific finding, we proposed a novel augmentation strategy to address
the negative impact of incorrect WB by expanding the training datasets in both color
transformation and synthetic occlusion. We compared our proposed strategy with the
current augmentation strategies and showed that our approach clearly improves the
performance of DNNs in detection and segmentation tasks with small objects and objects
with heavily occluded backgrounds.

Convolutional Transformer based Dual Discriminator Generative Adversarial Networks
for Video Anomaly Detection

Xinyang Feng
Dongjin Song
Yuncong Chen
Zhengzhang Chen
Jingchao Ni
Haifeng Chen

Detecting abnormal activities in real-world surveillance videos is an important yet
challenging task as the prior knowledge about video anomalies is usually limited or
unavailable. Although many approaches have been developed to resolve this problem,
few of them can capture the normal spatio-temporal patterns effectively and efficiently.
Moreover, existing works seldom explicitly consider the local consistency at frame
level and global coherence of temporal dynamics in video sequences. To this end, we
propose Convolutional Transformer based Dual Discriminator Generative Adversarial
Networks (CT-D2GAN) to perform unsupervised video anomaly detection. Specifically,
we first present a convolutional transformer to perform future frame prediction. It
contains three key components, i.e., a convolutional encoder to capture the spatial
information of the input video clips, a temporal self-attention module to encode the
temporal dynamics, and a convolutional decoder to integrate spatio-temporal features
and predict the future frame. Next, a dual discriminator based adversarial training
procedure, which jointly considers an image discriminator that can maintain the local
consistency at frame-level and a video discriminator that can enforce the global coherence
of temporal dynamics, is employed to enhance the future frame prediction. Finally,
the prediction error is used to identify abnormal video frames. Thorough empirical
studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue,
and Shanghai Tech Campus, demonstrate the effectiveness of the proposed adversarial
spatio-temporal modeling framework.
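
The last step, turning prediction error into a per-frame anomaly decision, can be sketched as follows. The abstract does not state the exact scoring rule, so this sketch assumes a convention common in future-frame-prediction work: compute the PSNR between the predicted and the actual frame, then min-max normalize within a video so that poorly predicted frames score close to 1. Frames are flattened lists of pixel values in [0, 1], and all names are hypothetical.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two flattened frames."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    mse = max(mse, 1e-10)  # avoid division by zero for perfect predictions
    return 10.0 * math.log10(max_val ** 2 / mse)

def anomaly_scores(psnrs):
    """Min-max normalize PSNRs within a video; low PSNR maps to a high score."""
    lo, hi = min(psnrs), max(psnrs)
    if hi == lo:
        return [0.0] * len(psnrs)
    return [1.0 - (p - lo) / (hi - lo) for p in psnrs]

# Toy video: three 4-pixel frames; the second is predicted poorly.
targets = [[0.5, 0.5, 0.5, 0.5]] * 3
preds = [
    [0.51, 0.51, 0.51, 0.51],
    [0.60, 0.60, 0.60, 0.60],
    [0.52, 0.52, 0.52, 0.52],
]
psnrs = [psnr(p, t) for p, t in zip(preds, targets)]
scores = anomaly_scores(psnrs)
most_anomalous = scores.index(max(scores))
```

A threshold on the normalized score then separates normal from abnormal frames; the threshold value would have to be chosen per dataset.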

Salient Error Detection based Refinement for Wide-baseline Image Interpolation

Yuan Chang
Yisong Chen
Guoping Wang

Wide-baseline image interpolation is useful in many multimedia applications such as
virtual street roaming and 3D TV. It is also a challenging problem because the large
translations and rotations of image patches make it hard to estimate the motion fields
between wide-baseline image pairs. We propose a refinement strategy based on salient
error detection to improve the result of existing approaches to wide-baseline image
interpolation, where we combine the advantages of methods based on piecewise-linear
transformation and methods based on a variational model. We first use a lightweight
interpolation method to estimate the initial motion field between the input image
pair, and synthesize the intermediate image as the initial result. Then we detect
regions with noticeable artifacts in the initial image to find areas whose motion
vectors should be refined. Finally, we refine the motion field of the detected regions
using a variational model based method, and obtain the refined intermediate image.
The refinement strategy of our method can be used as the post refinement step for
many other image interpolation algorithms. We show the effectiveness and efficiency
of our method through experiments on different datasets.

A Multi-Domain Adaptive Graph Convolutional Network for EEG-based Emotion Recognition

Rui Li
Yiting Wang
Bao-Liang Lu

Among all approaches to emotion recognition, electroencephalogram (EEG) is a
very effective tool and has received broad attention from researchers. In addition,
information across multimedia in EEG often provides a more complete picture of emotions.
However, few of the existing studies concurrently incorporate EEG information from
the temporal domain, the frequency domain, and functional brain connectivity. In this paper,
we propose a Multi-Domain Adaptive Graph Convolutional Network (MD-AGCN), fusing the
knowledge of both the frequency domain and the temporal domain to fully utilize the
complementary information of EEG signals. MD-AGCN also considers the topology of EEG
channels by combining the inter-channel correlations with the intra-channel information,
from which the functional brain connectivity can be learned in an adaptive manner.
Extensive experimental results demonstrate that our model exceeds state-of-the-art
methods in most experimental settings. At the same time, the results show that MD-AGCN
could extract complementary domain information and exploit channel relationships for
EEG-based emotion recognition effectively.

Interpolation Variable Rate Image Compression

Zhenhong Sun
Zhiyu Tan
Xiuyu Sun
Fangyi Zhang
Yichen Qian
Dongyang Li
Hao Li

Compression standards have been used to reduce the cost of image storage and transmission
for decades. In recent years, learned image compression methods have been proposed
and have achieved performance competitive with the traditional standards. However, in these
methods, a set of different networks is used for various compression rates, resulting
in a high cost in model storage and training. Although some variable-rate approaches
have been proposed to reduce the cost by using a single network, most of them bring
some performance degradation when applying fine rate control. To enable variable-rate
control without sacrificing the performance, we propose an efficient Interpolation
Variable-Rate (IVR) network, by introducing a handy Interpolation Channel Attention
(InterpCA) module in the compression network. With the use of two hyperparameters
for rate control and linear interpolation, the InterpCA achieves a fine PSNR interval
of 0.001 dB and a fine rate interval of 0.0001 Bits-Per-Pixel (BPP) with 9000 rates
in the IVR network. Experimental results demonstrate that the IVR network is the first
variable-rate learned method that outperforms VTM 9.0 (intra) in PSNR and Multiscale
Structural Similarity (MS-SSIM).
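
One way to picture rate control by linear interpolation is to blend two per-channel scaling vectors and apply the result to the latent channels. The sketch below is an assumption-laden illustration and not the InterpCA module itself: the two weight vectors stand in for learned attention at two neighboring rate points, `alpha` stands in for a rate-control hyperparameter, and all names are hypothetical.

```python
def interpolate_channel_weights(w_low, w_high, alpha):
    """Linearly blend two per-channel weight vectors; alpha must lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(w_low) != len(w_high):
        raise ValueError("weight vectors must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_low, w_high)]

def scale_channels(latent, weights):
    """Multiply every value of each latent channel by that channel's weight."""
    return [[w * v for v in channel] for channel, w in zip(latent, weights)]

# Hypothetical weights for a lower-rate and a higher-rate operating point.
w_low = [1.0, 2.0]
w_high = [3.0, 4.0]
mid_weights = interpolate_channel_weights(w_low, w_high, 0.5)

latent = [[1.0, -1.0], [0.5, 2.0]]  # two channels, two values each
scaled = scale_channels(latent, mid_weights)
```

Because `alpha` is continuous, a single pair of weight vectors yields arbitrarily many intermediate operating points in this sketch, which is the intuition behind fine-grained rate control with one network.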

Armor: A Benchmark for Meta-evaluation of Artificial Music

Songhe Wang
Zheng Bao
Jingtong E

Objective evaluation (OE) is essential to artificial music, but it's often very hard
to determine the quality of OEs. To date, subjective evaluation (SE) remains reliable
and prevailing but suffers from inevitable disadvantages that OEs may overcome. Therefore,
a meta-evaluation system is necessary for designers to test the effectiveness of OEs.
In this paper, we present Armor, a complex and cross-domain benchmark dataset that
serves this purpose. Since OEs should correlate with human judgment, we provide music
as test cases for OEs and human judgment scores as touchstones. We also provide two
meta-evaluation scenarios and their corresponding testing methods to assess the effectiveness
of OEs. To the best of our knowledge, Armor is the first comprehensive and rigorous
framework that future works could follow, take as an example, and improve upon for the
task of evaluating computer-generated music and the field of computational music as
a whole. By analyzing different OE methods on our dataset, we observe that there is
still a huge gap between SE and OE, meaning that hard-coded algorithms are far from
capturing human judgment of music.

DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic
Framework

Haiwen Hong
Xuan Jin
Yin Zhang
Yunqing Hu
Jingfeng Zhang
Yuan He
Hui Xue

In multimodal tasks, the importance of text and image modal information often varies
for different input cases. To model the difference of importance of different modal
information, we propose a high-performance and highly general Dual-Router Dynamic
Framework (DRDF), consisting of Dual-Router, MWF-Layer, experts and expert fusion
unit. The text router and image router in Dual-Router take text modal information
and image modal information respectively, and MWF-Layer is responsible for determining
the importance of modal information. Based on the result of the determination, MWF-Layer
generates fused weights for the subsequent expert fusion. Experts can adopt a variety
of backbones that match the current multimodal or unimodal task. DRDF features high
generality and modularity, and we test 12 backbones such as Visual BERT and their
corresponding DRDF instances on the multimodal dataset Hateful Memes, and unimodal
datasets CIFAR10, CIFAR100, and TinyImagenet. Our DRDF instance outperforms those
backbones. We also validate the effectiveness of components of DRDF by ablation studies,
and discuss the reasons and ideas of DRDF design.

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei

BERT-type structures have led to a revolution in vision-language pre-training and
the achievement of state-of-the-art results on numerous vision-language downstream
tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask
tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling
and masked object/frame prediction). In this work, we argue that such masked inputs
would inevitably introduce noise for the cross-modal matching proxy task, and thus leave
the inherent vision-language association under-explored. As an alternative, we derive
a particular form of cross-modal proxy objective for video-language pre-training,
i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked
frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens
video-language association by simultaneously pursuing inter-modal matching and intra-modal
denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy
objective can be further integrated into any BERT-type encoder-decoder structure for
video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We
pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset
(ACTION). Through extensive experiments over a wide range of downstream tasks (e.g.,
cross-modal retrieval, video question answering, and video captioning), we demonstrate
the superiority of CoCo-BERT as a pre-trained structure.
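
As a rough illustration of the contrastive idea in the abstract above, the sketch below computes an InfoNCE-style loss that pulls each unmasked embedding toward its masked counterpart and pushes it away from the other items in the batch. This is a generic contrastive loss, not the CoCo objective as defined in the paper; the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Average InfoNCE loss over a batch.

    anchors[i] is expected to match positives[i]; every other
    positives[j] in the batch acts as a negative (assumed setup).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp of the logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical embeddings: unmasked video clips and masked captions.
video_unmasked = [[1.0, 0.0], [0.0, 1.0]]
text_masked = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = info_nce(video_unmasked, text_masked)
shuffled_loss = info_nce(video_unmasked, text_masked[::-1])
```

With correctly paired inputs the loss is close to zero, and it grows when the pairing is shuffled. The same function could in principle be applied within one modality (masked versus unmasked frames) to mimic the denoising term.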

DLA-Net for FG-SBIR: Dynamic Local Aligned Network for Fine-Grained Sketch-Based Image Retrieval

Jiaqing Xu
Haifeng Sun
Qi Qi
Jingyu Wang
Ce Ge
Lejian Zhang
Jianxin Liao

Fine-grained sketch-based image retrieval (FG-SBIR) is considered an ideal alternative to
keyword-based image retrieval and image search by image because sketches are rich and
easily accessible. Previous works follow a paradigm that first extracts a global image
feature with a convolutional neural network and then optimizes the model with a triplet
loss. These works put much effort into narrowing the domain gap and extracting
discriminative features. However, they ignore that the global feature is not good at
capturing fine-grained details. In this paper, we emphasize that local features are more
discriminative than the global feature in FG-SBIR and explore an effective way to utilize
local features. Specifically, we first propose the Local Aligned Network (LA-Net), which
solves FG-SBIR by directly aligning mid-level local features. Experiments show that it
beats all previous baselines and is easy to implement. We hope LA-Net can serve as a new
strong baseline for FG-SBIR. Next, we propose the Dynamic Local Aligned Network (DLA-Net)
to enhance LA-Net. LA-Net does not consider the spatial misalignment caused by the
abstraction of the sketch. To solve this problem, a dynamic alignment mechanism is
introduced into LA-Net. This mechanism makes the sketch interact with the photo and
dynamically decide where to align according to the different photos. Experiments indicate
that DLA-Net successfully addresses the problem of spatial misalignment. It gains a
significant performance boost over LA-Net and outperforms the state of the art in FG-SBIR.
To the best of our knowledge, DLA-Net is the first model that beats humans on all
datasets: QMUL FG-SBIR, QMUL Handbag, and Sketchy.
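
To make the triplet-loss paradigm and the idea of comparing local features concrete, here is a minimal sketch. It assumes each sketch or photo is represented by a list of local feature vectors at fixed spatial positions and that the distance is the mean of position-wise Euclidean distances. These choices, the margin value, and all names are illustrative assumptions, not the LA-Net or DLA-Net design.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def local_distance(sketch_locals, photo_locals):
    """Mean distance between local features at the same spatial position.

    Assumes both inputs list their local features in the same order.
    """
    pairs = list(zip(sketch_locals, photo_locals))
    return sum(euclidean(s, p) for s, p in pairs) / len(pairs)

def triplet_loss(sketch, photo_pos, photo_neg, margin=0.2):
    """Hinge loss that wants the matching photo closer than the
    non-matching one by at least `margin` (hypothetical value)."""
    d_pos = local_distance(sketch, photo_pos)
    d_neg = local_distance(sketch, photo_neg)
    return max(0.0, d_pos - d_neg + margin)

# Toy local features: two spatial positions, two channels each.
sketch = [[0.0, 0.0], [1.0, 1.0]]
matching_photo = [[0.1, 0.0], [1.0, 0.9]]
other_photo = [[1.0, 0.0], [0.0, 1.0]]

good_triplet = triplet_loss(sketch, matching_photo, other_photo)
bad_triplet = triplet_loss(sketch, other_photo, matching_photo)
```

A fixed position-wise pairing like this is exactly what breaks down when the sketch is spatially misaligned with the photo, which is the problem the abstract says the dynamic alignment mechanism targets.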

Pareto Optimality for Fairness-constrained Collaborative Filtering

Qianxiu Hao
Qianqian Xu
Zhiyong Yang
Qingming Huang

The well-known collaborative filtering (CF) models typically optimize a single objective
summed over all historical user-item interactions. Due to inevitable imbalances and
biases in real-world data, they may develop a policy that unfairly discriminates against
certain subgroups with low sample frequencies. To balance overall recommendation performance
and fairness, prevalent solutions apply fairness constraints or regularizations to
enforce equality of certain performance across different subgroups. However, simply
enforcing equality of performance may lead to large performance degradation of those
advantaged subgroups. To address this issue, we formulate a constrained Multi-Objective
Optimization (MOO) problem. In contrast to the single objective, we treat the performance
of each subgroup equivalently as an objective. This ensures that the imbalanced subgroup
sample frequency does not affect the gradient information. We further propose fairness
constraints to limit the search space to obtain more balanced solutions. To solve
the constrained MOO problem, a gradient-based constrained MOO algorithm is proposed
to seek a proper Pareto optimal solution for the performance trade-off. Extensive
experiments on synthetic and real-world datasets show that our approach could help
improve the recommendation accuracy of disadvantaged groups, while not damaging the
overall performance.
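
The idea of treating each subgroup's performance as its own objective can be illustrated with the well-known two-objective min-norm rule from multiple-gradient descent. The sketch below is the unconstrained two-gradient case only; it is not the constrained algorithm proposed in the paper, and the gradient values and function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))

def common_descent_direction(g1, g2):
    """Combine two subgroup gradients into one shared update direction."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]

# Hypothetical loss gradients for an advantaged and a disadvantaged subgroup.
grad_group_a = [1.0, 0.0]
grad_group_b = [-0.5, 1.0]

direction = common_descent_direction(grad_group_a, grad_group_b)
```

Stepping against `direction` does not increase either subgroup's loss to first order, because the combined vector has a non-negative inner product with both gradients. The weight depends only on the gradients themselves, not on how many samples each subgroup contributes.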

Decoupled IoU Regression for Object Detection

Yan Gao
Qimeng Wang
Xu Tang
Haochen Wang
Fei Ding
Jing Li
Yao Hu

Non-maximum suppression (NMS) is widely used in object detection pipelines for removing
duplicated bounding boxes. The inconsistency between the confidence for NMS and the
real localization confidence seriously affects detection performance. Prior works
propose to predict Intersection-over-Union (IoU) between bounding boxes and corresponding
ground-truths to improve NMS, but accurately predicting IoU remains a challenging
problem. We argue that the complex definition of IoU and feature misalignment make
it difficult to predict IoU accurately. In this paper, we propose a novel Decoupled
IoU Regression (DIR) model to handle these problems. The proposed DIR decouples the
traditional localization confidence metric IoU into two new metrics, Purity and Integrity.
Purity reflects the proportion of the object area in the detected bounding box, and
Integrity refers to the completeness of the detected object area. Separately predicting
Purity and Integrity can divide the complex mapping between the bounding box and its
IoU into two clearer mappings and model them independently. In addition, a simple
but effective feature realignment approach is also introduced to make the IoU regressor
work in a hindsight manner, which can make the target mapping more stable. The proposed
DIR can be conveniently integrated with existing two-stage detectors and significantly
improve their performance. Through a simple implementation of DIR with HTC, we obtain
51.3% AP on the MS COCO benchmark, which outperforms previous methods and achieves state-of-the-art results.
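
One way to read the decomposition described above is at the box level: Purity as the fraction of the detected box covered by the ground-truth box, and Integrity as the fraction of the ground-truth box covered by the detection. Under that reading (an assumption; the paper may define the two terms differently), IoU can be recovered as 1 / (1/Purity + 1/Integrity - 1). The sketch below checks this identity on toy boxes in (x1, y1, x2, y2) format; all function names are hypothetical.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    """Area of the overlap between two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)

def purity(det, gt):
    """Share of the detected box that lies on the object (assumed definition)."""
    return intersection(det, gt) / area(det)

def integrity(det, gt):
    """Share of the object that the detected box covers (assumed definition)."""
    return intersection(det, gt) / area(gt)

def iou(det, gt):
    """Standard Intersection-over-Union."""
    inter = intersection(det, gt)
    return inter / (area(det) + area(gt) - inter)

def iou_from_parts(p, i):
    """Recombine Purity and Integrity into IoU."""
    if p == 0.0 or i == 0.0:
        return 0.0
    return 1.0 / (1.0 / p + 1.0 / i - 1.0)

detected = (0.0, 0.0, 4.0, 4.0)
ground_truth = (2.0, 0.0, 6.0, 4.0)

p = purity(detected, ground_truth)
i = integrity(detected, ground_truth)
recombined = iou_from_parts(p, i)
```

The identity follows from 1/Purity + 1/Integrity - 1 = (area(det) + area(gt) - inter) / inter, which is the reciprocal of IoU.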

RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object Detection

Zhuofan Zong
Qianggang Cao
Biao Leng

Feature pyramid networks (FPN) are widely exploited for multi-scale feature fusion
in existing advanced object detection frameworks. Numerous previous works have developed
various structures for bidirectional feature fusion, all of which are shown to improve
the detection performance effectively. We observe that these complicated network structures
require feature pyramids to be stacked in a fixed order, which introduces longer pipelines
and reduces the inference speed. Moreover, semantics from non-adjacent levels are
diluted in the feature pyramid since only features at adjacent pyramid levels are
merged by the local fusion operation in a sequential manner. To address these issues,
we propose a novel architecture named RCNet, which consists of Reverse Feature Pyramid
(RevFP) and Cross-scale Shift Network (CSN). RevFP utilizes local bidirectional feature
fusion to simplify the bidirectional pyramid inference pipeline. CSN directly propagates
representations to both adjacent and non-adjacent levels to make multi-scale features
more correlated. Extensive experiments on the MS COCO dataset demonstrate that RCNet can
consistently bring significant improvements over both one-stage and two-stage detectors
with little extra computational overhead. In particular, RetinaNet is boosted to 40.2
AP, which is 3.7 points higher than baseline, by replacing FPN with our proposed model.
On COCO test-dev, RCNet can achieve very competitive performance with a single-model
single-scale 50.5 AP.

Recursive Fusion and Deformable Spatiotemporal Attention for Video Compression Artifact Reduction

Minyi Zhao
Yi Xu
Shuigeng Zhou

A number of deep learning based algorithms have been proposed to recover high-quality
videos from low-quality compressed ones. Among them, some restore the missing details
of each frame via exploring the spatiotemporal information of neighboring frames.
However, these methods usually suffer from a narrow temporal scope and thus may miss
useful details from frames outside the neighboring ones. In this paper,
to boost artifact removal, on the one hand, we propose a Recursive Fusion (RF) module
to model the temporal dependency within a long temporal range. Specifically, RF utilizes
both the current reference frames and the preceding hidden state to conduct better
spatiotemporal compensation. On the other hand, we design an efficient and effective
Deformable Spatiotemporal Attention (DSTA) module such that the model can devote more
effort to restoring artifact-rich areas, such as the boundary area of a moving object.
Extensive experiments show that our method outperforms the existing ones on the MFQE
2.0 dataset in terms of both fidelity and perceptual effect. Code is available at
https://github.com/zhaominyiz/RFDA-PyTorch.

JokerGAN: Memory-Efficient Model for Handwritten Text Generation with Text Line Awareness

Jan Zdenek
Hideki Nakayama

Collecting labeled data for training of models for image recognition problems, including
handwritten text recognition (HTR), is a tedious and expensive task. Recent work on
handwritten text generation shows that generative models can be used as a data augmentation
method to improve the performance of HTR systems.

We propose a new method for handwritten text generation that uses generative adversarial
networks with multi-class conditional batch normalization, which enables us to use
character sequences with variable lengths as conditional input. Compared to existing
methods, it has significantly lower memory requirements which are almost constant
regardless of the size of the character set. This allows us to train a generative
model for languages with a large number of characters, such as Japanese. We also introduce
an additional condition that makes the generator aware of vertical properties of the
characters in the generated sequence, which helps generate text with well-aligned
characters in the text line.

Experiments on handwritten text datasets show that our proposed model can be used
to boost the performance of HTR, particularly when we only have access to partially
annotated data and train our generative model in a semi-supervised fashion. The results
also show that our model outperforms the current state-of-the-art for handwritten
text generation. In addition, we perform a human evaluation study that indicates that
the proposed method generates handwritten text images that look more realistic and
natural.

Our source code will be available at https://github.com/janzd/jokergan.

SESSION: Tutorials

Image Quality Assessment in the Modern Age

Kede Ma
Yuming Fang

This tutorial provides the audience with the basic theories, methodologies, and current
progress of image quality assessment (IQA). From an actionable perspective, we will
first revisit several subjective quality assessment methodologies, with emphasis on
how to properly select visual stimuli. We will then present in detail the design principles
of objective quality assessment models, supplemented by an in-depth analysis of their
advantages and disadvantages. Both hand-engineered and (deep) learning-based methods
will be covered. Moreover, the limitations of the conventional model comparison
methodology for objective quality models will be pointed out, and novel comparison
methodologies such as those based on the theory of "analysis by synthesis" will be
introduced. Finally, we will discuss the real-world multimedia applications of IQA and
give a list of open challenging problems, in the hope of encouraging more talented
researchers and engineers to devote themselves to this exciting and rewarding research
field.

Trustworthy Multimedia Analysis

Xiaowen Huang
Jiaming Zhang
Yi Zhang
Xian Zhao
Jitao Sang

This tutorial discusses the trustworthiness issue in multimedia analysis. Starting
from introducing two types of spurious correlations learned from distilling human
knowledge, we partition the (visual) feature space along two dimensions of task-relevance
and semantic-orientation. Trustworthy multimedia analysis ideally relies on the task-relevant
semantic features and consists of three modules: a trainer, an interpreter, and a tester.
These three modules essentially form a closed loop and respectively address the goals
of extracting task-relevant features, extracting task-relevant semantic features,
and detecting spurious correlations to be corrected by the trainer and interpreter.

Multimedia Classifiers: Behind the Scenes

Manjunath Iyer

This tutorial provides an in-depth understanding of the art and science behind the
decision-making in a multimedia classifier. A multimedia classifier typically takes
image, text, waveform, ordinal, or categorical data, or a combination of these, as
the input and produces a single output indicating the class of the input pattern.
Such an AI/ML system is extensively used as a decision-making element in several
autonomous systems. The yardstick that human experts use to decide on an input pattern
often differs from the one the system uses, even when both produce the same output.
In some cases, the outputs differ for the same input data, which calls the reliability
of the model into question. If such models are used in critical applications,
which is often the case in an autonomous system, adequate mitigations have to be taken
to minimize the impact of misjudgments. This calls for opening up the decision-making
process in black-box classifiers. Opening the black box is also an urgent need for
regulatory bodies. The EU has already made it mandatory
to provide the details of the decision-making mechanism if it involves some form of
AI/ML component. This tutorial sheds light on the decision-making process in a classifier
that may be used for a variety of applications. More than one technique for getting a
glimpse of the classifier in action will be discussed. The explanation can come in the form
of a heatmap indicating the relevant features influencing the decision-making process,
patterns learnt by the neurons, or a textual description of the attributes of the input.
The bottom line is that these explanations must be consistent. The mechanism for achieving
coherent explanations will be detailed in this tutorial.

Few-shot Learning for Multi-Modality Tasks

Jie Chen
Qixiang Ye
Xiaoshan Yang
S. Kevin Zhou
Xiaopeng Hong
Li Zhang

Recent deep learning methods rely on a large amount of labeled data to achieve high
performance. These methods may be impractical in some scenarios, where manual data
annotation is costly or the samples of certain categories are scarce (e.g., tumor
lesions, endangered animals and rare individual activities). When only limited annotated
samples are available, these methods usually suffer severely from overfitting,
which degrades the performance significantly. In contrast, humans can recognize the
objects in images rapidly and correctly with their prior knowledge after being exposed
to only a few annotated samples. To simulate the learning schema of humans and relieve
the reliance on large-scale annotation benchmarks, researchers have started shifting
towards the few-shot learning problem: they try to learn a model to correctly recognize
novel categories with only a few annotated samples.

Plenoptic Quality Assessment: The JPEG Pleno Experience

Antonio M. G. Pinheiro

Plenoptic representations, like light fields, point clouds or digital holography,
provide the means for 3D representations suitable for multiple immersive and computer
vision applications. JPEG has been standardizing coding tools for these types of plenoptic
data in its project JPEG Pleno. This standardization effort has been developing quality
assessment models suitable for the quality evaluation of the coding technologies.
In this tutorial, the quality assessment methodologies defined for the evaluation of
the different proposals for the three plenoptic modalities are explained. The tutorial
also includes possible alternatives considered in the definition of the quality assessment
models and the selection of appropriate anchors decided during the JPEG Pleno development
process.

A Tutorial on AI Music Composition

Xu Tan
Xiaobing Li

AI music composition is one of the most attractive and important topics in artificial
intelligence, music, and multimedia. The typical tasks in AI music composition include
melody generation, song writing, accompaniment generation, arrangement, performance
generation, timbre rendering, sound generation, and singing voice synthesis, which
cover different modalities (e.g., symbolic music score, sound) and match the
theme of ACM Multimedia well. With the rapid development of artificial intelligence techniques
such as content creation and deep learning, AI-based music composition has achieved
rapid progress but still faces many challenges. A thorough introduction to
and review of the basics, the research progress, and ways to address the challenges
in AI music composition is timely and necessary for a broad audience working on artificial
intelligence, music, and multimedia. In this tutorial, we will first introduce the
background of AI music composition, including music basics and deep learning techniques
for music composition. Then we will introduce AI music composition from two perspectives:
1) key components, which include music score generation, music performance generation,
and music sound generation; 2) advanced topics, which include music structure/form/style/emotion
modeling, timbre synthesis/transfer/mixing, etc. Finally, we will point out some research
challenges and future directions in AI music composition. This tutorial can serve
both academic researchers and industry practitioners working on AI music composition.

Out-of-distribution Generalization and Its Applications for Multimedia

Xin Wang
Peng Cui
Wenwu Zhu

Out-of-distribution generalization is becoming a hot research topic in both academia
and industry. This tutorial aims to disseminate and promote the recent research achievements
on out-of-distribution generalization as well as their applications on multimedia,
which is an exciting and fast-growing research direction in the general field of machine
learning and multimedia. We will advocate novel, high-quality research findings, as
well as innovative solutions to the challenging problems in out-of-distribution generalization
and its applications for multimedia. This topic is at the core of the scope of ACM
Multimedia, and is attractive to the MM audience from both academia and industry.

Deep Learning for Visual Data Compression

Guo Lu
Ren Yang
Shenlong Wang
Shan Liu
Radu Timofte

In this paper, we will introduce the recent progress in deep learning based visual
data compression, including image compression, video compression and point cloud compression.
In the past few years, deep learning techniques have been successfully applied to
various computer vision and image processing applications. However, for the data compression
task, the traditional approaches (e.g., block-based motion estimation and motion compensation)
are still widely employed in the mainstream codecs. Considering the powerful
representation capability of neural networks, it is feasible to improve the data compression
performance by employing advanced deep learning technologies. To this end,
deep learning based compression approaches have recently received increasing attention
from both academia and industry in the fields of computer vision and signal processing.

SESSION: Workshop Summaries

ADVM'21: 1st International Workshop on Adversarial Learning for Multimedia

Aishan Liu
Xinyun Chen
Yingwei Li
Chaowei Xiao
Xun Yang
Xianglong Liu
Dawn Song
Dacheng Tao
Alan Yuille
Anima Anandkumar

Deep learning has achieved significant success in multimedia fields involving computer
vision, natural language processing, and acoustics. However, research in adversarial
learning also shows that deep models are highly vulnerable to adversarial examples. Extensive
works have demonstrated that adversarial examples can easily fool deep neural networks
into wrong predictions, threatening practical deep learning applications in both the digital
and physical worlds. Though challenging, discovering and harnessing adversarial attacks
is beneficial for diagnosing model blind-spots and further understanding as well as
improving multimedia systems in practice. In this workshop, we aim to bring together
researchers from the fields of adversarial machine learning, model robustness, and
explainable AI to discuss recent research and future directions for adversarial robustness
of deep learning models, with a particular focus on multimedia applications, including
computer vision, acoustics, etc. As far as we know, this is the first workshop to focus
on adversarial learning of multimedia deep learning systems, which is of great significance,
and we hope it will be held annually in conjunction with ACM MM.

AIxFood'21: 3rd Workshop on AIxFood

Ricardo Guerrero
Michael Spranger
Shuqiang Jiang
Chong-Wah Ngo

Food and cooking analysis present exciting research and application challenges for
modern AI systems, particularly in the context of multimodal data such as images or
video. A meal that appears in a food image is a product of a complex progression of
cooking stages, often described in the accompanying textual recipe form. In the cooking
process, individual ingredients change their physical properties, become combined
with other food components, all to produce a final, yet highly variable, appearance
of the meal. Recognizing food items or meals on a plate from images or videos, their
physical properties such as the amount, nutritional content such as the caloric value,
food attributes such as the flavor, elucidating the cooking process behind it, or
creating robotic assistants that help users complete that cooking process, is of essential
scientific and technological value yet technically extremely challenging. The 3rd
AIxFood workshop was held as a half-day workshop in conjunction with the 29th ACM
International Conference on Multimedia (ACM MM 2021), in Chengdu, China and virtually.

HUMA'21: 2nd International Workshop on Human-centric Multimedia Analysis

Wu Liu
Xinchen Liu
Jingkuan Song
Dingwen Zhang
Wenbing Huang
Junbo Guo
John Smith

The Second International Workshop on Human-centric Multimedia Analysis is focused
on human-centric analysis using multimedia information. Human-centric multimedia
analysis is one of the fundamental and challenging problems of multimedia understanding.
It involves various human-centric analysis tasks like face recognition, human pose
estimation, person re-identification, human action recognition, person tracking, human-computer
interaction, etc. Nowadays, various multimedia sensing devices and large-scale computing
infrastructures are generating a wide variety of multi-modality data at a rapid velocity,
which supplies rich knowledge to tackle these challenges for human-centric analysis.
Researchers and engineers have strived to push the limits of human-centric multimedia
analysis in a wide variety of applications, such as smart city, retailing, intelligent
manufacturing, and public services. To this end, our workshop aims to provide a platform
to promote exchanges and integration for the fields of human analysis and multimedia.

MMSports'21: 4th International Workshop on Multimedia Content Analysis in Sports

Rainer Lienhart
Thomas B. Moeslund
Hideo Saito

The fourth ACM International Workshop on Multimedia Content Analysis in Sports (ACM
MMSports'21) is part of the ACM International Conference on Multimedia 2021 (ACM Multimedia
2021). Exceptionally, due to the COVID-19 pandemic, the workshop is held virtually.
The goal of this workshop is to bring together researchers and practitioners from
academia and industry to address challenges and report progress in mining, analyzing,
understanding and visualizing the multimedia/multimodal data in sports, sports broadcasts,
sports games and sports medicine. The combination of sports and modern technology
offers a novel and intriguing field of research with promising approaches for visual
broadcast augmentation and understanding, statistical analysis and evaluation, and
sensor fusion during workouts as well as competitions. There is a lack of research
communities focusing on the fusion of multiple modalities. We are helping to close
this research gap with this workshop series on multimedia content analysis in sports.

SUMAC'21: 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents

Valérie Gouet-Brunet
Margarita Khokhlova
Ronak Kosti
Li Weng

SUMAC 2021 is the third edition of the workshop on Structuring and Understanding of
Multimedia heritAge Contents. It is held in Chengdu, China on October 20th, 2021 and
is co-located with the 29th ACM International Conference on Multimedia. Its objective
is to present and discuss the latest and most significant trends and challenges in
the analysis, structuring and understanding of multimedia contents dedicated to the
valorization of heritage, with the emphasis on the unlocking of and access to the
big data of the past. A representative scope of Computer Science methodologies dedicated
to the processing of multimedia heritage contents and their exploitation is covered
by the works presented, with the ambition of advancing and raising awareness about
this fully developing research field.

UrbanMM'21: 1st International Workshop on Multimedia Computing for Urban Data

Stevan Rudinac
Alessandro Bozzon
Tat-Seng Chua
Suzanne Little
Daniel Gatica-Perez
Kiyoharu Aizawa

Understanding the complex processes that give cities their form has traditionally relied primarily
on the analysis of various open data statistics in relation to e.g. neighbourhood
demographics, economy and mobility. However, recent years have seen an unprecedented
increase in the availability and use of city-related sensors, participatory data and
social multimedia. As the valuable information about urban challenges is usually encoded
across multiple modalities, such as visual (e.g. panoramic, satellite and user-contributed
images), text (e.g. social media and participatory data) and open data statistics,
extracting this information requires effective multimedia analysis tools. This Workshop
will showcase the power of multimedia computing in addressing various urban challenges,
ranging from event detection and analysis, location recommendation and crowdedness
estimation to more efficient handling of citizen reports and modelling and improving
city liveability. In addition, it will serve as an impulse for the multimedia community
to intensify research on these interesting, challenging and truly multimodal problems.

ADGD'21: 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection

Stefan Winkler
Weiling Chen
Abhinav Dhall
Pavel Korshunov

Deepfakes, i.e., synthetic or "fake" media content generated using deep learning, are
a double-edged sword. On one hand, they pose new threats and risks in the form of
scams, fraud, disinformation, social manipulation, or celebrity porn. On the other
hand, deepfakes have just as many meaningful and beneficial applications - they allow
us to create and experience things that no longer exist, or that have never existed,
enabling numerous exciting applications in entertainment, education, and even privacy.

While most work has focused on fake images and video alone, the multi-modal, audiovisual
aspect is very important to both convincing generation and accurate detection of fake
multimedia content. Therefore, we organize ADGD'21: 1st Workshop on Synthetic Multimedia
- Audiovisual Deepfake Generation and Detection so as to provide a platform for researchers
and engineers to share their ideas and approaches in this field.

Related Workshop Proceedings are available in the ACM DL at: http://dl.acm.org/citation.cfm?id=3476099

FME'21: 1st Workshop on Facial Micro-Expression: Advanced Techniques for Facial Expressions Generation and Spotting

Jingting Li
Moi Hoon Yap
Wen-Huang Cheng
John See
Xiaopeng Hong
Xiaobai Li
Su-Jing Wang

Facial micro-expressions (FMEs) are involuntary facial movements that occur spontaneously
when a person experiences an emotion but tries to suppress or repress the facial expression
and usually occur in high-risk situations. Thus, FMEs are very short in duration,
an important feature that distinguishes them from ordinary facial expressions. Moreover,
FMEs are considered to be one of the most valuable cues for complex human emotion understanding
and lie detection. Since 2014, the computational analysis and automation of FMEs have
been an emerging area of face research. The workshop will explore various dimensions
of the human mind through emotion understanding and FME analysis, as well as extended
research based on multimodal approaches.

MuCAI'21: 2nd ACM Multimedia Workshop on Multimodal Conversational AI

Joao Magalhaes
Alexander G. Hauptmann
Ricardo G. Sousa
Carlos Santiago

The second edition of the International Workshop on Multimodal Conversational AI puts
forward a diverse set of contributions that aim to brainstorm this new field. Conversational
agents are now becoming a commodity as this technology is being applied to a wide
range of domains. Healthcare, assistive technologies, e-commerce, and information seeking
are some of the domains where multimodal conversational AI is being explored. The
wide use of multimodal conversational agents exposes the many challenges in achieving
more natural, human-like, and engaging conversational agents. The research contributions
of the Workshop actively address several relevant challenges: How to include assistive technologies
in dialog systems? How can agents engage in negotiation in dialogs? How to handle
the embodiment of conversational agents?

Keynote speakers, both with real-world experience in conversational AI, will share
their most recent and exciting work. The panel will address technological, ethical,
legal and social aspects of conversational search. Finally, invited contributions
from research projects will showcase how the different domains can benefit from conversational
technology.

MULL'21: First International Workshop on Multimedia Understanding with Less Labeling

Xiu-Shen Wei
Jufeng Yang
Han-Jia Ye
Jian Yang

With the advent of deep neural networks, quite a lot of multimedia tasks have been
significantly improved. However, deep neural networks still lack the ability
to learn from less labeling, e.g., from limited exemplars, or to generalize quickly
to new tasks. In order to address the current inefficiency of multimedia, there is a
pressing need to research methods that drastically reduce the requirements for labeled training
data. This workshop aims to provide a platform for discussing the challenges and corresponding
innovative approaches in multimedia with less labeling. We hope more advanced technologies
can be proposed or inspired, and we also invite several domain-specific experts to
share their insights and research progress on the topic of MULL.

MuSe 2021 Challenge: Multimodal Emotion, Sentiment, Physiological-Emotion, and Stress Detection

Lukas Stappen
Eva-Maria Meßner
Erik Cambria
Guoying Zhao
Björn W. Schuller

The 2nd Multimodal Sentiment Analysis (MuSe) 2021 Challenge-based Workshop is held
in conjunction with ACM Multimedia'21. Two datasets are provided as part of the challenge.
Firstly, the MuSe-CaR dataset, which focuses on user-generated, emotional vehicle
reviews from YouTube, and secondly, the novel Ulm-Trier Social Stress (Ulm-TSST) dataset,
which shows people in stressful circumstances. Participants are faced with four sub-challenges:
predicting arousal and valence in a time- and value-continuous manner on a) MuSe-CaR
(MuSe-Wilder) and b) Ulm-TSST (MuSe-Stress); c) predicting emotion classes created without
supervision on MuSe-CaR (MuSe-Sent); d) predicting a fusion of human-annotated arousal
and measured galvanic skin response also as a continuous target on Ulm-TSST (MuSe-Physio).
In this summary, we describe the motivation, the sub-challenges, the challenge conditions,
the participation, and the most successful approaches.

Trustworthy AI'21: 1st International Workshop on Trustworthy AI for Multimedia Computing

Teddy Furon
Jingen Liu
Yogesh Rawat
Wei Zhang
Qi Zhao

In this workshop, we are addressing the trustworthy AI issues for Multimedia Computing.
We aim to bring together researchers in the trustworthy aspects of Multimedia Computing
and facilitate discussions on injecting trust into multimedia to develop trustworthy
AI techniques that are reliable and acceptable to multimedia researchers and practitioners.
Our scope is at the conjunction of multimedia, computer vision and trustworthy AI,
including Explainability, Robustness and Safety, Data Privacy, Accountability and
Transparency, and Fairness.

WAB'21: 1st Workshop on Multimodal Product Identification in Livestreaming and WAB Challenge

Yueting Zhuang
Xing Tang
Guilin Wu
Yahong Han
Haihong Tang
Xiaobo Li
Xiaohan Wang
Baoming Yan
Bo Gao
Yi Yang

Product identification has become a very important component in the modern E-commerce
shopping system. Consumers can enjoy watching livestreams and buying products
that livestream hosts recommend. However, with hundreds of products presented in
a livestreaming video, finding a specific product can be laborious for consumers.
Hence, automatic product identification is desired in livestreaming-based E-commerce
systems. Compared with the image-based visual searching system, the complicated contents
in the livestreaming videos make the identification even more challenging. To promote
the research on product identification in livestreaming, we present the largest multimodal
product retrieval dataset named "Watch and Buy" (WAB) and launch the multimodal product
retrieval challenge. We hope this workshop could help researchers further advance
the performance and applicability of livestreaming product identification in real-world
systems.