MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding

SESSION: Keynote Talks

Cross-modal Pretraining and Matching for Video Understanding

Limin Wang

Videos are generally accompanied by multi-modal information such as audio, text,
and motion. The multi-modal information is becoming an important cue for understanding
video content. How to model the correlation among the multiple modalities in videos is
still an unsolved problem in video understanding tasks such as video action recognition,
video temporal grounding, and video description. In this talk, we focus on two specific
video understanding tasks (i.e., cross-modal self-supervised pretraining and temporal
grounding) by exploiting the video-text cross modal information. In particular, we
notice that videos are naturally accompanied by abundant text information such as
YouTube titles, Instagram captions, and movie scripts. This textual information could
serve as a general signal to guide the training of a multi-modal network, which could
be used as a general video representation to be fine-tuned on the downstream tasks,
or as cross-modal matching similarity to be used for video segment retrieval. Specifically,
we first present a general cross-modal pair discrimination (CPD) framework to capture
this correlation between a video and its associated text. We train our CPD models
on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k)
to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain
competitive results for action classification on Kinetics under the linear classification
protocol. Moreover, our visual model provides an effective initialization to fine-tune
on downstream tasks, which yields a remarkable performance gain for action recognition
on UCF101 and HMDB51. Our CPD demonstrates that pre-training on a relatively small
dataset can yield performance comparable to that of methods using an order of
magnitude more data, which is meaningful and practical for scenarios with limited
computational resources. Second, we present a Contrastive and Compatible Matching
Network (C2M-Net), to directly model the relations between language queries and video
moments in a joint embedding space. This new metric-learning framework enables fully
exploiting negative samples from two new aspects: constructing negative pairs from
a dual matching scheme and mining negative pairs across different videos. These new
negative samples could enhance the joint representation learning of two modalities
via contrastive learning to maximize their mutual information. In addition, to precisely
rank relatively positive pairs for accurate temporal grounding, we also learn the
compatibility between queries and moments by directly regressing their IoU-based similarity.
Our C2M-Net yields state-of-the-art performance on three benchmarks: Charades-STA,
TACoS, and ActivityNet-Captions.
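
As a rough illustration of the IoU-based similarity mentioned above, the sketch below computes the temporal IoU between a candidate moment and a target segment and uses it as a regression label. The abstract does not give the exact formulation, so this is an assumption-laden sketch: moments are taken to be (start, end) pairs in seconds, squared error stands in for whatever regression loss C2M-Net actually uses, and the function names are hypothetical.

```python
def temporal_iou(moment, target):
    """Intersection over union of two time intervals given as (start, end)."""
    start_a, end_a = moment
    start_b, end_b = target
    # Overlap is clamped at zero for disjoint intervals.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def squared_error(predicted_score, moment, target):
    """Assumed regression loss between a predicted score and the IoU label."""
    return (predicted_score - temporal_iou(moment, target)) ** 2
```

Under this sketch, a candidate moment that overlaps the target only partially receives a label strictly between 0 and 1, which is what allows "relatively positive" pairs to be ranked rather than treated as plain positives or negatives.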

WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data

Ruihua Song

Multi-modal pre-training models have been intensively explored to bridge vision and
language in recent years. However, most of them explicitly model the cross-modal interaction
between image-text pairs, by assuming that there exists strong semantic correlation
between the text and image modalities. Since this strong assumption is often invalid
in real-world scenarios, we choose to implicitly model the cross-modal correlation
for large-scale multi-modal pre-training, which is the focus of the Chinese project
'WenLan' led by our team. Specifically, with the weak correlation assumption over
image-text pairs, we propose a two-tower pre-training model called BriVL within the
cross-modal contrastive learning framework [1]. We construct a large Chinese multi-source
dataset of 650 million image-text pairs for pre-training our model. Extensive experiments
demonstrate that WenLan is effective on various downstream tasks and makes it easy to
build efficient applications based on searching between images and texts.
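
To make the idea of two-tower cross-modal contrastive learning concrete, the sketch below computes a symmetric in-batch contrastive (InfoNCE-style) loss over paired image and text embeddings. This is a generic simplification, not BriVL's actual objective, which the abstract does not specify: the in-batch negatives, the temperature of 0.07, and the function names are all assumptions made for illustration, and embeddings are assumed to be non-zero vectors already produced by the two towers.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric in-batch contrastive loss; pair i of each list is the positive."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]

    def cross_entropy_on_diagonal(rows):
        # Softmax cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)
```

Because each tower encodes its modality independently, embeddings can be precomputed and indexed, which is one reason a two-tower design lends itself to search between images and texts.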

SESSION: MMPT 2021 Workshop Presentation

Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer

Yupan Huang
Zhaoyang Zeng
Yutong Lu

Automatically generating natural language descriptions for images, i.e., image captioning,
is one of the primary goals for multimedia understanding. The recent success of deep
neural networks in image captioning has been accompanied by region-based bottom-up-attention
features. Region-based features are representative of the contents of local regions
while lacking an overall understanding of images, which is critical to more specific
and clear language expression. Visual scene perception can facilitate overall understanding
and provide prior knowledge to generate specific and clear captions of objects, object
relations, and overall image scenes. In this paper, we propose a Scene-Guided Transformer
(SG-Transformer) model that leverages the scene-level global context to generate more
specific and descriptive image captions. SG-Transformer adopts an encoder-decoder
architecture. The encoder aggregates global scene context as external knowledge with
object region-based features in attention learning to facilitate object relation reasoning.
It also incorporates high-level auxiliary scene-guided tasks towards more specific
visual representation learning. Then the decoder integrates both object-level and
scene-level information refined by the encoder for an overall image perception. Extensive
experiments on MSCOCO and Flickr30k benchmarks show the superiority and generality
of SG-Transformer. Besides, the proposed scene-guided approach can enrich object-level
and scene graph visual representations in the encoder and generalize to both RNN-
and Transformer-based architectures in the decoder.

Language-Conditioned Region Proposal and Retrieval Network for Referring Expression
Comprehension

Yanwei Xie
Daqing Liu
Xuejin Chen
Zheng-Jun Zha

Referring expression comprehension (REC) is a multi-modal task that aims to localize
target regions in images according to language descriptions. Existing methods can
be grouped into two categories: proposal-based methods and proposal-free methods.
Proposal-based methods first detect all candidate objects in the image and then retrieve
the target among those objects based on the language description, while proposal-free
methods directly locate the region based on the language without any region proposals.
However, the proposal-based methods suffer from separate region proposal networks
that actually do not suit this task well, and the proposal-free methods are not able
to perform fine-grained visual-language alignments to yield higher precision. To overcome
the above drawbacks, we propose a language-conditioned region proposal and retrieval
network that first detects those regions only related to the language and then retrieves
the target region by compositional reasoning on the language. Specifically, the proposed
network consists of a language-conditioned region proposal network (LC-RPN) to detect
those language-related regions, and a language-conditioned region retrieval network
(LC-RRN) to perform region retrieval with a full understanding of the language. A
pre-training mechanism is proposed to teach our model knowledge about language decomposing
and vision-language alignment. Experimental results demonstrate that our proposed
method achieves leading performance with high inference speed on RefCOCO, RefCOCO+,
and RefCOCOg benchmarks.

Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores

Aozhi Liu
Lipei Zhang
Yaqi Mei
Baoqiang Han
Zifeng Cai
Zhaohua Zhu
Jing Xiao

One of the challenges of the Optical Music Recognition task is to transcribe the symbols
in camera-captured images into digital music notation. The previous end-to-end model,
developed as a Convolutional Recurrent Neural Network, does not exploit sufficient
contextual information across full scales, leaving considerable room for improvement.
We propose an innovative framework that combines a block of Residual Recurrent Convolutional
Neural Network with a recurrent Encoder-Decoder network to predict the sequence of monophonic
music symbols corresponding to the notation present in the image. The Residual Recurrent
Convolutional block improves the ability of the model to enrich the context information.
The experimental results are benchmarked on a publicly available dataset called
CAMERA-PRIMUS and demonstrate that our approach surpasses the state-of-the-art end-to-end
method using a Convolutional Recurrent Neural Network.

Style-Guided Image-to-Image Translation for Multiple Domains

Tingting Li
Huan Zhao
Song Wang
Jing Huang

Cross-domain image translation has drawn increasing attention. It aims to translate
images from a source domain into target domains, such that images can appear in multiple
styles. The most popular approaches are using encoders to extract style features from
the source domain and then pushing them into a generator to produce new images. However,
these methods are usually suited only to translation between two domains, and show low diversity
across multiple domains because the extracted features are fed only coarsely into the
generator rather than being fully exploited. In this paper, we design a novel loss
function, style-guided diversity loss (Sd loss), which utilizes the extracted style
features to encourage our model to explore the image space and discover diverse
images. It is proved theoretically that the proposed loss is better than the diversity-sensitive
loss in state-of-the-art approaches. In addition, qualitative and quantitative
experiments demonstrate the superiority of the proposed approach against several state-of-the-art
approaches in terms of the quality and the diversity of translated images.
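
For context on the baseline mentioned above, the sketch below shows one common form of a diversity-sensitive loss: the mean absolute difference between two images generated from the same source image with two different style codes, which the generator is trained to maximize. This is an assumption about the baseline only. The abstract does not define the proposed Sd loss, so nothing here reproduces it, and the function names, the flat-list image representation, and the weight parameter are hypothetical.

```python
def diversity_sensitive_loss(image_a, image_b):
    """Mean absolute pixel difference between two outputs for the same input."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)


def generator_objective(adversarial_loss, image_a, image_b, weight=1.0):
    """Diversity is rewarded, so its term is subtracted from the loss to minimize."""
    return adversarial_loss - weight * diversity_sensitive_loss(image_a, image_b)
```

In this form the style codes only affect the loss through the generated images, which is consistent with the abstract's remark that existing methods do not make full use of the extracted style features.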

A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

Gullal S. Cheema
Sherzod Hakimov
Eric Müller-Budack
Ralph Ewerth

Opinion and sentiment analysis is a vital task to characterize subjective information
in social media posts. In this paper, we present a comprehensive experimental evaluation
and comparison of six state-of-the-art methods, one of which we have re-implemented.
In addition, we investigate different textual and visual feature embeddings
that cover different aspects of the content, as well as the recently introduced multimodal
CLIP embeddings. Experimental results are presented for two different publicly available
benchmark datasets of tweets and corresponding images. In contrast to the evaluation
methodology of previous work, we introduce a reproducible and fair evaluation scheme
to make results comparable. Finally, we conduct an error analysis to outline the limitations
of the methods and possibilities for future work.

Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial
Networks with Self-Attention

Matthias Springstein
Eric Müller-Budack
Ralph Ewerth

The recognition of handwritten mathematical expressions in images and video frames
is still a difficult and unsolved problem. Deep convolutional neural networks are in principle
a promising approach, but typically require a large amount of labeled training data.
However, such a large training dataset does not exist for the task of handwritten
formula recognition. In this paper, we introduce a system that creates a large set
of synthesized training examples of mathematical expressions which are derived from
LaTeX documents. For this purpose, we propose a novel attention-based generative adversarial
network to translate rendered equations to handwritten formulas. The datasets generated
by this approach contain hundreds of thousands of formulas, making them well suited for pretraining
or the design of more complex models. We evaluate our synthesized dataset and the
recognition approach on the CROHME 2014 benchmark dataset. Experimental results demonstrate
the feasibility of the approach.
