MULL'21: Multimedia Understanding with Less Labeling

Full Citation in the ACM Digital Library

SESSION: Oral Presentations

Occlusion Contrasts for Self-Supervised Facial Age Estimation

Weiwei Cai
Hao Liu

In this paper, we propose an Occlusion Contrast (OCCO) approach for self-supervised
age estimation on partially occluded faces. Conventional facial age estimation
approaches take fully visible faces as input and therefore do not generalize well
to occluded images. Our approach instead aims to ignore the occlusion and focus only
on the non-occluded facial areas, thereby improving age estimation accuracy on
occluded faces. To achieve this, we use self-supervised contrastive learning to learn
non-occluded feature representations, since contrastive learning pulls the anchor and
positive samples as close together as possible in the embedding space while
simultaneously pushing the negative samples apart. Furthermore, OCCO incorporates
the ordinal relationship among different ages, which is modeled by deep label distribution
learning. Because face aging datasets usually suffer from label imbalance,
we employ a cost-sensitive strategy to constrain the learning of the classifier. Extensive
experimental results on two face aging datasets show that OCCO not only achieves
satisfactory performance on masked faces but is also comparable to state-of-the-art
age estimation methods on raw facial images.
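
The pull-together, push-apart behavior described in the abstract can be illustrated with a generic InfoNCE-style contrastive loss. This is only a minimal sketch: the abstract does not state which contrastive loss OCCO uses, so the cosine similarity, the `temperature` value, and the function names `cosine` and `info_nce` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    The loss is small when the anchor is close to the positive and far
    from every negative in the embedding space.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D embeddings: an aligned positive gives a lower loss
# than a positive that points away from the anchor.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = info_nce([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

In an occlusion setting, the anchor and positive could for instance be embeddings of differently occluded views of the same face, but that pairing is likewise an assumption rather than something the abstract specifies.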

Incomplete Label Distribution Learning by Exploiting Global Sample Correlation

Qifa Teng
Xiuyi Jia

In recent years, label distribution learning (LDL) has become a new learning paradigm
in the field of machine learning. LDL is mainly designed to solve the problem of ambiguity
among labels. Although LDL has been successful in many applications, most of these
efforts are centered around complete supervised information. However, in reality,
the supervised information is often incomplete due to the huge cost of label annotation.
To address this problem, this paper proposes a novel incomplete LDL approach by utilizing
the global sample correlation (IncomLDL-GSC). The label correlation is also considered
to improve the performance of the model. Extensive experiments are conducted on 13
data sets to demonstrate the effectiveness of our proposed method.

Improving Multimodal Data Labeling with Deep Active Learning for Post Classification
in Social Networks

Dmitry Krylov
Semen Poliakov
Natalia Khanzhina
Alexey Zabashta
Andrey Filchenkov
Aleksandr Farseev

Automatic user post classification is an important task in the field of social network
analysis. If solved effectively, post classification could be used for thematic
user feed composition or inappropriate content identification. The task is commonly
addressed with various machine learning approaches and often involves manual processes
for ground truth sourcing, a procedure known to be hard to scale and increasingly
expensive. Active Learning for automatic user post classification
is a promising way to bridge this gap: it does not require massive ground truth
availability, which aligns our research with real-world settings. In this work, we
focus on leveraging textual and visual data modalities for user post classification
and investigate how batch size and disabling batch normalization affect the active
learning process of a deep neural network. We
solve the problem of automatic user post classification by employing our novel multimodal
neural network architecture with multi-head tunable loss function components. We show
that the proposed approach, coupled with Active Learning, achieves
a significant classification performance boost in terms of crowd assessment resources
compared to passive learning approaches.
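
One round of pool-based active learning with a fixed batch size can be sketched as follows. The abstract does not say which acquisition function the authors use, so entropy-based uncertainty sampling is an assumption here, and `entropy` and `select_batch` are hypothetical helper names.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool_probs, batch_size):
    """Return indices of the `batch_size` most uncertain unlabeled posts.

    `pool_probs[i]` is the model's predicted class distribution for
    unlabeled post i. Entropy sampling is an assumed acquisition
    function, not necessarily the one used in the paper.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]

# Hypothetical predictions for three unlabeled posts (two classes).
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
to_label = select_batch(pool, batch_size=2)  # indices sent to annotators
```

The selected posts would be labeled by annotators, added to the training set, and the model retrained before the next round; the batch size controls how many labels are requested per round.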

Contextual Image Parsing via Panoptic Segment Sorting

Jyh-Jing Hwang
Tsung-Wei Ke
Stella X. Yu

Real-world visual recognition is far more complex than object recognition; there is
stuff without distinctive shape or appearance, and the same object appearing in different
contexts calls for different actions. While we need context-aware visual recognition,
visual context is hard to describe and impossible to label manually. We consider visual
context as semantic correlations between objects and their surroundings that include
both object instances and stuff categories. We approach contextual object recognition
as a pixel-wise feature representation learning problem that accomplishes supervised
panoptic segmentation while discovering and encoding visual context automatically.
Panoptic segmentation is a dense image parsing task that segments an image into regions
with both semantic category and object instance labels. These two aspects can conflict
with each other: two adjacent cars have the same semantic label but different
instance labels. Whereas most existing approaches handle the two labeling tasks separately
and then fuse the results together, we propose a single pixel-wise feature learning
approach that unifies both aspects of semantic segmentation and instance segmentation.
Our work takes the metric learning perspective of SegSort but extends it non-trivially
to panoptic segmentation, as we must merge segments into proper instances and handle
instances of various scales. Our most exciting result is the emergence of visual context
in the feature space through contrastive learning between pixels and segments, such
that we can retrieve a person crossing a somewhat empty street without any such context
labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that,
in terms of surround semantics distributions, our retrievals are much more consistent
with the query than the state-of-the-art segmentation method, validating our pixel-wise
representation learning approach for the unsupervised discovery and learning of visual
context.

SESSION: Other Presentations

Iterative Image Translation for Unsupervised Domain Adaptation

Sachin Chhabra
Hemanth Venkateswara
Baoxin Li

In this paper, we propose an image-translation-based unsupervised domain adaptation
approach that iteratively trains an image translation and a classification network
using each other. In Phase A, a classification network is used to guide the image
translation to preserve the content and generate images. In Phase B, the generated
images are used to train the classification network. With each step, the classification
network and generator improve each other to learn the target domain representation.
Detailed analysis and experiments demonstrate the strength of our approach.

Glocal Alignment for Unsupervised Domain Adaptation

Sachin Chhabra
Prabal Bijoy Dutta
Baoxin Li
Hemanth Venkateswara

Traditional unsupervised domain adaptation methods attempt to align source and target
domains globally and are agnostic to the categories of the data points. This results
in an inaccurate categorical alignment and diminishes the classification performance
on the target domain. In this paper, we alter existing adversarial domain alignment
methods to adhere to category alignment by imputing category information. We partition
the samples based on category using source labels and target pseudo labels and then
apply domain alignment for every category. Our proposed modification provides a boost
in performance even with a modest pseudo label estimator. We evaluate our approach
on 4 popular domain alignment loss functions using object recognition and digit datasets.
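
The partition-then-align step can be sketched as below. The paper applies adversarial domain alignment losses per category; this sketch swaps in a much simpler stand-in (squared distance between per-category feature means) purely to show the partitioning by source labels and target pseudo labels. All function names are hypothetical.

```python
from collections import defaultdict

def partition_by_category(features, labels):
    """Group feature vectors by their (pseudo) label."""
    groups = defaultdict(list)
    for feature, label in zip(features, labels):
        groups[label].append(feature)
    return groups

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def categorywise_alignment(src_feats, src_labels, tgt_feats, tgt_pseudo):
    """Average a per-category alignment penalty over shared categories.

    The penalty is the squared distance between source and target class
    means, an assumed stand-in for the adversarial losses in the paper.
    """
    src = partition_by_category(src_feats, src_labels)
    tgt = partition_by_category(tgt_feats, tgt_pseudo)
    shared = sorted(set(src) & set(tgt))
    if not shared:
        return 0.0
    total = 0.0
    for category in shared:
        ms, mt = mean(src[category]), mean(tgt[category])
        total += sum((a - b) ** 2 for a, b in zip(ms, mt))
    return total / len(shared)

# Hypothetical 2-D features: category 0 is shifted in the target domain.
loss = categorywise_alignment(
    [[0.0, 0.0], [2.0, 2.0]], [0, 1],
    [[1.0, 0.0], [2.0, 2.0]], [0, 1],
)
```

Categories that appear in only one domain are skipped in this sketch, which is a design choice of the example rather than something stated in the abstract.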

Multi-Branch Convolution Network for Few-Shot Classification

Jie Hua
Xueliang Liu

Few-shot learning aims to perform classification with only a small number of samples.
Among few-shot learning frameworks, the relation network is an end-to-end method that
can identify new categories from a small number of labeled samples based on metric
learning. However, this method uses a simple feature extractor, which limits
further improvement of the classification accuracy. To solve this problem, this
paper proposes a multi-branch convolution network for feature extraction. This method
combines the training strategies of multi-scale feature extraction, the relation network,
the receptive field block, and meta-learning. First, multi-scale feature vectors
of the input image are extracted by the multi-branch convolution network. Then the
feature vectors from the support set and the prediction set are fed into the relation
model, while the receptive field block is employed to improve the measurement ability
of the network. Finally, the testing samples are classified according
to the similarity score. The effectiveness of the proposed model is
verified on the Omniglot and MiniImageNet datasets. The experimental results show that
the classification accuracy of the proposed model is higher than that of other classical
few-shot learning models.
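
The final step, classifying a test sample by its similarity score against the support set, can be sketched as follows. A relation network learns its similarity function; this sketch substitutes a fixed cosine similarity against averaged support features, so it illustrates only the scoring-and-argmax step. The feature vectors and helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_prototypes(support):
    """Average the support feature vectors of each class."""
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in support.items()
    }

def classify(query, support):
    """Return the best-scoring class and all similarity scores.

    Cosine similarity is an assumed stand-in for the learned relation
    module described in the abstract.
    """
    prototypes = class_prototypes(support)
    scores = {label: cosine(query, p) for label, p in prototypes.items()}
    return max(scores, key=scores.get), scores

# Hypothetical 2-way 2-shot episode with 2-D features.
support = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
label, scores = classify([0.8, 0.2], support)
```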
