MuSe'20: Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop

SESSION: Keynotes

Vehicle Interiors as Sensate Environments

  • Michael Würtenberger

The research field of biologically inspired and cognitive systems is currently gaining
increasing interest. However, modern vehicles and their architectures are still dominated
by traditional, engineered systems. This talk will give an industrial perspective
on potential usage of biologically inspired systems and cognitive architectures in
future vehicles.

A vehicle's interior can be considered a highly interactive sensate environment.
With the advent of highly automated driving, even more emphasis will be on this smart
space and the corresponding user experience. New interior layouts become possible,
with the attention shifting from the driver to the wellbeing and comfort of rider
passengers in highly reconfigurable interior layouts. Tactile intelligence in particular
will add an exciting new modality and help address challenges of safe human-robot

By focusing on opportunities for such approaches but also by pointing out challenges
with respect to industrial requirements, the goal of this talk is to initiate and
stimulate discussions regarding integration of cognitive systems in future vehicle

Personalized Machine Learning for Human-centered Machine Intelligence

  • Ognjen Oggi Rudovic

Recent developments in AI and Machine Learning (ML) are revolutionizing traditional
technologies for health and education by enabling more intelligent therapeutic and
learning tools that can automatically perceive and predict a user's behavior (e.g. from
videos) or health status from the user's past clinical data.

To date, most of these tools still rely on the traditional 'one-size-fits-all' ML paradigm,
rendering generic learning algorithms that, in most cases, are suboptimal on the individual
level, mainly because of the large heterogeneity of the target population. Furthermore,
such an approach may provide misleading outcomes as it fails to account for the context in
which target behaviors/clinical data are being analyzed. This calls for new human-centered
machine intelligence enabled by ML algorithms that are tailored to each individual
and context under study.

In this talk, I will present the key ideas and applications of the Personalized Machine
Learning (PML) framework, specifically designed to tackle those challenges. The applications
range from personalized forecasting of Alzheimer's-related cognitive decline, using
Gaussian Process models, to Personalized Deep Neural Networks, designed for classification
of facial affect of typical individuals using the notion of meta-learning and reinforcement
learning. I will then describe in more detail how this framework can be used to tackle
the challenging problem of robot perception of affect and engagement in autism therapy.
Lastly, I will discuss the future research on PML and human-centered ML design, outlining
challenges and opportunities.

SESSION: Invited Talks

Multimodal Social Media Mining

  • Ioannis Kompatsiaris

Social media have transformed the Web into an interactive sharing platform where users
upload data and media, comment on, and share this content within their social circles.
The large-scale availability of user-generated content in social media platforms has
opened up new possibilities for studying and understanding real-world phenomena, trends
and events. The objective of this talk is to provide an overview of social media mining,
which offers a unique opportunity to discover, collect, and extract relevant information
in order to provide useful insights. It will include key challenges and issues, such
as fighting misinformation, data collection, analysis and visualization components,
applications, results and demonstrations from multiple areas ranging from news to
environmental and security ones.

Extending Multimodal Emotion Recognition with Biological Signals: Presenting a Novel Dataset and Recent Findings

  • Alice Baird

Multimodal fusion has shown great promise in recent literature, particularly for audio-dominant
tasks. In this talk, we outline the findings from a recently developed multimodal
dataset, and discuss the promise of fusing biological signals with speech for continuous
recognition of the emotional dimensions of valence and arousal in the context of public
speaking. We also discuss the advantage of cross-language (German and
English) analysis by training language-independent models and testing them on speech
from various native and non-native groupings. For the emotion recognition task used
as a case study, a Long Short-Term Memory - Recurrent Neural Network (LSTM-RNN) architecture
with a self-attention layer is used.

End2You: Multimodal Profiling by End-to-End Learning and Applications

  • Panagiotis Tzirakis

Multimodal profiling is a fundamental component towards a complete interaction between
human and machine. This is an important task for intelligent systems as they can automatically
sense and adapt their responses according to human behavior. Over the last 10 years,
several advancements have been accomplished with the use of Deep Neural Networks (DNNs)
in several areas including but not limited to affect recognition[1,2]. Convolutional
and recurrent neural networks are core components of DNNs that have been extensively
used to extract robust spatial and temporal features, respectively. To this end, we
introduce End2You[3], an open-source toolkit implemented in Python and based on TensorFlow.
It provides capabilities to train and evaluate models in an end-to-end manner, i.e.,
using raw input. It supports input from raw audio, visual, physiological or other
types of information, and the output can be of an arbitrary representation, for either
classification or regression tasks. Well-known audio- and visual-model implementations
are provided, including ResNet[4] and MobileNet[5]. It can also capture the temporal
dynamics in the signal, utilizing recurrent neural networks such as Long Short-Term
Memory (LSTM). The toolkit also provides pretrained unimodal and multimodal models
for the emotion recognition task using the RECOLA dataset[6]. To our knowledge, this
is the first toolkit that provides generic end-to-end learning for profiling capabilities
in either unimodal or multimodal cases. We present results of the toolkit on the RECOLA
dataset and show how it can be used on different datasets.

SESSION: Paper Presentations

Unsupervised Representation Learning with Attention and Sequence to Sequence Autoencoders
to Predict Sleepiness From Speech

  • Shahin Amiriparian
  • Pawel Winokurow
  • Vincent Karas
  • Sandra Ottl
  • Maurice Gerczuk
  • Björn Schuller

Motivated by the attention mechanism of the human visual system and recent developments
in the field of machine translation, we introduce our attention-based and recurrent
sequence to sequence autoencoders for fully unsupervised representation learning from
audio files. In particular, we test the efficacy of our novel approach on the task
of speech-based sleepiness recognition. We evaluate the learnt representations from
both autoencoders, and conduct an early fusion to ascertain possible complementarity
between them. In our frameworks, we first extract Mel-spectrograms from raw audio.
Second, we train recurrent autoencoders on these spectrograms which are considered
as time-dependent frequency vectors. Afterwards, we extract the activations of specific
fully connected layers of the autoencoders which represent the learnt features of
spectrograms for the corresponding audio instances. Finally, we train support vector
regressors on these representations to obtain the predictions. On the development
partition of the data, we achieve Spearman's correlation coefficients of .324, .283,
and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention
and non-attention autoencoders, and the fusion of both autoencoders' representations,
respectively. In the same order, we achieve .311, .359, and .367 Spearman's correlation
coefficients on the test data, indicating the suitability of our proposed fusion strategy.
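
The evaluation metric reported above can be illustrated with a minimal Python sketch, which computes Spearman's correlation coefficient as the Pearson correlation of average ranks. This is not code from the paper: the function names are hypothetical, and the sketch assumes both sequences have equal length and are not constant.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predictions, targets):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predictions), average_ranks(targets))
```

Because only ranks enter the computation, any monotonic rescaling of the regressor's outputs leaves the score unchanged, which suits ordinal targets such as a sleepiness scale.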

Multi-modal Fusion for Video Sentiment Analysis

  • Ruichen Li
  • Jinming Zhao
  • Jingwen Hu
  • Shuai Guo
  • Qin Jin

Automatic sentiment analysis can support revealing a subject's emotional state and
opinion tendency toward an entity. In this paper, we present our solutions for the
MuSe-Wild sub-challenge of Multimodal Sentiment Analysis in Real-life Media (MuSe)
2020. The videos in this challenge are emotional car reviews collected from YouTube.
In these scenarios, the speaker's sentiment can be conveyed in different modalities
including acoustic, visual, and textual modalities. Due to the complementarity of
different modalities, the fusion of the multiple modalities has a large impact on
sentiment analysis. In this paper, we highlight two aspects of our solutions: 1) we
explore various low-level and high-level features from different modalities for emotional
state recognition, such as expert-defined low-level descriptors (LLD) and deep learned
features, etc. 2) we propose several effective multi-modal fusion strategies to make
full use of the different modalities. Our solutions achieve the best CCC performance
of 0.4346 and 0.4513 on arousal and valence respectively on the challenge testing
set, which significantly outperforms the baseline system with corresponding CCC of
0.2843 and 0.2413 on arousal and valence. The experimental results show that our proposed
representations of different modalities and fusion strategies have
a strong generalization ability and can bring more robust performance.
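
The abstract does not specify the fusion strategies used. As a generic illustration only, the following Python sketch contrasts two common options: early fusion, which concatenates per-frame feature vectors, and late fusion, which averages per-modality predictions. The function names, the equal-weight default, and the assumption that all modalities are already aligned frame by frame are hypothetical.

```python
def early_fusion(*modalities):
    """Concatenate aligned per-frame feature vectors from each modality.

    Each modality is a list of frames; each frame is a list of floats.
    """
    if len({len(m) for m in modalities}) != 1:
        raise ValueError("all modalities must have the same number of frames")
    fused = []
    for frames in zip(*modalities):
        joined = []
        for frame in frames:
            joined.extend(frame)
        fused.append(joined)
    return fused


def late_fusion(predictions, weights=None):
    """Weighted average of per-modality prediction sequences.

    predictions holds one sequence of scalar predictions per modality.
    Equal weights are an assumption made for this sketch.
    """
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    return [
        sum(w * p for w, p in zip(weights, step))
        for step in zip(*predictions)
    ]
```

Early fusion lets a single model learn cross-modal interactions, while late fusion keeps the per-modality models independent and only combines their outputs.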

Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network
and Self-Attention Mechanism

  • Licai Sun
  • Zheng Lian
  • Jianhua Tao
  • Bin Liu
  • Mingyue Niu

Automatic perception and understanding of human emotion or sentiment has a wide range
of applications and has attracted increasing attention. The Multimodal Sentiment
Analysis in Real-life Media (MuSe) 2020 challenge provides a test bed for recognizing human
emotion or sentiment from multiple modalities (audio, video, and text) in an in-the-wild
scenario. In this paper, we present our solutions to the MuSe-Wild sub-challenge of
MuSe 2020. The goal of this sub-challenge is to perform continuous emotion (arousal
and valence) predictions on a car review database, MuSe-CaR. To this end, we first
extract both handcrafted features and deep representations from multiple modalities.
Then, we utilize the Long Short-Term Memory (LSTM) recurrent neural network as well
as the self-attention mechanism to model the complex temporal dependencies in the
sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide
the model to learn local variations and the global trend of emotion simultaneously.
Finally, two fusion strategies, early fusion and late fusion, are adopted to further
boost the model's performance by exploiting complementary information from different
modalities. Our proposed method achieves CCC of 0.4726 and 0.5996 for arousal and
valence respectively on the test set, which outperforms the baseline system with corresponding
CCC of 0.2834 and 0.2431.
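
For readers unfamiliar with the metric, the Concordance Correlation Coefficient can be sketched in a few lines of Python. The `ccc` function follows the standard definition (twice the covariance, divided by the sum of the two variances and the squared difference of the means). Using `1 - CCC` as the training loss is a common choice and is assumed here, since the abstract does not state the exact form used by the authors. The function names are hypothetical.

```python
def ccc(predictions, targets):
    """Concordance Correlation Coefficient of two equal-length sequences.

    Uses population (biased) variances and covariance. The result is
    undefined (division by zero) if both sequences are constant and equal.
    """
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum(
        (p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)
    ) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def ccc_loss(predictions, targets):
    """Assumed loss form: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(predictions, targets)
```

Unlike Pearson correlation, the CCC penalizes differences in mean and scale, so a prediction that tracks the trend but is shifted by a constant offset scores below 1.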

MuSe 2020 Challenge and Workshop: Multimodal Sentiment Analysis, Emotion-target Engagement and Trustworthiness Detection
in Real-life Media: Emotional Car Reviews in-the-wild

  • Lukas Stappen
  • Alice Baird
  • Georgios Rizos
  • Panagiotis Tzirakis
  • Xinchen Du
  • Felix Hafner
  • Lea Schumann
  • Adria Mallol-Ragolta
  • Bjoern W. Schuller
  • Iulia Lefter
  • Erik Cambria
  • Ioannis Kompatsiaris

Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based
Workshop focusing on the tasks of sentiment recognition, as well as emotion-target
engagement and trustworthiness detection by means of more comprehensively integrating
the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together
communities from different disciplines; mainly, the audio-visual emotion recognition
community (signal-based), and the sentiment analysis community (symbol-based). We
present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion
(arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific
topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which
the novel aspect of trustworthiness is to be predicted. In this paper, we provide
detailed information on MuSe-CaR, the first-of-its-kind in-the-wild database, which
is utilised for the challenge, as well as the state-of-the-art features and modelling
approaches applied. For each sub-challenge, a competitive baseline for participants
is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC
of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on
the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust
a CCC of .4359.
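
The MuSe-Topic score stated above (0.34 * UAR + 0.66 * F1) can be sketched in Python as follows. UAR is computed here as the mean of per-class recalls over the classes present in the gold labels. The F1 value is taken as an input because the abstract does not say which averaging it uses. The function names and the example labels are hypothetical.

```python
def unweighted_average_recall(gold, predicted):
    """Mean of per-class recall over the classes present in the gold labels."""
    recalls = []
    for cls in sorted(set(gold)):
        total = sum(1 for g in gold if g == cls)
        hits = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def muse_topic_score(uar, f1):
    """Combined score as given in the abstract: 0.34 * UAR + 0.66 * F1."""
    return 0.34 * uar + 0.66 * f1


# Hypothetical 3-class example, for illustration only.
gold_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 0, 1, 0, 2, 1]
example_uar = unweighted_average_recall(gold_labels, predicted_labels)
```

Both inputs to `muse_topic_score` should be on the same scale, either fractions in [0, 1] or percentages, so that the combined score stays on that scale.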

AAEC: An Adversarial Autoencoder-based Classifier for Audio Emotion Recognition

  • Changzeng Fu
  • Jiaqi Shi
  • Chaoran Liu
  • Carlos Toshinori Ishi
  • Hiroshi Ishiguro

In recent years, automatic emotion recognition has attracted the attention of researchers
because of its strong impact and wide use in supporting human activities.
Because emotion data is more difficult to collect and organize into a large
database than text or image data, the training set is unlikely to cover the true
distribution completely, which affects the model's robustness
and generalization in subsequent applications. In this paper, we propose a model,
Adversarial Autoencoder-based Classifier (AAEC), that can not only augment the data
within the real data distribution but also reasonably extend the boundary of the current
data distribution to a possible space. Such an extended space should better fit
the distribution of the training and testing sets. In addition to comparing with baseline
models, we modified our proposed model into different configurations and conducted
a comprehensive self-comparison on the audio modality. The results of our experiment
show that our proposed model outperforms the baselines.