MuSe '21: Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge



SESSION: Keynotes

Getting Really Wild: Challenges and Opportunities of Real-World Multimodal Affect Detection

  • Sidney D'Mello

Affect detection in the "real" wild - where people go about their daily routines in their homes and workplaces - is arguably a different problem than affect detection in the lab or in the "quasi" wild (e.g., YouTube videos). How will our affect detection systems hold up when put to the test in the real wild? Some in the Affective Computing community had an opportunity to address this question as part of the MOSAIC (Multimodal Objective Sensing to Assess Individuals with Context [1]) program which ran from 2017 to 2020. Results were sobering, but informative. I'll discuss those efforts with an emphasis on performance achieved, insights gleaned, challenges faced, and lessons learned.

New Directions in Emotion Theory

  • Panagiotis Tzirakis

Emotional intelligence is a fundamental component of a complete and natural interaction between human and machine. Towards this goal, several emotion theories have been exploited in the affective computing domain. Within these theories of emotion, there are two major approaches to characterizing emotional models: categorical models and dimensional models. Whereas categorical models posit a small set of basic emotions that are independent of race (e.g., Ekman's model), dimensional approaches suggest that emotions are not independent but related to one another in a systematic manner (e.g., the Circumplex of Affect). Although these models have dominated affective computing research, recent studies in emotion theory have shown that they capture only a small fraction of the variance in what people perceive. In this talk, I will present new directions in emotion theory that can better capture the emotional behavior of individuals. First, I will discuss the statistical analysis behind key emotions conveyed in human vocalizations, speech prosody, and facial expressions, and how these relate to conventional categorical and dimensional models. Based on these new emotional models, I will describe new datasets we have collected at Hume AI, and show the different patterns captured when training deep neural network models.

SESSION: Papers

The MuSe 2021 Multimodal Sentiment Analysis Challenge: Sentiment, Emotion, Physiological-Emotion, and Stress

  • Lukas Stappen
  • Alice Baird
  • Lukas Christ
  • Lea Schumann
  • Benjamin Sertolli
  • Eva-Maria Meßner
  • Erik Cambria
  • Guoying Zhao
  • Björn W. Schuller

Multimodal Sentiment Analysis (MuSe) 2021 is a challenge focusing on the tasks of sentiment and emotion, as well as physiological-emotion and emotion-based stress recognition, through more comprehensively integrating the audio-visual, language, and biological signal modalities. The purpose of MuSe 2021 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), the sentiment analysis community (symbol-based), and the health informatics community. We present four distinct sub-challenges: MuSe-Wilder and MuSe-Stress, which focus on continuous emotion (valence and arousal) prediction; MuSe-Sent, in which participants recognise five classes each for valence and arousal; and MuSe-Physio, in which the novel aspect of 'physiological-emotion' is to be predicted. For this year's challenge, we utilise the MuSe-CaR dataset, focusing on user-generated reviews, and introduce the Ulm-TSST dataset, which displays people in stressed dispositions. This paper also provides details on the state-of-the-art feature sets extracted from these datasets for utilisation by our baseline model, a Long Short-Term Memory Recurrent Neural Network. For each sub-challenge, a competitive baseline for participants is set; namely, on test, we report a Concordance Correlation Coefficient (CCC) of .4616 for MuSe-Wilder, .5088 for MuSe-Stress, and .4908 for MuSe-Physio. For MuSe-Sent, an F1 score of 32.82% is obtained.
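
Since CCC is the evaluation metric for three of the four sub-challenges, a minimal NumPy sketch of the metric is given below for reference; this is an illustration only, not the official challenge evaluation code.

    import numpy as np

    def concordance_correlation_coefficient(y_true, y_pred):
        """Concordance Correlation Coefficient between two 1-D sequences."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        mean_true, mean_pred = y_true.mean(), y_pred.mean()
        var_true, var_pred = y_true.var(), y_pred.var()
        covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
        return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

    # Perfect agreement yields a CCC of 1.0
    print(concordance_correlation_coefficient([0.1, 0.4, 0.8], [0.1, 0.4, 0.8]))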

Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model

  • Licai Sun
  • Mingyu Xu
  • Zheng Lian
  • Bin Liu
  • Jianhua Tao
  • Meng Wang
  • Yuan Cheng

With the proliferation of user-generated videos on online websites, it becomes particularly important to achieve automatic perception and understanding of human emotion/sentiment from these videos. In this paper, we present our solutions to the MuSe-Wilder and MuSe-Sent sub-challenges of the MuSe 2021 Multimodal Sentiment Analysis Challenge. MuSe-Wilder focuses on continuous emotion (i.e., arousal and valence) recognition, while the task of MuSe-Sent concentrates on discrete sentiment classification. To this end, we first extract a variety of features from three common modalities (i.e., audio, visual, and text), including both low-level handcrafted features and high-level deep representations from supervised/unsupervised pre-trained models. Then, a long short-term memory recurrent neural network combined with a self-attention mechanism is employed to model the complex temporal dependencies in the feature sequence. The concordance correlation coefficient (CCC) loss and F1-loss are used to guide continuous regression and discrete classification, respectively. To further boost the model's performance, we adopt late fusion to exploit complementary information from different modalities. Our proposed method achieves CCCs of 0.4117 and 0.6649 for arousal and valence, respectively, on the test set of MuSe-Wilder, which outperforms the baseline system (i.e., 0.3386 and 0.5974) by a large margin. For MuSe-Sent, F1-scores of 0.3614 and 0.4451 for arousal and valence are obtained, which also outperform the baseline system significantly (i.e., 0.3512 and 0.3291). With these promising results, we ranked in the top 3 in both sub-challenges.
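
The paper's exact architecture is not reproduced here; purely as an illustration of the approach described above, a minimal PyTorch sketch of an LSTM regressor with self-attention over its hidden states and a CCC loss might look as follows (layer sizes and class names are assumptions, not the authors' code).

    import torch
    import torch.nn as nn

    class AttentiveLSTMRegressor(nn.Module):
        """LSTM encoder, self-attention over time steps, per-step regression head."""
        def __init__(self, feat_dim=256, hidden_dim=128, n_heads=4):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden_dim, n_heads, batch_first=True)
            self.head = nn.Linear(2 * hidden_dim, 1)

        def forward(self, x):                 # x: (batch, time, feat_dim)
            h, _ = self.lstm(x)               # (batch, time, 2 * hidden_dim)
            h, _ = self.attn(h, h, h)         # self-attention across time steps
            return self.head(h).squeeze(-1)   # (batch, time) continuous predictions

    def ccc_loss(pred, gold):
        """1 - CCC, a common training loss for continuous emotion regression."""
        pred_mean, gold_mean = pred.mean(), gold.mean()
        covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
        ccc = 2 * covariance / (pred.var(unbiased=False) + gold.var(unbiased=False)
                                + (pred_mean - gold_mean) ** 2)
        return 1.0 - ccc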

Multi-modal Fusion for Continuous Emotion Recognition by Using Auto-Encoders

  • Salam Hamieh
  • Vincent Heiries
  • Hussein Al Osman
  • Christelle Godin

Human stress detection is of great importance for monitoring mental health. The Multimodal Sentiment Analysis Challenge (MuSe) 2021 focuses on emotion, physiological-emotion, and stress recognition as well as sentiment classification by exploiting several modalities. In this paper, we present our solution for the MuSe-Stress sub-challenge. The target of this sub-challenge is the continuous prediction of arousal and valence for people under stressful conditions, for which text transcripts, audio, and video recordings are provided. To this end, we utilize bidirectional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks to explore high-level and low-level features from different modalities. We employ the Concordance Correlation Coefficient (CCC) as the loss function and evaluation metric for our model. To improve the unimodal predictions, we add difficulty indicators of the data obtained by using auto-encoders. Finally, we perform late fusion on our unimodal predictions together with the difficulty indicators to obtain our final predictions. With this approach, we achieve CCCs of 0.4278 and 0.5951 for arousal and valence, respectively, on the test set; our submission to MuSe 2021 ranks in the top three for arousal, fourth for valence, and in the top three for the combined results.
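
As an illustration of the idea of auto-encoder-based difficulty indicators combined with late fusion (the authors' exact design is not given in the abstract, so the shapes, names, and the least-squares fusion below are assumptions), a minimal PyTorch/NumPy sketch could look like this.

    import numpy as np
    import torch
    import torch.nn as nn

    class FeatureAutoEncoder(nn.Module):
        """Small auto-encoder; its reconstruction error serves as a per-frame 'difficulty' score."""
        def __init__(self, feat_dim=64, bottleneck=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU())
            self.decoder = nn.Linear(bottleneck, feat_dim)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def difficulty_indicator(model, features):
        """Mean squared reconstruction error per frame."""
        with torch.no_grad():
            recon = model(features)
        return ((features - recon) ** 2).mean(dim=-1).numpy()   # (time,)

    def late_fusion_weights(unimodal_preds, difficulty, gold):
        """Fit a linear combination of unimodal predictions plus the difficulty score on dev data."""
        X = np.column_stack(list(unimodal_preds) + [difficulty, np.ones_like(difficulty)])
        weights, *_ = np.linalg.lstsq(X, gold, rcond=None)
        return weights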

Hybrid Multimodal Fusion for Dimensional Emotion Recognition

  • Ziyu Ma
  • Fuyan Ma
  • Bin Sun
  • Shutao Li

In this paper, we present in detail our solutions for the MuSe-Stress sub-challenge and the MuSe-Physio sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2021. The goal of the MuSe-Stress sub-challenge is to predict the level of emotional arousal and valence in a time-continuous manner from audio-visual recordings, and the goal of the MuSe-Physio sub-challenge is to predict the level of psycho-physiological arousal from a) human annotations fused with b) galvanic skin response (also known as Electrodermal Activity, EDA) signals of the stressed people. Both sub-challenges use the Ulm-TSST dataset, a novel subset of the audio-visual-textual Ulm-Trier Social Stress dataset, which features German speakers in a stress situation induced by the Trier Social Stress Test (TSST). For the MuSe-Stress sub-challenge, we highlight our solution in three aspects: 1) audio-visual features and bio-signal features are used for emotional state recognition; 2) a Long Short-Term Memory (LSTM) network with a self-attention mechanism is utilized to capture complex temporal dependencies within the feature sequences; 3) a late fusion strategy is adopted to further boost the model's recognition performance by exploiting complementary information scattered across the multimodal sequences. Our proposed model achieves CCCs of 0.6159 and 0.4609 for valence and arousal, respectively, on the test set, both of which rank in the top 3. For the MuSe-Physio sub-challenge, we first extract audio-visual features and bio-signal features from multiple modalities. Then, an LSTM module with a self-attention mechanism, as well as Gated Convolutional Neural Networks (GCNN) combined with an LSTM network, are utilized for modeling the complex temporal dependencies in the sequence. Finally, a late fusion strategy is used. Our proposed method achieves a CCC of 0.5412 on the test set, which also ranks in the top 3.
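
GCNN in the abstract presumably denotes gated convolutional layers; as a generic illustration of a gated convolution block (a convolution modulated by a learned sigmoid gate), with assumed dimensions and not the authors' implementation:

    import torch
    import torch.nn as nn

    class GatedConv1d(nn.Module):
        """Gated convolution: the convolution output is modulated by a learned sigmoid gate."""
        def __init__(self, in_channels=64, out_channels=64, kernel_size=3):
            super().__init__()
            padding = kernel_size // 2
            self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)
            self.gate = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)

        def forward(self, x):             # x: (batch, channels, time)
            return self.conv(x) * torch.sigmoid(self.gate(x))

    # Dummy feature sequence: batch of 2, 64-dimensional features, 300 time steps
    features = torch.randn(2, 64, 300)
    out = GatedConv1d()(features)         # (2, 64, 300)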

Multi-modal Stress Recognition Using Temporal Convolution and Recurrent Network with Positional Embedding

  • Anh-Quang Duong
  • Ngoc-Huynh Ho
  • Hyung-Jeong Yang
  • Guee-Sang Lee
  • Soo-Hyung Kim

Chronic stress contributes to cancer, cardiovascular disease, depression, and diabetes; it is therefore profoundly harmful to physiological and psychological health. Various works have examined ways to identify, prevent, and manage people's stress conditions by using deep learning techniques. The 2nd Multimodal Sentiment Analysis Challenge (MuSe 2021) provides a test bed for recognizing human emotion in stressed dispositions. In this study, we present our proposal for the MuSe-Stress sub-challenge of MuSe 2021. Several modalities are provided, including frontal frame sequences, audio signals, and transcripts. Our model uses temporal convolutions and a recurrent network with positional embeddings. As a result, our model achieved a concordance correlation coefficient of 0.5095, the average of valence and arousal. Moreover, we ranked 3rd in this competition under the team name CNU_SCLab.
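
A minimal PyTorch sketch of a model combining temporal convolutions, a learned positional embedding, and a recurrent (GRU) layer is given below; layer sizes, the choice of GRU, and the class name are illustrative assumptions, not the authors' architecture.

    import torch
    import torch.nn as nn

    class TCNRecurrentRegressor(nn.Module):
        """Temporal convolution front-end, additive positional embedding, GRU, regression head."""
        def __init__(self, feat_dim=128, hidden_dim=64, max_len=2000):
            super().__init__()
            self.tcn = nn.Sequential(
                nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.pos_emb = nn.Embedding(max_len, hidden_dim)
            self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 2)                  # valence and arousal per step

        def forward(self, x):                                     # x: (batch, time, feat_dim)
            h = self.tcn(x.transpose(1, 2)).transpose(1, 2)       # (batch, time, hidden_dim)
            positions = torch.arange(h.size(1), device=h.device)
            h = h + self.pos_emb(positions)                       # add learned positional embedding
            h, _ = self.gru(h)
            return self.head(h)                                   # (batch, time, 2)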

Multimodal Fusion Strategies for Physiological-emotion Analysis

  • Tenggan Zhang
  • Zhaopei Huang
  • Ruichen Li
  • Jinming Zhao
  • Qin Jin

Physiological-emotion analysis is a novel aspect of automatic emotion analysis. It can help reveal a subject's emotional state even when he/she consciously suppresses its expression. In this paper, we present our solutions for the MuSe-Physio sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2021. The aim of this task is to predict the level of psycho-physiological arousal from combined audio-visual signals and the galvanic skin response (also known as Electrodermal Activity) of subjects in a highly stress-induced free speech scenario. In this scenario, the speaker's emotion can be conveyed through different modalities, including the acoustic, visual, textual, and physiological signal modalities. Because the modalities are complementary, how they are fused has a large impact on emotion analysis. In this paper, we highlight two aspects of our solutions: 1) we explore various efficient low-level and high-level features from different modalities for this task, and 2) we propose two effective multi-modal fusion strategies to make full use of the different modalities. Our solutions achieve the best CCC performance of 0.5728 on the challenge test set, which significantly outperforms the baseline system's CCC of 0.4908. The experimental results show that the proposed features and fusion strategies generalize well and yield more robust performance.
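
The abstract does not spell out the two fusion strategies; as a generic illustration of the two most common families (feature-level and decision-level fusion, with assumed inputs and development-set weighting), consider the following sketch.

    import numpy as np

    # Strategy A: feature-level (early) fusion - concatenate per-frame features
    def early_fusion(audio_feat, video_feat, text_feat, eda_feat):
        return np.concatenate([audio_feat, video_feat, text_feat, eda_feat], axis=-1)

    # Strategy B: decision-level (late) fusion - weighted average of unimodal predictions,
    # with weights proportional to each model's CCC on the development set
    def late_fusion(predictions, dev_cccs):
        weights = np.asarray(dev_cccs, dtype=float)
        weights = weights / weights.sum()
        return np.average(np.stack(predictions), axis=0, weights=weights)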

Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition

  • Bogdan Vlasenko
  • RaviShankar Prasad
  • Mathew Magimai.-Doss

Automatic recognition of human emotion has a wide range of applications and has attracted increasing attention. Expressions of human emotion can be identified across different modalities of communication, such as speech, text, facial expressions, etc. The Multimodal Sentiment Analysis (MuSe) 2021 challenge provides an environment for developing new techniques to recognize human emotions or sentiments using multiple modalities (audio, video, and text) over in-the-wild data. The challenge encourages participants to jointly model the information across the audio, video, and text modalities to improve emotion recognition. The present paper describes our attempt at the MuSe-Sent task of the challenge. The goal of this sub-challenge is turn-level prediction of emotions in the arousal and valence dimensions. In the paper, we investigate different approaches to optimally fuse linguistic and acoustic information for emotion recognition systems. The proposed systems employ features derived from these modalities and use different deep learning architectures to explore their cross-dependencies. A wide range of acoustic and linguistic features provided by the organizers, as well as the recently established wav2vec 2.0 acoustic embeddings, are used for modeling the inherent emotions. We compare the discriminative characteristics of hand-crafted and data-driven acoustic features in the context of emotion classification along the arousal and valence dimensions. Ensemble-based classifiers are compared with a supervised autoencoder (SAE) technique tuned with a Bayesian hyperparameter optimization approach. A comparison of uni- and bi-modal classification techniques showed that joint modeling of acoustic and linguistic cues can improve classification performance compared to the individual modalities. Experimental results on the test set show an improvement over the proposed baseline system through the fusion of acoustic and text-based information.
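
As a sketch of the supervised autoencoder idea (a shared latent code trained jointly for reconstruction and classification), the following minimal PyTorch example may help; dimensions, the loss weighting, and class names are assumptions rather than the authors' setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SupervisedAutoencoder(nn.Module):
        """Autoencoder whose latent code also feeds a classifier; trained with a joint loss."""
        def __init__(self, feat_dim=512, latent_dim=64, n_classes=5):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
            self.decoder = nn.Linear(latent_dim, feat_dim)
            self.classifier = nn.Linear(latent_dim, n_classes)

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), self.classifier(z)

    def sae_loss(x, reconstruction, logits, labels, alpha=0.5):
        """Weighted sum of reconstruction and classification losses."""
        recon_loss = F.mse_loss(reconstruction, x)
        clf_loss = F.cross_entropy(logits, labels)
        return alpha * recon_loss + (1.0 - alpha) * clf_loss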

Multimodal Sentiment Analysis based on Recurrent Neural Network and Multimodal Attention

  • Cong Cai
  • Yu He
  • Licai Sun
  • Zheng Lian
  • Bin Liu
  • Jianhua Tao
  • Mingyu Xu
  • Kexin Wang

Automatic estimation of emotional state has wide applications in human-computer interaction. In this paper, we present our solutions for the MuSe-Stress and MuSe-Physio sub-challenges of the Multimodal Sentiment Analysis Challenge (MuSe 2021). The goal of these two sub-challenges is to perform continuous emotion prediction for people in stressed dispositions. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we explore a Long Short-Term Memory network and a Transformer encoder with multimodal multi-head attention to model the complex temporal dependencies in the sequence. Finally, we adopt early fusion, late fusion, and model fusion to boost the model's performance by exploiting complementary information from different modalities. Our method achieves CCCs of 0.6648, 0.3054, and 0.5781 for valence, arousal, and arousal plus EDA (anno12_EDA), respectively. The results for valence and anno12_EDA outperform the baseline system (CCCs of 0.5614 and 0.4908, respectively), and both rank in the top 3 of these sub-challenges.
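
A building block of such a Transformer encoder with multimodal multi-head attention is cross-modal attention, where one modality queries another; a minimal PyTorch sketch is shown below (dimensions, names, and the residual/normalization choices are illustrative assumptions, not the authors' code).

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """One modality attends to another via multi-head attention (a Transformer-style block)."""
        def __init__(self, dim=128, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, query_mod, context_mod):
            # query_mod, context_mod: (batch, time, dim), e.g. audio attending to video
            attended, _ = self.attn(query_mod, context_mod, context_mod)
            return self.norm(query_mod + attended)        # residual connection + layer norm

    audio = torch.randn(2, 300, 128)
    video = torch.randn(2, 300, 128)
    fused = CrossModalAttention()(audio, video)            # (2, 300, 128)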

A Physiologically-Adapted Gold Standard for Arousal during Stress

  • Alice Baird
  • Lukas Stappen
  • Lukas Christ
  • Lea Schumann
  • Eva-Maria Messner
  • Björn W. Schuller

Emotion is an inherently subjective psycho-physiological human state, and producing an agreed-upon representation (gold standard) for continuously perceived emotion requires time-consuming and costly training of multiple human annotators. With this in mind, there is strong evidence in the literature that physiological signals are an objective marker for states of emotion, particularly arousal. In this contribution, we utilise a multimodal dataset captured during a Trier Social Stress Test to explore the benefit of fusing physiological signals - heartbeats per minute (BPM), Electrodermal Activity (EDA), and respiration rate - for recognition of continuously perceived arousal, utilising a Long Short-Term Memory Recurrent Neural Network architecture and various audio-, video-, and text-based features. We use the MuSe-Toolbox to create a gold standard that considers annotator delay and agreement weighting. An improvement in Concordance Correlation Coefficient (CCC) is seen across feature sets when fusing EDA with arousal, compared to the arousal-only gold standard results. Additionally, results for BERT-based textual features improve for arousal plus all physiological signals, obtaining up to .3344 CCC (.2118 CCC for arousal only). Multimodal fusion also improves CCC: audio plus video features obtain up to .6157 CCC for arousal plus EDA and BPM.
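
The gold standard itself is created with the MuSe-Toolbox; purely to illustrate the underlying idea of combining an annotation-based arousal trace with a physiological signal, a NumPy sketch with an assumed resampling step and an assumed 50/50 weighting is given below (this is not the toolbox API).

    import numpy as np

    def fuse_arousal_with_physio(arousal, eda, weight=0.5):
        """Blend a fused arousal annotation with a z-normalised, resampled EDA trace.

        arousal: 1-D array at the annotation rate (e.g., 2 Hz)
        eda:     1-D array at the sensor rate, resampled here by linear interpolation
        """
        eda_resampled = np.interp(
            np.linspace(0.0, 1.0, len(arousal)),   # target time axis
            np.linspace(0.0, 1.0, len(eda)),       # source time axis
            eda,
        )
        z = lambda s: (s - s.mean()) / (s.std() + 1e-8)
        return weight * z(np.asarray(arousal, dtype=float)) + (1.0 - weight) * z(eda_resampled)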

MuSe-Toolbox: The Multimodal Sentiment Analysis Continuous Annotation Fusion and Discrete Class Transformation Toolbox

  • Lukas Stappen
  • Lea Schumann
  • Benjamin Sertolli
  • Alice Baird
  • Benjamin Weigell
  • Erik Cambria
  • Björn W. Schuller

We introduce the MuSe-Toolbox - a Python-based open-source toolkit for creating a variety of continuous and discrete emotion gold standards. In a single framework, we unify a wide range of fusion methods and propose the novel Rater Aligned Annotation Weighting (RAAW), which aligns the annotations in a translation-invariant way before weighting and fusing them based on the inter-rater agreement between the annotations. Furthermore, discrete categories tend to be easier for humans to interpret than continuous signals. With this in mind, the MuSe-Toolbox provides the functionality to run exhaustive searches for meaningful class clusters in the continuous gold standards. To our knowledge, this is the first toolkit that provides a wide selection of state-of-the-art emotional gold standard methods and their transformation to discrete classes. Experimental results indicate that the MuSe-Toolbox can provide promising and novel class formations which can be better predicted than hard-coded class boundaries, with minimal human intervention. The implementation is available out of the box, with all dependencies, via a Docker container.
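
To illustrate the kind of class-cluster search described above (not the MuSe-Toolbox API itself; the function name, the choice of summary statistics, and k-means are illustrative assumptions), a short scikit-learn sketch follows.

    import numpy as np
    from sklearn.cluster import KMeans

    def continuous_to_classes(gold_standard_segments, n_classes=5):
        """Cluster summary statistics of continuous gold-standard segments into discrete classes."""
        # One row per segment: mean and standard deviation of the continuous signal
        stats = np.array([[seg.mean(), seg.std()] for seg in gold_standard_segments])
        kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(stats)
        return kmeans.labels_                       # one class label per segment

    # Toy example: seven segments with varying spread mapped onto three classes
    segments = [np.random.randn(100) * s for s in (0.2, 0.5, 1.0, 1.5, 2.0, 0.3, 1.2)]
    print(continuous_to_classes(segments, n_classes=3))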