MuSe' 22: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge




SESSION: Keynote Talks

Uncovering the Nuanced Structure of Expressive Behavior Across Modalities

  • Alan Cowen

Guided by semantic space theory, large-scale computational studies have advanced our understanding of the structure and function of expressive behavior. I will integrate findings from experimental studies of facial expression (N=19,656), vocal bursts (N=12,616), speech prosody (N=20,109), multimodal reactions (N=8,056), and an ongoing study of dyadic interactions (N=1,000+). These studies combine methods from psychology and computer science to yield new insights into what expressive behaviors signal, how they are perceived, and how they shape social interaction. Using machine learning to extract cross-cultural dimensions of behavior while minimizing biases due to demographics and context, we arrive at objective measures of the structural dimensions that make up human expression. Expressions are consistently found to be high-dimensional and blended, with their meaning across cultures being efficiently conceptualized in terms of a wide range of specific emotion concepts. Altogether, these findings generate a comprehensive new atlas of expressive behavior, which I will explore through a variety of visualizations. This new taxonomy departs from models such as the basic six and affective circumplex, suggesting a new way forward for expression understanding and sentiment analysis.

The Dos and Don'ts of Affect Analysis

  • Shahin Amiriparian

As an inseparable and crucial component of communication, affects play a substantial role in human-device and human-human interaction. They convey information about a person's specific traits and states [1, 4, 5], how one feels about the aims of a conversation, the trustworthiness of one's verbal communication [3], and the degree of adaptation in interpersonal speech [2]. This multifaceted nature of human affects poses a great challenge when it comes to applying machine learning systems for their automatic recognition and understanding. Contemporary self-supervised learning architectures such as Transformers, which define the state of the art (SOTA) in this area, have shown noticeable deficits in terms of explainability, while more conventional, non-deep machine learning methods, which provide more transparency, often fall (far) behind SOTA systems. So, is it possible to get the best of these two 'worlds'? And more importantly, at what price? In this talk, I provide a set of Dos and Don'ts guidelines for addressing affective computing tasks with respect to (i) preserving privacy for affective data and individuals/groups, (ii) processing such data efficiently and transparently, (iii) ensuring reproducibility of the results, (iv) knowing the differences between causation and correlation, and (v) properly applying social and ethical protocols.

SESSION: Session 1: MuSe Challenge 2022

The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress

  • Lukas Christ
  • Shahin Amiriparian
  • Alice Baird
  • Panagiotis Tzirakis
  • Alexander Kathan
  • Niklas Müller
  • Lukas Stappen
  • Eva-Maria Meßner
  • Andreas König
  • Alan Cowen
  • Erik Cambria
  • Björn W. Schuller

The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, which contains audio-visual recordings of German football coaches labelled for the presence of humour; (ii) the Hume-Reaction dataset, in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities; and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, comprising audio-visual data of people in stressful dispositions labelled with continuous emotion values (arousal and valence). Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained 'in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for MuSe-Humor, a mean (over 7 classes) Pearson's Correlation Coefficient of .2801 for MuSe-Reaction, and Concordance Correlation Coefficients (CCC) of .4931 and .4761 for valence and arousal, respectively, in MuSe-Stress.
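For readers reimplementing the evaluation, the metrics above are standard; a minimal NumPy sketch of the Concordance Correlation Coefficient and of the mean Pearson's correlation over the seven reaction classes could look as follows (array shapes and the synthetic data are purely illustrative and not taken from the official baseline code):

```python
import numpy as np

def pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Pearson's correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(preds, labels)[0, 1])

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    """Concordance Correlation Coefficient (Lin, 1989)."""
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()
    cov = np.mean((preds - mean_p) * (labels - mean_l))
    return float(2 * cov / (var_p + var_l + (mean_p - mean_l) ** 2))

def mean_pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Mean Pearson's correlation over emotion classes (columns)."""
    return float(np.mean([pearson(preds[:, i], labels[:, i])
                          for i in range(preds.shape[1])]))

# Illustrative usage with random data (7 reaction classes).
rng = np.random.default_rng(0)
y_true = rng.random((1000, 7))
y_pred = y_true + 0.1 * rng.standard_normal((1000, 7))
print(mean_pearson(y_pred, y_true), ccc(y_pred[:, 0], y_true[:, 0]))
```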

Hybrid Multimodal Fusion for Humor Detection

  • Haojie Xu
  • Weifeng Liu
  • Jiangwei Liu
  • Mingzheng Li
  • Yu Feng
  • Yasi Peng
  • Yunwei Shi
  • Xiao Sun
  • Meng Wang

In this paper, we present our solution to the MuSe-Humor sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2022. The goal of the MuSe-Humor sub-challenge is to detect humour, evaluated via AUC, in audio-visual recordings of German Bundesliga football press conferences, which are annotated for humour displayed by the coaches. For this sub-challenge, we first build a discriminant model using a transformer module and a BiLSTM module, and then propose a hybrid fusion strategy that uses the prediction results of each modality to improve the performance of the model. Our experiments demonstrate the effectiveness of the proposed model and hybrid fusion strategy for multimodal fusion; the AUC of our proposed model on the test set is 0.8972.
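The abstract does not detail the fusion mechanics; as a hedged illustration of the general idea only (not the authors' implementation), a hybrid scheme might blend decision-level predictions from per-modality BiLSTM branches with the prediction of an early-fused branch. All module names, feature dimensions, and the blending weight below are assumptions:

```python
import torch
import torch.nn as nn

class UnimodalBranch(nn.Module):
    """Per-modality encoder: BiLSTM over a feature sequence, then a humour logit."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h.mean(dim=1))   # (batch, 1) logit

def hybrid_fusion(unimodal_logits, fused_logit, alpha=0.5):
    """Blend decision-level (mean of unimodal logits) and feature-level predictions."""
    late = torch.stack(unimodal_logits, dim=0).mean(dim=0)
    return alpha * late + (1 - alpha) * fused_logit

# Illustrative shapes: audio and video feature sequences for a batch of 8 clips.
audio, video = torch.randn(8, 100, 988), torch.randn(8, 100, 512)
a_branch, v_branch = UnimodalBranch(988), UnimodalBranch(512)
fused_branch = UnimodalBranch(988 + 512)   # early fusion on concatenated features
score = hybrid_fusion([a_branch(audio), v_branch(video)],
                      fused_branch(torch.cat([audio, video], dim=-1)))
print(torch.sigmoid(score).shape)          # (8, 1) humour probabilities
```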

Integrating Cross-modal Interactions via Latent Representation Shift for Multi-modal Humor Detection

  • Chengxin Chen
  • Pengyuan Zhang

Multi-modal sentiment analysis has been an active research area and has attracted increasing attention from multi-disciplinary communities. However, it is still challenging to fuse the information from different modalities in an efficient way. In prior studies, the late fusion strategy has been commonly adopted due to its simplicity and efficacy. Unfortunately, it fails to model the interactions across different modalities. In this paper, we propose a transformer-based hierarchical framework to effectively model both the intrinsic semantics and the cross-modal interactions of the relevant modalities. Specifically, the features from each modality are first encoded via standard transformers. Later, the cross-modal interactions from one modality to the other modalities are calculated using cross-modal transformers. The derived intrinsic semantics and cross-modal interactions are used to determine the latent representation shift of a particular modality. We evaluate the proposed approach on the MuSe-Humor sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2022. Experimental results show that an Area Under the Curve (AUC) of 0.9065 can be achieved on the test set of MuSe-Humor. With these promising results, our best submission ranked first in the sub-challenge.
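One way to picture the described latent representation shift is a cross-modal attention block whose output is gated and added back to a modality's self-attended representation; the following PyTorch sketch is an interpretation under that assumption, with all dimensions chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class RepresentationShift(nn.Module):
    """Shift one modality's latent representation using cross-modal attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        self.cross_att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, target, source):
        # Intrinsic semantics of the target modality (standard transformer layer).
        intrinsic = self.self_enc(target)                  # (B, T_t, dim)
        # Cross-modal interaction: target queries attend to the source modality.
        cross, _ = self.cross_att(query=intrinsic, key=source, value=source)
        # Gated shift of the target's latent representation.
        g = torch.sigmoid(self.gate(torch.cat([intrinsic, cross], dim=-1)))
        return intrinsic + g * cross

audio, text = torch.randn(4, 120, 256), torch.randn(4, 40, 256)
shifted_audio = RepresentationShift()(audio, text)
print(shifted_audio.shape)    # torch.Size([4, 120, 256])
```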

A Personalised Approach to Audiovisual Humour Recognition and its Individual-level Fairness

  • Alexander Kathan
  • Shahin Amiriparian
  • Lukas Christ
  • Andreas Triantafyllopoulos
  • Niklas Müller
  • Andreas König
  • Björn W. Schuller

Humour is one of the most subtle and contextualised behavioural patterns to study in social psychology and has a major impact on human emotions, social cognition, behaviour, and relations. Consequently, an automatic understanding of humour is crucial and challenging for a naturalistic human-robot interaction. Recent artificial intelligence (AI)-based methods have shown progress in multimodal humour recognition. However, such methods lack a mechanism for adapting to each individual's characteristics, resulting in decreased performance, e.g., due to differing facial expressions. Further, these models face generalisation problems when applied to the recognition of different styles of humour. We aim to address these challenges by introducing a novel multimodal humour recognition approach in which the models are personalised for each individual in the Passau Spontaneous Football Coach Humour (Passau-SFCH) dataset. We begin by training a model on all individuals in the dataset. Subsequently, we fine-tune all layers of this model with the data from each individual. Finally, we use these models for the prediction task. Using the proposed personalised models, it is possible to significantly (two-tailed t-test, p < 0.05) outperform the non-personalised models. In particular, the mean Area Under the Curve (AUC) is increased from .7573 to .7731 for the audio modality, and from .9203 to .9256 for the video modality. In addition, we apply a weighted late fusion approach which increases the overall performance to an AUC of .9308, demonstrating the complementarity of the features. Finally, we evaluate the individual-level fairness of our approach and show which group of subjects benefits most from personalisation.
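The personalisation recipe itself (train on everyone, then fine-tune every layer on each individual) is simple to express; the sketch below uses synthetic data and a toy classifier as stand-ins for the actual Passau-SFCH features and model, so only the two-stage procedure should be read as representative:

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_loader(n=64, dim=512):
    """Synthetic stand-in for one subject's (features, humour label) data."""
    return DataLoader(TensorDataset(torch.randn(n, dim),
                                    torch.randint(0, 2, (n,))), batch_size=16)

def train(model, loader, epochs, lr):
    """Generic training loop for binary humour logits (BCE on sigmoid outputs)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for feats, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(feats).squeeze(-1), labels.float())
            loss.backward()
            opt.step()
    return model

# 1) Train one model on data pooled over all individuals.
base = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
base = train(base, make_loader(n=256), epochs=2, lr=1e-3)

# 2) Personalise: fine-tune a copy of *all* layers on each subject's own data.
personalised = {subj: train(copy.deepcopy(base), make_loader(), epochs=2, lr=1e-4)
                for subj in ["coach_01", "coach_02"]}
```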

Comparing Biosignal and Acoustic Feature Representation for Continuous Emotion Recognition

  • Sarthak Yadav
  • Tilak Purohit
  • Zohreh Mostaani
  • Bogdan Vlasenko
  • Mathew Magimai.-Doss

Automatic recognition of human emotion has a wide range of applications. Human emotions can be identified across different modalities, such as biosignals, speech, text, and facial expressions. This paper focuses on the time-continuous prediction of the level of valence and psycho-physiological arousal. In that regard, we investigate (a) the use of different feature embeddings obtained from neural networks pre-trained on different speech tasks (e.g., phone classification, speech emotion recognition) and from self-supervised neural networks, (b) the estimation of arousal and valence from physiological signals in an end-to-end manner, and (c) the combination of different neural embeddings. Our investigations on the MuSe-Stress sub-challenge show that (a) the embeddings extracted from physiological signals using CNNs trained in an end-to-end manner improve over the baseline approach of modelling physiological signals, (b) neural embeddings obtained from a phone classification neural network and a speech emotion recognition neural network trained on auxiliary language data sets yield improvements over baseline systems purely trained on the target data, and (c) task-specific neural embeddings yield improved performance over self-supervised neural embeddings for both arousal and valence. Our best-performing system on the test set surpasses the DeepSpectrum baseline (combined score) by a relative margin of 7.7%.
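As a hedged illustration of the end-to-end modelling of physiological signals mentioned in (b), a small 1-D CNN can map raw biosignal channels to frame-level embeddings and time-continuous arousal/valence outputs; channel counts, layer sizes, and the timing assumptions below are illustrative rather than the authors' configuration:

```python
import torch
import torch.nn as nn

class BiosignalCNN(nn.Module):
    """1-D CNN that maps a raw physiological signal to frame-level embeddings
    and time-continuous arousal/valence predictions (sizes are illustrative)."""
    def __init__(self, in_channels: int = 3, emb_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, emb_dim, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.head = nn.Conv1d(emb_dim, 2, kernel_size=1)  # arousal, valence

    def forward(self, x):              # x: (batch, channels, time)
        emb = self.encoder(x)          # (batch, emb_dim, time) neural embedding
        return emb, self.head(emb)     # predictions: (batch, 2, time)

# Hypothetical batch: three biosignal channels over 3000 time steps.
signal = torch.randn(4, 3, 3000)
embedding, prediction = BiosignalCNN()(signal)
print(embedding.shape, prediction.shape)
```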

Towards Multimodal Prediction of Time-continuous Emotion using Pose Feature Engineering and a Transformer Encoder

  • Ho-min Park
  • Ilho Yun
  • Ajit Kumar
  • Ankit Kumar Singh
  • Bong Jun Choi
  • Dhananjay Singh
  • Wesley De Neve

MuSe-Stress 2022 aims at building sequence regression models for predicting the valence and physiological arousal levels of persons who are facing stressful conditions. To that end, audio-visual recordings, transcripts, and physiological signals can be leveraged. In this paper, we describe the approach we developed for MuSe-Stress 2022. Specifically, we engineered a new pose feature that captures the movement of human body keypoints. We also trained a Long Short-Term Memory (LSTM) network and a Transformer encoder on different types of feature sequences and different combinations thereof. In addition, we adopted a two-pronged strategy to tune the hyperparameters that govern the different ways the available features can be used. Finally, we made use of late fusion to combine the predictions obtained for the different unimodal features. Our experimental results show that the newly engineered pose feature obtains the second-highest development CCC among the seven unimodal features available. Furthermore, our Transformer encoder obtains the highest development CCC for five out of fourteen possible combinations of features and emotion dimensions, with this number increasing from five to nine when performing late fusion. In addition, when searching for optimal hyperparameter settings, our two-pronged hyperparameter tuning strategy leads to noticeable improvements in maximum development CCC, especially when the underlying models are based on an LSTM. In summary, our approach achieves a test CCC of 0.6196 and 0.6351 for arousal and valence, respectively, securing a Top-3 rank in MuSe-Stress 2022.
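A plausible reading of the engineered pose feature is the frame-to-frame displacement of detected body keypoints, fed to a Transformer encoder for time-continuous regression; the sketch below follows that reading, with keypoint counts and model sizes chosen purely for illustration:

```python
import torch
import torch.nn as nn

def pose_movement_features(keypoints: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame displacement of body keypoints.
    keypoints: (batch, time, num_keypoints, 2) -> (batch, time, num_keypoints * 2)"""
    disp = keypoints[:, 1:] - keypoints[:, :-1]
    disp = torch.cat([torch.zeros_like(disp[:, :1]), disp], dim=1)  # pad first frame
    return disp.flatten(start_dim=2)

class EmotionTransformer(nn.Module):
    """Transformer encoder regressing a time-continuous emotion dimension."""
    def __init__(self, feat_dim: int, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        return self.head(self.encoder(self.proj(x))).squeeze(-1)

# 17 COCO-style keypoints over 300 frames (illustrative shapes).
kps = torch.randn(2, 300, 17, 2)
valence = EmotionTransformer(17 * 2)(pose_movement_features(kps))
print(valence.shape)   # torch.Size([2, 300])
```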

Improving Dimensional Emotion Recognition via Feature-wise Fusion

  • Yiping Liu
  • Wei Sun
  • Xing Zhang
  • Yebao Qin

This paper introduces the solution of the RiHNU team for the MuSe-Stress sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2022. MuSe-Stress is a task to discern human emotional states from internal or external responses (e.g., audio, physiological signals, and facial expressions) in a job-interview setting. Multimodal learning is widely considered a viable approach for multimodal sentiment analysis tasks. However, most multimodal models fail to capture the associations among the modalities, resulting in limited generalizability. We argue that those methods are incapable of establishing discriminative features, mainly because they typically neglect fine-grained information. To address this problem, we first encode spatio-temporal features via a feature-wise fusion mechanism to learn more informative representations. Then we exploit a late fusion strategy to capture fine-grained relations between the modalities. An ensemble strategy is also used to enhance the final performance. Our method achieves CCCs of 0.6803 and 0.6689 for valence and physiological arousal, respectively, on the test set.
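The abstract leaves the feature-wise fusion mechanism unspecified; one common interpretation is FiLM-style channel-wise modulation of one modality's features by another, followed by decision-level (late) fusion of unimodal predictions. The sketch below shows that interpretation only, with hypothetical modalities, dimensions, and weights:

```python
import torch
import torch.nn as nn

class FeatureWiseFusion(nn.Module):
    """Feature-wise (channel-wise) modulation of visual features by biosignal
    features, in the spirit of FiLM; one possible reading of the method."""
    def __init__(self, vis_dim: int = 512, bio_dim: int = 64):
        super().__init__()
        self.scale = nn.Linear(bio_dim, vis_dim)
        self.shift = nn.Linear(bio_dim, vis_dim)

    def forward(self, visual, bio):            # (B, T, vis_dim), (B, T, bio_dim)
        return visual * torch.sigmoid(self.scale(bio)) + self.shift(bio)

def late_fusion(unimodal_preds, weights):
    """Decision-level fusion: softmax-weighted average of per-modality predictions."""
    w = torch.softmax(torch.tensor(weights), dim=0)
    return sum(wi * p for wi, p in zip(w, unimodal_preds))

visual, bio = torch.randn(2, 300, 512), torch.randn(2, 300, 64)
fused_feats = FeatureWiseFusion()(visual, bio)          # (2, 300, 512)
arousal = late_fusion([torch.randn(2, 300), torch.randn(2, 300)], [0.6, 0.4])
print(fused_feats.shape, arousal.shape)
```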

Multimodal Temporal Attention in Sentiment Analysis

  • Yu He
  • Licai Sun
  • Zheng Lian
  • Bin Liu
  • Jianhua Tao
  • Meng Wang
  • Yuan Cheng

In this paper, we present our solution to the MuSe-Stress sub-challenge of the MuSe 2022 Multimodal Sentiment Analysis Challenge. The task of MuSe-Stress is to predict time-continuous values (i.e., physiological arousal and valence) based on multimodal data comprising audio, visual, text, and physiological signals. In this competition, we find that multimodal fusion performs well for physiological arousal on the validation set, but poorly on the test set. We believe this problem may be due to over-fitting caused by the model's over-reliance on certain modality-specific features. To address this problem, we propose Multimodal Temporal Attention (MMTA), which considers the temporal effects of all modalities on each unimodal branch, realising interaction between the unimodal branches and an adaptive inter-modal balance. The concordance correlation coefficients (CCC) for physiological arousal and valence are 0.6818 with MMTA and 0.6841 with early fusion, respectively, both ranking Top 1 and outperforming the baseline system (i.e., 0.4761 and 0.4931) by a large margin on the test set.
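As a rough sketch of the idea behind temporal attention across modalities (not the published MMTA architecture), each unimodal branch below attends along the time axis to the temporally concatenated features of all modalities while keeping a residual connection to its own stream; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class MMTABlock(nn.Module):
    """Sketch of multimodal temporal attention: every unimodal branch attends,
    along the time axis, to the temporally concatenated features of all modalities."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, branches):                   # list of (B, T_m, dim) tensors
        context = torch.cat(branches, dim=1)       # all modalities along time
        updated = []
        for x in branches:
            out, _ = self.att(query=x, key=context, value=context)
            updated.append(self.norm(x + out))     # residual per unimodal branch
        return updated

audio = torch.randn(2, 300, 128)
video = torch.randn(2, 300, 128)
bio   = torch.randn(2, 300, 128)
audio_u, video_u, bio_u = MMTABlock()([audio, video, bio])
print(audio_u.shape)   # torch.Size([2, 300, 128])
```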

ViPER: Video-based Perceiver for Emotion Recognition

  • Lorenzo Vaiani
  • Moreno La Quatra
  • Luca Cagliero
  • Paolo Garza

Recognizing human emotions from videos requires a deep understanding of the underlying multimodal sources, including images, audio, and text. Since the input data sources are highly variable across different modality combinations, leveraging multiple modalities often requires ad hoc fusion networks. To predict the emotional arousal of a person reacting to a given video clip, we present ViPER, a multimodal architecture leveraging a modality-agnostic transformer-based model to combine video frames, audio recordings, and textual annotations. Specifically, it relies on a modality-agnostic late fusion network which makes ViPER easily adaptable to different modalities. The experiments carried out on the Hume-Reaction dataset of the MuSe-Reaction challenge confirm the effectiveness of the proposed approach.

Emotional Reaction Analysis based on Multi-Label Graph Convolutional Networks and Dynamic Facial Expression Recognition Transformer

  • Kexin Wang
  • Zheng Lian
  • Licai Sun
  • Bin Liu
  • Jianhua Tao
  • Yin Fan

Automatically predicting and understanding human emotional reactions have wide applications in human-computer interaction. In this paper, we present our solutions to the MuSe-Reaction sub-challenge in MuSe 2022. The task of this sub-challenge is to predict the intensity of 7 emotional expressions from human reactions to a wide range of emotionally evocative stimuli. Specifically, we design an end-to-end model, which is composed of a Spatio-Temporal Transformer for dynamic facial representation learning and a multi-label graph convolutional network for emotion dependency modeling. We also explore the effects of a temporal model with a variety of features from the acoustic and visual modalities. Our proposed method achieves a mean Pearson's correlation coefficient of 0.3375 on the test set of MuSe-Reaction, which outperforms the baseline system (i.e., 0.2801) by a large margin.
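The multi-label graph convolutional component can be pictured along the lines of ML-GCN: label embeddings are propagated over a label correlation graph (learnable in this sketch) to produce per-emotion classifier weights that are applied to a clip-level feature. The code below is an interpretation under that assumption, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LabelGCN(nn.Module):
    """Sketch of a multi-label GCN: propagate label embeddings over a label
    correlation graph to obtain per-emotion classifier weights (cf. ML-GCN)."""
    def __init__(self, num_labels: int = 7, label_dim: int = 64, feat_dim: int = 256):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, label_dim))
        self.adj = nn.Parameter(torch.eye(num_labels))      # learnable correlation graph
        self.gc1 = nn.Linear(label_dim, 128)
        self.gc2 = nn.Linear(128, feat_dim)

    def forward(self, clip_feat):                   # clip_feat: (batch, feat_dim)
        a = torch.softmax(self.adj, dim=-1)         # normalised adjacency
        h = torch.relu(a @ self.gc1(self.label_emb))
        w = a @ self.gc2(h)                         # per-label weights: (labels, feat_dim)
        return clip_feat @ w.t()                    # intensity logits: (batch, labels)

clip_features = torch.randn(8, 256)                 # e.g. from a spatio-temporal encoder
intensities = torch.sigmoid(LabelGCN()(clip_features))
print(intensities.shape)    # torch.Size([8, 7])
```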

Hybrid Multimodal Feature Extraction, Mining and Fusion for Sentiment Analysis

  • Jia Li
  • Ziyang Zhang
  • Junjie Lang
  • Yueqi Jiang
  • Liuwei An
  • Peng Zou
  • Yangyang Xu
  • Sheng Gao
  • Jie Lin
  • Chunxiao Fan
  • Xiao Sun
  • Meng Wang

In this paper, we present our solutions for the Multimodal Sentiment Analysis Challenge (MuSe) 2022, which includes the MuSe-Humor, MuSe-Reaction, and MuSe-Stress sub-challenges. MuSe 2022 focuses on humor detection, emotional reactions, and multimodal emotional stress, utilising different modalities and data sets. In our work, different kinds of multimodal features are extracted, including acoustic, visual, text, and biological features. These features are fused by TEMMA- and GRU-based frameworks with a self-attention mechanism. In this paper, 1) several new audio features, facial expression features, and paragraph-level text embeddings are extracted for accuracy improvement; 2) we substantially improve the accuracy and reliability of multimodal sentiment prediction by mining and blending the multimodal features; 3) effective data augmentation strategies are applied in model training to alleviate the problem of sample imbalance and prevent the model from learning biased subject characteristics. For the MuSe-Humor sub-challenge, our model obtains an AUC score of 0.8932. For the MuSe-Reaction sub-challenge, the Pearson's Correlation Coefficient of our approach on the test set is 0.3879, which outperforms all other participants. For the MuSe-Stress sub-challenge, our approach outperforms the baseline in both arousal and valence on the test set, reaching a final combined result of 0.5151.

SESSION: Session 2: Multimodal and Audio-based Sentiment Analysis

Transformer-based Non-Verbal Emotion Recognition: Exploring Model Portability across Speakers' Genders

  • Lorenzo Vaiani
  • Alkis Koudounas
  • Moreno La Quatra
  • Luca Cagliero
  • Paolo Garza
  • Elena Baralis

Recognizing emotions in non-verbal audio tracks requires a deep understanding of their underlying features. Traditional classifiers relying on excitation, prosodic, and vocal tract features are not always capable of effectively generalizing across speakers' genders. In the ComParE 2022 vocalisation sub-challenge, we explore the use of a Transformer architecture trained on contrastive audio examples. We leverage augmented data to learn robust non-verbal emotion classifiers. We also investigate the impact of different audio transformations, including neural voice conversion, on the classifier's capability to generalize across speakers' genders. The empirical findings indicate that neural voice conversion is beneficial in the pretraining phase, yielding improved model generality, whereas it is harmful at the fine-tuning stage, as it hinders model specialization for the task of non-verbal emotion recognition.

Bridging the Gap: End-to-End Domain Adaptation for Emotional Vocalization Classification using Adversarial Learning

  • Dominik Schiller
  • Silvan Mertes
  • Pol van Rijn
  • Elisabeth André

Good classification performance on a hold-out partition can only be expected if the data distribution of the test data matches that of the training data. However, in many real-life use cases, this constraint is not met. In this work, we explore whether it is feasible to use existing methods of adversarial domain transfer to bridge this inter-domain gap. To do so, we use a CycleGAN that was trained to convert between the domains. We demonstrate that the quality of the generated data has a substantial impact on the effectiveness of the domain adaptation, and propose an additional step to overcome this problem. To evaluate the approach, we classify emotions in female and male vocalizations. Furthermore, we show that our model successfully approximates the distribution of acoustic features and that our approach can be employed to improve emotion classification performance. Since the presented approach is domain- and feature-independent, it can be applied to any classification task.

Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition

  • Yang Wu
  • Zhenyu Zhang
  • Pai Peng
  • Yanyan Zhao
  • Bing Qin

Multi-modal emotion recognition aims to recognize emotion states from multi-modal inputs. Existing end-to-end models typically fuse the uni-modal representations in the last layers without leveraging the multi-modal interactions among the intermediate representations. In this paper, we propose the multi-modal Recurrent Intermediate-Layer Aggregation (RILA) model to explore the effectiveness of leveraging the multi-modal interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition. At the heart of our model is the Intermediate-Representation Fusion Module (IRFM), which consists of a multi-modal aggregation gating module and a multi-modal token attention module. Specifically, at each layer, we first use the multi-modal aggregation gating module to capture the utterance-level interactions across the modalities and layers. Then we utilize the multi-modal token attention module to leverage the token-level multi-modal interactions. The experimental results on IEMOCAP and CMU-MOSEI show that our model achieves state-of-the-art performance, benefiting from fully exploiting the multi-modal interactions among the intermediate representations.
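To make the layer-wise aggregation idea concrete, the sketch below gates, per intermediate layer, between utterance-level poolings of two pre-trained encoders and then averages over layers; it is a simplified stand-in for the described IRFM (the token-level attention module is omitted), with layer counts and dimensions assumed:

```python
import torch
import torch.nn as nn

class GatedLayerAggregation(nn.Module):
    """Sketch of aggregating the intermediate layers of two pre-trained encoders
    with an utterance-level gate per layer (layer counts/sizes are illustrative)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, audio_layers, text_layers):
        # audio_layers / text_layers: lists of (batch, time, dim) hidden states,
        # e.g. the per-layer outputs of an audio and a text transformer.
        fused = []
        for a, t in zip(audio_layers, text_layers):
            a_utt, t_utt = a.mean(dim=1), t.mean(dim=1)          # utterance-level pooling
            g = torch.sigmoid(self.gate(torch.cat([a_utt, t_utt], dim=-1)))
            fused.append(g * a_utt + (1 - g) * t_utt)            # gated cross-modal mix
        return torch.stack(fused).mean(dim=0)                    # aggregate over layers

audio_states = [torch.randn(2, 100, 768) for _ in range(12)]
text_states  = [torch.randn(2, 30, 768) for _ in range(12)]
out = GatedLayerAggregation()(audio_states, text_states)
print(out.shape)   # torch.Size([2, 768])
```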